I'll
Give You a Definite MaybeAn Introductory Handbook on
Probability, Statistics, and Excel

[This
handbook has been prepared by Ian Johnston of Malaspina University-College for
students in Liberal Studies. This text is in the public domain, released May
2000, and may be used, in whole or in part, without charge and without
permission, by anyone]

One
important job of statistics is the comparison between two different sets of
information about apparently different things between which there may be a
connection. For example, we might want to compare how information about people's
income compares with the information about their education, in order to explore
the claim that the more education one has, the greater the probability of a
better income. Or, alternatively, we might wish to compare information about
poverty with information about crime, to see if there is a possible link between
the two, or, again, compare information about cigarette smoking habits with
information about particular health problems (e.g., heart disease).

These
sorts of studies involve comparing two variables (e.g., income and crime,
smoking habits and health) in order to see if there might be some connection and
perhaps even a suggestion of cause. As a cigarette smoking habit rises, do
health problems also rise? As income decreases, does the frequency of crime
increase? As people grow older to they become less or more tolerant of others?

What
we are looking for in dealing with such questions statistically is what is
called a correlation, an apparent connection between the different values in one
set of data and in a second set of data, so that as the values in the first set
increases, the values in the second set also tend to increase (or, in the case
of a negative correlation, to decrease).

Here's
a simple example. In an English course, once the instructor has marked a set of
class essays, she has a record of the marks, one for each student. Then she
assigns a second essay and marks it. Now, she has a record of the marks for two
essays written by the same students.

The
instructor might wish to know whether the students who did well in the first
essay also tended to do quite well in the second essay, whether the students who
did well in the first essay tended to do badly in the second essay, or whether
there was no apparent connection between the results of the first essay and the
results of the second essay (such information would be a useful test of the
value of the first essay as a predictor of success in the second). In other
words, she might like to see whether there is any significant correlation
between the two sets of marks.

Common
sense suggests to us that some sets of results should be quite closely
correlated. For instance, we would expect the results students received in
English 111 to be positively correlated to the results in English 112 (so that
students who did well in the first should tend to do well in the second;
students who fared poorly in the first course should tend to fare poorly in the
second). Similarly, we would expect people who exercise regularly to have a
healthier cardiovascular system than those who do not or those who are wealthy
to spend more on consumer goods, and so on.

But
in some cases our common sense expectations may be wrong, and often we may be
interested, not just in whether there is a correlation or not, but in how strong
that correlation might be.

B.
Positive, Negative, and No Correlation

In
considering correlation we recognize three distinct possibilities. The two sets
of variables (that is, the two sets of data we are comparing, like exercise and
cardiovascular health) may show a positive correlation. That is, as the values
in the first set of data rise, the values in the second set of data tend to rise
as well (i.e., those who exercise regularly have a greater cardiovascular
health). Or the two sets of data may display a negative correlation. That is,
those who score higher on the first set of results tend to get lower scores on
the second set of results (e.g., the number of alcoholic drinks one has consumed
and one's ability to carry out a manual dexterity test). Or, finally, there may
be no correlation: the two sets of data do not appear to have any relationship.

Correlations,
as we shall see, can be weak or strong, that is, the relationship between the
two sets of values may be really significant or may be slight. A
correlation can also be perfect, that is, every change in one set of values may
show a exact corresponding change in the second set of values (the paragraphs
below contain some examples).

Correlation
is an extremely important analytical tool which enables us to begin to sort out
claims about important connections, which may or may not be true: the amount of
smoking and the incidence of lung cancer, HIV infection and the onset of AIDS,
the age of a car and its value, television programming of playoff games and
attendance at lectures, poverty and crime, IQ tests and income levels,
intelligence and heredity, age and mechanical skills, and so on. People make
claims about such matters all the time. The principle of correlation enables us
to investigate such claims in order to understand whether they are true or not
and, if true, just what the strength of that relationship might be.

In
the following section we will work our way through a few examples, using Excel
to do all the difficult work. At this point make sure you understand just what
the concept of correlation (positive, negative, zero) means.

C.
Self-Test on Correlation Terms

Indicate
for each of the pairs of variables listed below what common sense tells you to
expect in the way of an overall correlation between the two sets (positive,
negative, perfect, zero):

1.
height in centimetres and height in inches;
2. levels of income and amount of spending on consumer goods;
3. the age of a car and its value;
4. two scores thrown on two dice simultaneously;
5. snowstorms and attendance at college classes;
6. ability to see in the dark and the amount of carrots eaten;
7. daily cigarette consumption and cardiovascular health;
8. heights and weights of elementary school children;
9. weekly consumption of calories in food and drink and body weight;
10. age and physical strength in senior citizens.

D.
Working with an Example

Suppose
we wish to explore the claim that people's political participation increases
with the number of years they spend in school. We think this claim is true, but
we wish to substantiate it. In other words, we wish to see whether there is a
correlation between the participation in the political process and years of
formal education.

First,
we collect data from each person in the study. We devise a test to measure a
person's participation in the political process; we have people provide the
appropriate information on this variable, and we ask them the tell us their
years of formal schooling. This collection yields the following information:

Respondent

Political Participation Score

Years of Formal Schooling

Respondent

Political Participation Score

Years of Formal Schooling

A

1

5

K

7

13

B

2

8

L

8

14

C

2

7

M

8

15

D

3

9

N

9

15

E

3

10

O

7

13

F

4

11

P

10

16

G

5

11

Q

9

16

H

6

11

R

8

14

I

5

12

S

9

13

J

6

12

T

9

14

From
a glance at the figures, we might conclude that there seems to be some
relationship, for the higher scores in the middle column appear to be
accompanied, in general, by higher scores in the right-hand column. However, we
want to be sure about this point and not simply rely upon visual impressions
from such a table.

E.
Creating a Scatter Diagram in Excel

One
standard way visually to express the correlation between two sets of variables
is to draw a diagram in which each result is plotted on a standard X-Y graph.
The X-Axis represents the value of one variable and the Y-Axis the value of the
other variable. Each score in the set (in our example from Respondent A to
Respondent T) is plotted on the graph, so that a distinct point is located for
each member in the data set according to the two numerical values associated
with each respondent.

For
example, for the above data, we can construct a graph in which the value for the
Years in School is plotted on the X-Axis and the value for Political
Participation is plotted on the Y-Axis. Thus, we can locate each respondent's
position exactly on the graph. We will come up with twenty points.

The
following chart (created by Excel) illustrates the results. Notice that there
are 20 points plotted and that each corresponds to a particular Respondent's
scores in the two columns of the table above.

Notice
that two people have 16 years of formal schooling (two values on the vertical
line drawn through 16 on the X-Axis). Four respondents are shown on the graph
with a Political Participation Score of 9, two with a Political Participation
Score of 8, two with a Political Participation Score of 2, and so on. We call
such an illustration a scatter diagram or scatter plot. Make sure you are
familiar with what this graph represents and how to read it. Do not proceed
until you feel very comfortable with this visual presentation of information.

Look
for a moment at the overall shape of all the points in the chart. It seems to be
characterized by a generally linear cluster which rises upward and to the right.
This shape suggests strongly that as the Y values rise (i.e., go vertically
upward), the X values increase also (i.e., move to the right). This point is not
true for every single point in the cluster, of course, but as an observation
about the overall general shape of all the points, that claim seems to hold.
Thus, these results seem to show a positive correlation: the higher the value
for the years of formal schooling, the higher the score on the Political
Participation test—that seems to be the general trend of the entire range of
respondents (even if some of the plotted points do not follow that overall
trend).

F.
The Line of Best Fit (Regression Line)

One
way to emphasize the overall orientation of all the plotted points in the
scatter diagram is to draw through the plotted data a straight line which comes
as close as possible to all of the plotted points. Such a line is called the
line of best fit (or the regression line). If we do that for the above graph
(and Excel will draw such a line for us), we can recognize immediately the
overall orientation of the results as sloping upward and to the right.

Notice
that this chart presents the same data as the last chart, except that there is a
plotted straight line (of square points) through the points plotted from the
data we collected from the respondents (the diamond shaped points). This line,
which is derived from the plotted points, is the straight line which comes
closest to all the points which correspond to the measured data. It is called
the line of best fit or the regression line.

We
can see from this line of best fit that it very clearly slopes upward and to the
right. This confirms our sense from the overall shape of the cluster that it
indicates a positive correlation: an upward slope clearly demonstrates that as
the Y values increase the X values tend to increase.

We
can use the regression line, once we have drawn it, as a general predictor. That
is, with the line in place, we can use it to read off values for cases not
included in the study. For example, if we want to know what this data reveals
about the probable political participation score someone with 14 years of formal
schooling, we can read directly up from the value 14 on the X-axis, see where it
meets the regression line, and read horizontally across to the Y-axis. That
value in this chart is (about) 7.8. Thus, on the basis of this study, we can
offer a prediction that someone with 14 years of formal schooling will probably
have about 7.8 as a political participation score. Obviously, this is not a
sure-fire guarantee of such a score; it is, however, something of an educated
guess.

Using
the regression line in this manner obviously yields only approximate results,
and it may in some cases be misleading if we rely on it too heavily (you notice,
for example, that most of the plotted points, which represent the hard data, the
actual people in the study, do not fall directly on the line). It is, at best, a
rough guide in a study of the sort we have been considering.

A
Common Misuse of the Regression Line

It
is often very misleading to use the regression line to make predictions outside
the range of the data. For instance, I can extrapolate (extend) the straight
line, once I have drawn it, and then make predictions well beyond the plotted
points (say, about the political participation of someone with 21 years of
schooling). This is often an illegitimate procedure leading to very wrong
conclusions (as we shall see from a couple of examples).

We
can draw in a Regression Line by hand, visually estimating where it should come.
This is obviously a somewhat subjective procedure, and different people drawing
the line will come with slightly different results. Excel, however, will plot
the line for us very quickly (as it did for the above diagram). We will be
reviewing that procedure later.

Please
remember always that a positive correlation does not mean that there will no
exceptions to the overall trend. There may be one or more results which do not
fit the overall trend of the entire collection of data. Correlation deals with
the general tendency of the entire collection of data.

G.
An Example of a Perfect Correlation

Here
is another set of measurements. This time the middle column represents the
student's scores in percentages for class participation, and the column on the
right represents the same score out of 20.

Name of Student

Participation Mark (100)

Participation Mark (20)

Name of Student

Participation Mark (100)

Participation Mark (20)

ab

65

13

hi

72

14.4

bc

90

18

ij

82

16.4

cd

83

16.6

jk

94

18.8

de

74

14.8

kl

50

10

ef

84

16.8

lm

75

15

fg

78

15.6

mn

58

11.6

gh

60

12

If
we plot these points on a scatter diagram as before, putting the Participation
Mark (100) on the X-axis and the Participation Mark (25) on the Y-axis, then we
get the following diagram:

Notice
here that the general shape of the plotted points moves upward and to the right—indicating
once more a positive correlation. In this example, there is an added
observation: the data we plotted forms a perfect straight line, with every
plotted point from the data we collected falling exactly on the same straight
line. Clearly, if we wanted to draw a line of best fit here, it would coincide
exactly with all the data. Such a result produces what we call a perfect
positive correlation: every increase in the value of the X-axis brings about an
exactly corresponding increase in the Y-axis value. We would expect this result,
of course, because the values are basically the same measurement. A similar
result would occur if, for example, we plotted the weight of ten students in
grams against the weight of the same students in pounds or a range of
temperature readings in Celsius against the same readings in Fahrenheit.

Here
is another pair of sets of variables. The first indicates the distance traveled
in a new prototype car through the desert at an even speed. The second column
indicates the amount of gasoline left in the car's gas tank at each of the
inspection points.

Place

Distance from Home (Miles)

Gasoline Left in the Tank
(Gallons)

Place

Distance from Home

Gasoline Left in the Tank
(Gallons)

Home

0

15

Deadman

425

6.5

Podunk

50

14

No Mo Gas

540

4.2

Buzzard Breath

150

12

Last Chance

680

1.4

Paradise

275

9.5

Hope Gone

700

1

Last Hope

350

8

Nowhere Town

750

0

Once
again we can illustrate these values in a scatter plot diagram to observe the
nature of the correlation:

Here
we see again that we are dealing with a perfect correlation, because all the
plotted points fall on the same straight line. But this time the line slopes
downward and to the right. This indicates that the highest values on the Y Axis
are associated with the lowest values on the X-Axis, or that as one value
increases the other decreases (as the distance increases the gasoline left in
the tank decreases). This example illustrates a perfect negative correlation.

Just
as in the first example (about political participation scores) we can have a
positive correlation which is not exact, so we can find a negative correlation
which is not exact. In such a case the plotted points will not all fall on the
same straight line, but the general shape of the cluster (and the regression
line we draw through them) will slope downward and to the right.

For
example, if we plotted a set of figures indicating the amount of alcohol
consumed against a second set of figures indicating success in a simple test of
physical dexterity, we would expect to get a negative correlation: the more
drinks consumed, the lower the score in the physical dexterity test. Here is an
example of such a scatter plot.

The
shape of this data in the scatter diagram indicates what we would expect, a
negative correlation. As the values on the X-Axis (the number of drinks)
increases, the value on the Y-Axis (the physical dexterity score) decreases

H.
Zero Correlation

When
there is no correlation at all between the two sets of variables, then we will
have trouble recognizing an upward or downward overall shape to the plotted
data, and the line of best fit will fall somewhere in a horizontal position
among the plotted points.

For
example, here are two sets of variables: the first lists the number of letters
in the name of geographical location in Canada, and the second lists its
longitude (approximately). We want to know if there is a correlation between how
many letters there are in a Canadian place name and its geographical location.
Our common sense suggests that there should not be a correlation, but we just
want to check

Name

Number of Letters in the Name

Longitude

Name

Number of Letters in the Name

Longitude

Prince Rupert

12

130

Biggar

6

108

Courtenay

9

125

Pickle Crow

10

90

Kirkland Lake

12

80

Williams Lake

12

122

Brandon

7

100

Yarmouth

8

66

Glace Bay

8

60

Sherbrooke

10

72

Clarenville

11

54

Oba

3

84

Dundas

6

80

Calgary

7

114

Lac Ste Jean

10

72

If
we plot these points on an X-Y graph to produce a scatter diagram, we can then
inspect the distribution and draw in a line of best fit. The plotted data looks
as follows:

Try
to estimate where you would draw the regression line through these plotted
points. Notice that one cannot so easily produce one like those we have already
drawn, that is, with a distinct slope indicating a positive or negative
correlation. And, thus, based on the very small sample we chose, from a visual
inspection of the plots, there does not seem to any relationship between the
number of letters in a Canadian place name and its approximate longitude.

The
scatter plot diagrams and the regression line give us a general visual idea of
what the correlation is between two sets of variables. In many cases, however,
we require a more accurate measure, so that we can compare one correlation with
another. In other words, knowing that a correlation exists is valuable; more
valuable, however, is to know the size of that correlation. Such a mathematical
measure of the correlation between two sets of variables is called the
Correlation Coefficient. It is most commonly symbolized by the letter r.

The
following paragraph outlines how the Correlation Coefficient is calculated.
However, this method is something Excel will do for us, once the data is
entered. The method is listed here for interest:

1.
First we transform the scores in each of the sets of data into z-scores.
Remember that a z-score is a measure of how far any particular score is
from the mean of the entire set and that the units of z-scores are
standard deviations (i.e., a z-score of 2.5 means that this particular value is
2.5 standard deviations above the mean; a z-score of -1.8 means that this
value falls 1.8 standard deviations below the mean)

2. We
then multiply together the corresponding z-scores in each list (i.e., the z-score for one measurement is multiplied by the z-score for the
corresponding measurement from the other set of data. Remember that a
positive number multiplied by a negative number produces a negative result; a
negative number multiplied by a negative number produces a positive result.

3.Then,
we add up all the results produced by Step 2 above.

4. Finally,
we divide the figure obtained in Step 3 by the total number of pairs of scores
(i.e., get the average of figures derived in Step 2.This process will always
produce a number between -1.00 and +1.00. This number is called the correlation
coefficient (symbolized by the letter r). If r = -1, then the correlation
is a perfect negative correlation; if r = 1, then the correlation is a
perfect positive correlation. If r = 0, then the correlation is zero (no
relationship between the variables).

J.
Exercise in Calculating the Correlation Coefficient with Excel

Here's
an exercise to demonstrate a practical application of correlation and to show
how Excel will do all the mathematics for us.

Suppose,
as teachers or employers, we are interested in whether a particular diagnostic
test we have been using is a good predictor of success. In other words, does the
mark on an entry-level test give us reliable information about how well a
particular person will do in, say, an academic program? One way of answering
this question is to use correlation.

Here
is a chart of data we collected on ten subjects (not a large enough sample, but
it will enable us to go through the procedures for calculating correlation).
Each one wrote the same diagnostic examination to get into a course, and each
one wrote the same final examination eight months later.

Name

Diagnostic Score

Final Mark (100)

Name

Diagnostic Mark

Final Mark (100)

ab

76

80

fg

64

70

bc

85

76

gh

85

83

cd

86

62

hi

89

84

de

75

84

ij

71

70

ef

84

87

jk

87

80

Enter
this data onto a new Excel worksheet, putting the results of the diagnostic test
in Column A (from A1 to A10) and the results on the Final Mark in Column B (from
B1 to B10). To get Excel to conduct a correlation analysis on the data, carry
out the following steps.

1.
First, point and click on the Tools option on the top line, and from the
drop-down menu select Data Analysis, which is normally the last item on the
menu. Note that if Data Analysis does not appear you will have to add that to
the Tools option (using the process explained in a previous section, in Part J). When the
Data Analysis menu appears left click the mouse on the option Correlation.

2. You
will then get a box asking you for the Input range and the Output range. The
Input range is the group of cells you want in the analysis (in this example the
ten cells in Column A and the ten cells in Column B). Enter this information in
the Input Range box, not forgetting the dollar signs which indicate a range:
$A$1:$B$10; this command is telling Excel to include in the analysis the
rectangular block of cells defined by A1 and B10 at the corners. You do not have
to enter the range manually (by typing in the figures), if you click the left
mouse button with the cursor in the Input Range box and then select the entire
table of data, the range figures ($A$1:$B$10) will appear. If you have put the
names in column A and the headings in row 1, do not select them for analysis.
Select only the numbers.

3.
Then, click the left mouse button in the circle to the left of the label Output
Range (unless there is a dark dot in the circle already to indicate that that
option has been selected). Then click the left mouse pointer in the Output Range
box (the long horizontal blank rectangle). Once you have clicked the mouse
button in that box, enter the name of the cell where you want the data to appear
(once again the top left corner of the table you are about to generate). Let's
choose cell D1. So in the Output Range box type in $D$1. Then click on OK. Once
again, you do not have to type that in the Output Range box, if you click the
left mouse button with the cursor in the Output Range box and then select cell
D1.

4.
After a moment, on the worksheet a small table will appear with its top left
hand corner in Cell D1. This table indicates the correlation between the
columns. In the case of our example, the correlation between Column 1
(Diagnostic Score) and Column 2 (Final Mark) is 0.31 (or r = 0.31). Thus, in
this case the mathematical calculation carried out by Excel reveals a small
positive correlation between the results on our diagnostic test and the results
on the final mark.

Notice
in the small table we generated in this exercise that the correlation of Column
1 and Column 1 is 1 (i.e., perfect), as we would expect, since we are comparing
that column with itself. The same is true for the correlation between Column 2
and Column 2.

K.
The Strength and Significance of a Correlation Coefficient

Now
that we have a mathematical value for our correlation coefficient (r = 0.31),
how are we to interpret that? Does this mean that our diagnostic test is a good
one, that we can rely on it?

As
mentioned earlier, the strongest positive correlation is 1.0, and the closer our
r value is to 1.0 the stronger the correlation between the two sets of values we
are analyzing. The closer to 0 our r value, the weaker the correlation. The
following general categories indicate a quick way of interpreting our calculated
r value:

Our
result of 0.31 is thus very weak or low; as an evaluative tool our diagnostic
test is not all that useful, since the results on it do not reveal very much
about the future results in the final marks. There is some positive correlation,
but not enough to make the diagnostic reliable as a predictor (say, for
counseling students or alerting the teacher to potential problem students).

One
useful rule of thumb for estimating the importance of the r value is to
calculate the square the correlation coefficient (i.e., calculate r2). This
squared result will give us a rough percentage for the amount of variation in
the final result which is directly attributable to the other variable.

For
example, suppose our diagnostic test was a good measure of writing ability. The
correlation between that writing test and the final marks is 0.31. If we square
this value, we get the value 0.096, or 96 in 1000 or 9.6 percent. On the basis
of this, we can claim that 9.6 percent of the students' success in the course is
attributable to the writing skills they had at the very start of it (i.e., what
we measured in the our diagnostic test). We would then have to conclude that
success is this course is not very dependent upon the writing skills the
students have upon entry.

Suppose
now we administer another diagnostic test at the very start of the course, this
time measuring reading ability. And we discover when the course is over that the
correlation between the results of that diagnostic test and the final marks for
the course is 0.8 (r = 0.8). If we square 0.8, we get 0.64 or 64 percent. We can
then claim that 64 percent of the students' success in the course is directly
attributable to the reading skills they possessed (and which we measured) when
they started. Obviously, this conclusion assumes that the diagnostic test is an
excellent measure of the students' reading ability.

It
will be apparent that correlation analysis is a very useful tool for
investigating all sorts of claims. We may often think that one particular skill,
quality, or value has a direct and important effect upon another (i.e., years in
school and income at age forty, frequency of smoking and incidence of heart
disease, success in a particular test and success in something different, and so
on). Correlation analysis enables us to test such claims and to provide some
quantifiable measure to them.

L.
Some More Examples

Here
is another common example of the usefulness of correlation. Suppose we are
interested in dealing with a question of heredity: Is there a relationship
between the height of mothers and the heights of their sons and, if so, how
strong is it? Well, we carry out a series of measurements of the heights of
mothers and sons and conduct a correlation analysis of the results. Suppose we
discover that the correlation coefficient in our study is 0.7 (i.e., r = 0.7).
If we square this figure, we get 0.49, or 49 percent.

This
means that 49 percent of the variation in the sons' heights can be attributed to
the heights of their mothers. This correlation is moderate and indicates that
predicting the height of a son on the basis of his mother's height is moderately
successful. The remaining 51 percent of the factors influencing the sons' height
come from elsewhere (e.g., the fathers, environment, food, and so on).

Suppose
further that we now conduct a correlation study with the same sons, but this
time analyzing the relationship between their heights and their fathers'
heights. We find that r = 0.8. If we square this we get 0.64. We can thus
estimate that 64 percent of the sons' variation in height is attributable to
their fathers' heights.

You
might be drawn into some confusion here: if 64 percent of the sons' heights is
attributable to the fathers' height and 49 percent to the mothers' heights, then
we have more than 100 percent. How can that be? Well, in some cases the heights
of the mother and the father act together in determining the height of the
child. Statistically speaking, the heights of the fathers and the heights of the
mothers are also correlated, since people tend to marry people closer to their
own height—or taller women tend to marry taller men and vice versa, and
shorter women tend to marry shorter men, and vice versa. So the total effect is
more complex than just adding together the two separate percentage
contributions.

M.
Self-Test on Excel's Correlation Function

We
want to track the progress of a class of ten students in a particular class of
English 111, in order to ascertain if our ways of predicting success in the
course are useful. We have a record of their Grade XII marks, and we administer
a diagnostic reading test in the first class. At the end of the semester, we
have the final marks of the students. We now wish to know which of the two
assessments—the Grade XII mark or the Reading Test Mark—provides a better
guide to the students' success in the English 111 class. We will assume for the
sake of this exercise that the sample is an accurate one (even though it is too
small)

.Here
is the raw data, presented in tabular form. Copy this information onto a fresh
Excel worksheet, putting the names in Column A, the Grade XII results in Column
B, the Diagnostic Test Marks in Column C, and the Final Marks in Column D.

Name

Grade XII Mark (100)

Diagnostic Reading Test Mark
(100)

Final Mark (100)

ab

75

75

75

bc

72

69

76

cd

82

76

83

de

78

77

65

ef

86

79

85

fg

76

65

79

gh

86

82

65

hi

89

78

75

ij

83

70

80

jk

65

71

70

When
you have entered the information, select the three numbered columns (do not
include any cells with words in them). Go to the Data Analysis option on the
Tools menu, select from that Data Analysis menu the item Correlation (note, once
again, that if the Data Analysis option is not on the Tools menu you have to add
it in).

When
you get the Correlation menu, enter in the Input Range the block of cells you
wish to analyze (i.e., from B2 to D11, if you have used Column A for the names
and Row 1 for the titles). Do not forget to put in the dollar signs.

Then
click the mouse pointer in the circle to the left of the Output Range label
(unless there is a black dot in it already), and click the left mouse button in
the Output Range box. Then enter the name of cell where you want the top left
corner of the correlation table to appear (e.g., $A$13). Then click OK.

After
a second or two, the Correlation Table should appear giving you the correlation
between all the different pairs of data. We are interested in the correlation
between Column B (the first column in the Table) and Column D (the third column
in the table) and between Column C (the second column in the Table) and Column
D. Which of these two is the better predictor of success according to this
study. How reliable is it? For the answers see the paragraphs at the end of this
section.

Using
this procedure we can examine the correlations between several sets of data
entered on an Excel worksheet. For example, if we have all the marks for all the
Liberal Studies assignments entered on a single sheet, together with the final
mark, we can simultaneously calculate the correlation between the marks for any
particular set of assignments and the final result. This would, among other
things, enable us to estimate which assignments were the best and worst
predictors of success in the program.

N.
A Strong Caution

It
is very important to grasp the point that a correlation, even a very strong
correlation, does not enable us immediately to make a conclusion about
causation. If, for example, we find a very high correlation between the number
of years people spend in post-secondary education and their income at age 40, we
may make some predictions about income at age 40 based on years of
post-secondary education or we may urge people to stay in university because the
more they study at university the higher their eventual income will tend to be,
but we should be aware that the correlation, in itself, is no proof of these
assertions.

This
is a vitally important principle: correlation is not necessarily a proof of
causation. It indicates a relationship which may be based on cause and effect,
but it may not be. If A is a major cause of B, then we can expect that
variations in A will cause changes in B (i.e., there will be a correlation). The
reverse, however is not necessarily true. If X and Y are correlated, we cannot
automatically assume that X is the cause of Y.

This
issue is an important point of contention in the political disputes about AIDS.
There is a very high correlation between HIV infection and the occurrence of
AIDS, and thus many researchers have from the start assumed that HIV is the
principal cause of AIDS. On the basis of this assumption, most of the research
money for AIDS has gone into investigating HIV. Yet it is still not clear what
causes AIDS, and some people (especially those suffering from AIDS) have argued
very strongly that investigating HIV instead of AIDS is a mistake: the real
cause is something else, something which requires much more money going directly
into research on AIDS rather than into HIV. The social, political, medical, and
financial consequences of this argument are substantial.

Suppose
we do find a significant positive correlation between two variables, X and Y
(for example between provincial government expenditures on education and the
average income of citizens), we recognize that there may be four different
reasons why this relationship exists:

1.
First, X may indeed cause Y. That is, the fact that the provincial government
spends more on education is indeed the reason that more citizens stay in
university longer and get higher paying jobs when they graduate. Thus, the
average income of the province increases.

2.
Secondly, however, the result may be just chance. This is quite possible if we
are sloppy about our sampling (we haven't talked about that). But if we make
many observations and sample the population correctly and repeatedly, this
reason for the correlation becomes very unlikely (the more the tests the less
likely that the results occur by chance).

3.
Thirdly, there may be some third factor we have not taken into account which
produces the variation in both X and Y and which is the real cause of the
correlation. This conclusion seems to be emerging in the analysis of AIDS and
HIV (1).

4.
Finally, there may be a causal connection which is responsible for the
correlation, but we may have put it the wrong way around. We might conclude from
the high correlation between the government's expenditures on post-secondary
education and average family income that the expenditures on education are
causing that increase in income. But it may be the case the higher family
incomes are caused by something else; they are providing the government with
increased tax revenue, which the province (for any number of reasons) has
decided to spend on post-secondary education (thus making education more
accessible and affordable and, hence, more popular).

Does
this ambiguity in interpretation mean, then, that correlation studies are of
limited value? By no means. It simply reminds us that we have to be careful
about the conclusions we draw from a correlation study. Remember this important
point: every time a particular factor (A) is a cause of something else (B) there
will be a correlation between them. Thus, we can use correlation as a negative
test. If there is no correlation between two variables, we can assume that one
of them is not a cause of the other. If there is a correlation we have strong
evidence that one of them may be the cause of the other and that we should,
therefore, investigate further.

O.
Self-Test on Correlation Charting and Calculations

Here
are the numerical results of a hypothetical study in which various skills
measured independently are to be analyzed for a possible correlation with annual
earnings among a group of salesmen in similar industries. Enter these figures
onto an Excel worksheet, and generate a correlation table for all the data.
Which of the four measured qualities has the strongest positive correlation with
the annual salary?

Name of Salesman

Public Speaking

Writing Ability

Math Skills

General Knowledge

Annual Earnings

ab

76

64

53

77

80,000

bc

56

64

89

75

60,000

cd

77

70

60

70

72,000

de

90

95

62

82

92,000

ef

68

68

88

68

65,000

fg

83

85

61

85

90,000

gh

61

60

79

85

54,000

hi

79

73

82

68

80,000

ij

64

83

59

81

65,000

jk

88

62

89

66

79,000

When
you have discovered the quality which appears to have the strongest correlation,
produce a scatter plot diagram to illustrate the relationship between this
quality and the annual salary (you will be dealing with only two columns of
data: the figures for the column representing the highest correlation and the
annual earnings).

If
you wish to produce a scatter plot with a regression line, then follow the
procedure given below:

1.
Select Tools and then Data Analysis. And from the options select Regression. In
the Input Y Range Box, enter the range for one column of scores. In the Input X
Range, enter the range for the other set of scores.

2.
Click the circle to the left of the label Output Range (so that there is a black
dot in it), then enter into the Output Range box, the cell where you want the
top left corner of the data to appear.

3.
Make sure you then you select Line Fit Plots by moving the cursor to the square
box beside the label and clicking the left mouse button once. There should now
be a check mark in that box. Then click OK.

4.
Now a very complex looking table will appear. If you move to the right of the
worksheet, you will see a small chart. Select the chart and enlarge it. You will
notice that the chart is a scatter plot diagram with a regression line plotted
through the points (similar to the diagram in Section F above).

P.
Transferring an Excel Diagram to Another Windows Application

When
you have created a chart in Excel, you can, as mentioned above, print the chart
directly from Excel. However, in many cases you may want to transfer the chart
into WordPerfect or Word document, so that the illustration appears in the
middle of your own text (as in this module).

This
procedure is fairly easy to do, so long as both Excel and the word processing
program are in Windows. Here is an outline of the simplest procedure.

Create
or call up the chart in Excel. Make sure the chart has all the shadings and
headings and borders and legends you want. Once the chart has been formatted to
your satisfaction, select the entire chart by the method outlined earlier (so
that the outer perimeter of the chart is demarcated by small squares). You can
alter elements of the chart once it is in your word processing document, but the
options are more limited and sometimes more difficult to carry out than in
Excel. So make sure the chart looks just as you want it to in the word
processing document before you transfer it.

Important
Warning

Before
you start trying to move material (e.g., charts) from one document to another,
make sure you save the Excel worksheet which contains the chart. It is not
uncommon for mistakes to occur the first couple of attempts, and you will save
yourself considerable time if you can go back to the original Excel chart and
start the transfer again, rather than having to generate the chart all over
again.

Then,
with the chart selected, point the mouse arrow onto the Edit option at the top
of the screen, click the left mouse button, and from the drop-down menu select
Copy. Once you do that, you will notice that the chart now has a flashing dotted
line around it (indicating that it has been selected for copying).

Now,
place the mouse arrow directly on the button with a negative (minus) sign in the
very top left of your Excel window (just above and to the left of the word
File). From the drop-down menu, select Minimize. Your Excel document will then
disappear, and a small Excel icon will appear on the bottom left of your screen,
with the name of the Excel file containing the chart under the icon. Whenever
you want to return to the Excel chart, all you have to do is click the mouse on
this Excel icon.

Now
call up the word processing program you want from the Windows Applications menu.
Once in the right program, call up the document in which you wish to place the
chart, and move through the document until you have the exact place where you
wish the chart to appear. Place the cursor in the exact place.

Then
from the main menu at the top, select Edit, and from the drop-down menu select
Paste. The chart should now appear, with the upper left hand corner in the spot
where you placed the cursor. Note that if you have asked for the chart to be
placed on a page where there is insufficient room for it, the Paste command will
place the diagram on the next page (i.e., the diagram will not go over a page
break).

Sometimes
in the transfer from Excel to the word processing program, minor formatting
problems may arise. Inspect the chart carefully once you have transferred it. If
everything is as it should be, fine; if not, then you may need to edit the chart
further in the word processing program. To put the diagram into the edit mode,
double click the left mouse button with the mouse pointer on the diagram. In the
edit mode you can make adjustments to the chart.

For
instance, a common occurrence when you move an Excel diagram into Word is that
the units of the X-axis become aligned incorrectly. You can deal with this in
the Word diagram edit mode by selecting each figure on the X-axis and moving it
back to the position you want it. To move a number, simply point to it with the
mouse, hold the left button down. The number will change to a dotted rectangle
which (with the left mouse button still held down) you can move up or down, to
the left or to the right, as you wish.

In
the same way, you can alter the text in the diagram once it is in your word
processing document. When the diagram is suitable and you wish to exit from the
edit mode, return to the normal document via the file menu on the diagram edit
program (not the main file menu at the very top).

Note
that in your text you can alter the size and the position of your chart. If you
point the mouse indicator onto the diagram and click once, you will see that the
chart has a perimeter line with some squares half way along each side. If you
position the mouse indicator on one of those squares, the indicator will turn
into a double arrow; press down the left mouse button and drag the side of the
chart in the direction to want (to enlarge or reduce its size)

To
centre the chart on the page, select the diagram, and choose the centre command,
either from the Format-Paragraph option or from the centre button on the
toolbar.

To
move the chart to another place in the document, select the entire chart and
then use the cut and paste function.

Q.
Answers to Self-Test Questions

Self-Test
On Excel's Correlation Function (Section M)

The
correlation between the Grade XII Mark and the Final Mark is 0.23; the
correlation between the Diagnostic Reading Test and the Final Mark is -0.28.
Thus, the Diagnostic Reading Test has a negative correlation; the Grade XII Mark
a positive correlation with the Final Mark, the negative correlation being
slightly stronger than the positive correlation. However, both figures are so
low, that the correlation is minimal. The skills measured by the Grade XII marks
account for about 5 percent of the skills measured by the Final Mark.

Self-Test
on Correlation Charting and Calculations

The
correlation between the first column (Public Speaking) and Annual Earnings is
0.91, an extremely high reading. The scatter plot and regression line for these
two variables looks something like the following:

Notes
to Section Four

(1)
This point was the subject of a very famous and controversial study of education
and income carried out by Christopher Jencks. He analyzed the common claim that
income at age 40 is directly correlated to the number of years of schooling (a
very strong case for staying in school). Jencks argued that that correlation
was, in fact, produced by a much more important factor, namely, the income of
one's parents. People whose parents had more money tended to stay in school
longer and tended to have higher incomes at age 40. Jencks concluded that the
common claim about staying in school increasing one's earning power (i.e.,
causing an increase in income) was unsubstantiated. The real cause was parents'
income. Not surprisingly, the book (Inequality) was very hotly debated. [Back
to Text]