Scattergrams

Now let’s extend the comparison so that we are comparing several items, not just two. In this case, we won’t have to presume that there must be a correlation – we will be able to see whether there is one or not! Here is a table showing the results of two examinations set to students that I teach. I set them a maths exam and an English exam and record the scores that they get in both:

Student    Maths score    English score
John       72             78
Betty      65             70
Sarah      80             81
Peter      36             31
Fiona      50             55
Charlie    21             29
Tim        79             74
Gerry      64             64
Martine    44             47
Rachel     55             53

We take a piece of graph paper and draw two axes. The horizontal axis will represent the score on the maths exam. The vertical axis will represent the score on the English exam. For each student, we then mark a small dot at the co-ordinates representing their two scores. Below, I have done this:

You can see that the points follow a fairly strong pattern. People who are good at maths tend to be good at English as well. The marks lie fairly close to an imaginary straight line that we can draw on the graph. In the diagram below, I have drawn in this straight line, and also included another point (in red) which I will explain later:

When the points lie close to a straight line like this, we say that there is a strong correlation. When the line slopes upwards to the right, indicating that the English mark tends to increase as the maths mark increases, we say that the correlation is positive.

Line of Best Fit (Regression Line)

The straight line that we draw through the points is called either the line of best fit or the regression line. It describes the relationship between the two variables (the quantities compared) mathematically. There is a standard way to draw this line to ensure that it fits as closely to the data points as possible. Later on, we will investigate exactly what that mathematical way is. For now, we only have to remember one thing:

The regression line goes through the point whose co-ordinates are the mean values of the variables

The arithmetic means are found by adding the relevant scores, and dividing by 10. This is because there are ten students in the table. We work out the arithmetic mean of the maths scores …

and we can be sure that the line must go through the point (56.6, 58.2), where 56.6 is the mean maths score and 58.2 is the mean English score. This is the point marked in red on the graph above. You will notice that there are roughly the same number of data points lying above this line as there are below it.
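The two means, and the rule that the line passes through them, can be checked with a short sketch. The text has not yet said how the line is calculated, so the `least_squares_line` helper below is an assumption: it uses the ordinary least-squares method, one standard choice, and may not be exactly the line drawn on the graph.

```python
# Scores copied from the table above, in the order John ... Rachel.
maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
english = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]


def mean(values):
    """Arithmetic mean: add the values and divide by how many there are."""
    return sum(values) / len(values)


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line y = slope * x + intercept.

    Assumption: ordinary least squares, which the text has not yet introduced.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


mean_maths = mean(maths)      # total of the maths scores divided by 10
mean_english = mean(english)  # total of the English scores divided by 10
slope, intercept = least_squares_line(maths, english)

# Height of the fitted line at the mean maths score.
english_on_line = slope * mean_maths + intercept

print(round(mean_maths, 1), round(mean_english, 1))
print(abs(english_on_line - mean_english) < 1e-9)
```

The first line printed gives the two means, and the second confirms that a line fitted this way passes through the point formed by them.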

We can use the regression line to make predictions. For instance, what English mark would we expect someone to receive if they received a maths mark of 30? If we look at the straight line, we can see that when the maths mark is 30, the English mark is approximately 28. Similarly, we can assume that anyone who got an English mark of 40 would also get a maths mark of about 40. However, there are limits on the predictions that we can make, as you will see later on.

Negative Correlation

In the following table, I have duplicated the maths marks for the ten students and this time added the number of absences from maths lessons for each student:

Student    Maths score    Absences
John       72             4
Betty      65             6
Sarah      80             0
Peter      36             13
Fiona      50             8
Charlie    21             15
Tim        79             2
Gerry      64             3
Martine    44             9
Rachel     55             5

In this case, the scattergram looks like this. I have added the regression line. Again, there is a good correlation between the maths scores and the absences from maths lessons, except that as the number of absences increases, the maths score goes down. This is referred to as negative correlation. Again, we can use the line of best fit to make predictions. What score would a student have received if he had been absent 10 times? According to the graph, it would have been about 41. If a student received a mark of 30, how many times would you expect him to have been absent? From the graph, it seems to be about 13 times.
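The readings taken from the graph can be approximated in code. This is a minimal sketch that assumes the line is an ordinary least-squares fit with the maths score on the horizontal axis and the absences on the vertical axis. The function names are illustrative, and the results should come out close to, but not necessarily identical to, values read by eye from the graph.

```python
# Data copied from the table above, in the order John ... Rachel.
maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
absences = [4, 6, 0, 13, 8, 15, 2, 3, 9, 5]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Absences (vertical axis) as a straight-line function of the maths score.
slope, intercept = least_squares_line(maths, absences)


def absences_for_score(score):
    """Read the line upwards from a maths score on the horizontal axis."""
    return slope * score + intercept


def score_for_absences(n):
    """Read the same line across from a number of absences."""
    return (n - intercept) / slope


print("slope is negative:", slope < 0)
print("score for 10 absences:", round(score_for_absences(10), 1))
print("absences for a score of 30:", round(absences_for_score(30), 1))
```

Both questions are answered from the same line, just as both are read from the same graph: one reading goes from the horizontal axis up to the line, the other from the vertical axis across to it.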

However, this graph shows well the limitations of making predictions. What score would someone have received if they had been absent for all 30 maths lessons? According to the graph, the score would be less than zero! Similarly, how many times would a student have had to be absent in order to gain a score of 90? Well, the line hits the horizontal axis when the score is just over 80, so in order to get a score of 90, a student would have to be absent a negative number of times. Clearly, these conclusions are stupid, and they lead us to another general principle:

You can only use linear regression to draw conclusions about values within the range of the data points themselves. You might just be able to get away with drawing conclusions about values just outside that range, but the further away from the data range you move, the less reliable the conclusions become!
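One way to respect this principle in code is to check whether a value lies inside the range of the data before trusting a prediction. The sketch below again assumes an ordinary least-squares line, and `in_data_range` is a hypothetical helper written for this illustration.

```python
# Data copied from the maths score and absences table, John ... Rachel.
maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
absences = [4, 6, 0, 13, 8, 15, 2, 3, 9, 5]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


slope, intercept = least_squares_line(maths, absences)


def score_for_absences(n):
    """Maths score at which the line reaches n absences."""
    return (n - intercept) / slope


def in_data_range(value, data):
    """True if value lies between the smallest and largest observed values."""
    return min(data) <= value <= max(data)


for n in (10, 30):
    where = "inside" if in_data_range(n, absences) else "outside"
    print(n, "absences ->", round(score_for_absences(n), 1), f"({where} the data range)")

# A score of 90 is above every observed score, and the line gives a negative count.
print("absences for a score of 90:", round(slope * 90 + intercept, 1))
```

The prediction for 10 absences falls inside the observed range of 0 to 15 absences. The predictions for 30 absences and for a score of 90 do not, and both produce impossible negative values.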

No Correlation

Finally, one more table, this time showing the English marks compared with the average length of time the students spend travelling to college each morning, recorded in minutes.

Student    English score    Travel time (minutes)
John       78               12
Betty      70               32
Sarah      81               19
Peter      31               31
Fiona      55               30
Charlie    29               15
Tim        74               22
Gerry      64               10
Martine    47               17
Rachel     53               16

In this case, the scattergram shows no particular pattern. It is clear that we can’t draw a straight line anywhere near the data points, and we say that there is no correlation between the length of time taken to travel to college and the final English mark that a student gets. We cannot predict the English mark of any student based on how long it takes him to get to college. Nor can we predict how long it takes a student to get to college given that student’s English mark.
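The difference between the three scattergrams can also be expressed as a number. The text does not introduce one, so the following is only a sketch using Pearson's correlation coefficient, a common measure that lies between -1 and +1. Values near +1 indicate a strong positive correlation, values near -1 a strong negative correlation, and values near 0 little or no linear correlation.

```python
from math import sqrt

# Data copied from the three tables, in the order John ... Rachel.
maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
english = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]
absences = [4, 6, 0, 13, 8, 15, 2, 3, 9, 5]
travel_minutes = [12, 32, 19, 31, 30, 15, 22, 10, 17, 16]


def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r_maths_english = pearson_r(maths, english)
r_maths_absences = pearson_r(maths, absences)
r_english_travel = pearson_r(english, travel_minutes)

print("maths vs English:      ", round(r_maths_english, 2))
print("maths vs absences:     ", round(r_maths_absences, 2))
print("English vs travel time:", round(r_english_travel, 2))
```

On these three tables the first value comes out close to +1, the second close to -1, and the third close to 0, which matches what the scattergrams show.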

Non-linear Correlations

A bus company wanted to discover if there was any relationship between the number of buses it ran and the number of complaints it received. It carried out a survey recording the average number of buses per hour on different days, and the number of complaints that it received on those days. Here are the results:

As you can see, there is a negative correlation between the number of buses per hour and the number of complaints, but in this case, a curved line fits the data better than a straight line. We are about to investigate the rule that lets you fit a straight line to the data points – it is enough to say at this point that similar rules exist which let you fit various curved lines to the data points as well.