Regression

Sometimes in statistics we need to compare 2 sets of data from the same source. We can do this by means of a scatter diagram. To plot a scatter diagram for 2 sets of data x and y, we plot each pair of corresponding values as a point. Data like this is called bivariate data.

Example:

The following table gives the test results for the first 2 tests on S1, probability and discrete random variables, for 12 sixth form students...

Student:      1    2    3    4    5    6    7    8    9   10   11   12
Prob (x):    65   88   83   92   50   67  100  100   73   90   83   94
D.R.V (y):   52   57   78   76   30   67   96   74   65   87   78   89

To plot the scatter diagram, we plot each student's probability score on the x-axis against their discrete random variables score on the y-axis.

So we plot (65, 52), (88, 57) and so on for all 12 students.
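As a minimal sketch in Python (the names prob, drv and points are illustrative, not part of the original notes), the 12 points to plot can be built by pairing each student's two scores:

```python
# Scores from the table above, in student order 1 to 12.
prob = [65, 88, 83, 92, 50, 67, 100, 100, 73, 90, 83, 94]  # x: probability test
drv = [52, 57, 78, 76, 30, 67, 96, 74, 65, 87, 78, 89]     # y: discrete random variables test

# Each student contributes one (x, y) point to the scatter diagram.
points = list(zip(prob, drv))

for student, (x, y) in enumerate(points, start=1):
    print(f"Student {student}: ({x}, {y})")
```

The resulting list can be passed to any plotting tool; the pairing step is the only part that matters here.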

As you can see from the diagram, there appears to be a trend in the scatter. The points seem to lie along the same diagonal line; this is called the 'line of best fit'. There are, of course, some obvious exceptions which lie a little too far from the line (these are ringed on the scatter diagram).

The values are given by

Remember: ∑ means 'sum of'.

In this example,

If all (or nearly all) of these points seem to lie in a straight line then there is said to be a linear correlation between x and y. (Correlation is like a link or connection.)

If our scatter diagram shows a correlation between the 2 sets of data then we can add a line of regression. Regression lines are much like 'lines of best fit' (as seen above) but are calculated more accurately. However, unlike at GCSE, if we have a fair degree of scatter we often draw or calculate 2 regression lines.

These lines are:

Regression line y on x

Used to estimate y, taking x to be accurate. This line is found by minimising the sum of the squares of the vertical distances from the points to the line. Let's look at the following diagram to explain this...

The vertical distance from each point to the line is squared, and the squares are added together. The line that gives the smallest total is the regression line y on x.
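The idea of "squaring and adding the vertical distances" can be sketched in Python as follows. The function name sum_squared_vertical and the three sample points are made up for illustration; the two candidate lines are arbitrary and neither is claimed to be the regression line.

```python
def sum_squared_vertical(points, a, b):
    """Total of the squared vertical distances from each point to the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in points)

# Three illustrative points (not taken from the test-score data).
points = [(1, 2), (2, 3), (3, 5)]

# Compare two candidate lines: the one with the smaller total is the
# better fit in the "y on x" sense.
total_1 = sum_squared_vertical(points, a=0.0, b=1.5)  # the line y = 1.5x
total_2 = sum_squared_vertical(points, a=1.0, b=1.0)  # the line y = 1 + x

print(total_1, total_2)
```

The regression line y on x is the particular choice of a and b that makes this total as small as possible. For x on y, the same calculation is done with horizontal distances instead.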

Regression line x on y

Used to estimate x, taking y to be accurate. This line is found by minimising the sum of the squares of the horizontal distances from the points to the line. The following diagram will explain this further...

The horizontal distance from each point to the line is squared, and the squares are added together. The line that gives the smallest total is the regression line x on y.
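For reference, the standard least-squares results for the two lines can be written as below. The symbols a, b for the y on x line and a', b' for the x on y line are assumed here to match the notation used later in these notes; n is the number of data pairs, and x̄ and ȳ are the means of x and y.

```latex
\begin{align*}
S_{xx} &= \sum x^2 - \frac{(\sum x)^2}{n}, &
S_{yy} &= \sum y^2 - \frac{(\sum y)^2}{n}, &
S_{xy} &= \sum xy - \frac{\sum x \sum y}{n} \\[6pt]
\text{y on x:}\quad y &= a + bx, &
b &= \frac{S_{xy}}{S_{xx}}, &
a &= \bar{y} - b\bar{x} \\[6pt]
\text{x on y:}\quad x &= a' + b'y, &
b' &= \frac{S_{xy}}{S_{yy}}, &
a' &= \bar{x} - b'\bar{y}
\end{align*}
```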

Note the similarities between Sxy, Sxx and Syy in these formulae. The formulae may look tricky but in actual fact are quite easy and straightforward to use. The following example will demonstrate this.

Example:

We will use the data seen earlier of the test results for the first 2 tests on S1, probability and discrete random variables, for 12 sixth form students. We will then calculate the regression lines x on y, and y on x.
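A minimal Python sketch of this calculation is given below. It assumes the standard least-squares results b = Sxy/Sxx, a = ȳ − b x̄ for y on x and b' = Sxy/Syy, a' = x̄ − b' ȳ for x on y, and it computes Sxx, Syy and Sxy in their equivalent "deviation from the mean" form. The variable names (a_dash for a', and so on) are illustrative.

```python
# Test scores for the 12 students, in student order.
x = [65, 88, 83, 92, 50, 67, 100, 100, 73, 90, 83, 94]  # probability
y = [52, 57, 78, 76, 30, 67, 96, 74, 65, 87, 78, 89]    # discrete random variables

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Deviation form of Sxx, Syy and Sxy (equivalent to the sum form).
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Regression line y on x: y = a + b*x
b = s_xy / s_xx
a = y_bar - b * x_bar

# Regression line x on y: x = a_dash + b_dash*y
b_dash = s_xy / s_yy
a_dash = x_bar - b_dash * y_bar

print(f"y on x: y = {a:.2f} + {b:.3f}x")
print(f"x on y: x = {a_dash:.2f} + {b_dash:.3f}y")
```

Both lines pass through the mean point (x̄, ȳ), which is a useful check when drawing them on the scatter diagram.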

With the above data, x looks to be controlled, whereas y appears to depend on both the experiment and x. In this case, we say that x is an independent variable and y is a dependent variable. As x appears controlled and accurate, we only need to calculate the regression line y on x.

In the example above we were given the raw data, from which we calculated our summarised data. We did this by adding extra columns to our raw data and then systematically working out each value of x², y² and xy before totalling each one.
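The "extra columns" method can be sketched in Python like this, using the 12 students' test scores. The names x_sq, y_sq, xy and summary are illustrative.

```python
# Raw data: test scores for the 12 students, in student order.
x = [65, 88, 83, 92, 50, 67, 100, 100, 73, 90, 83, 94]  # probability
y = [52, 57, 78, 76, 30, 67, 96, 74, 65, 87, 78, 89]    # discrete random variables

# The extra columns: one x², y² and xy value per student.
x_sq = [xi ** 2 for xi in x]
y_sq = [yi ** 2 for yi in y]
xy = [xi * yi for xi, yi in zip(x, y)]

# The summarised data is the number of pairs plus the total of each column.
summary = {
    "n": len(x),
    "sum_x": sum(x),
    "sum_y": sum(y),
    "sum_x_sq": sum(x_sq),
    "sum_y_sq": sum(y_sq),
    "sum_xy": sum(xy),
}

for name, value in summary.items():
    print(f"{name} = {value}")
```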

This method is not only time consuming but incredibly boring!

The quickest way to deal with raw data is to enter it straight into a calculator and let the calculator work out not only the summarised data, but also the values of a and b, or a′ and b′.

Your calculator will need to be in linear regression mode. You will need to work out for yourselves (with help from your teachers) how to use this function as most calculators use different keying sequences.

Warning: In A-level questions you could be given raw data or summarised data. This means you should know how to use your calculator for raw data, but also have a working knowledge of the formulae. Do not just rely on the calculator method.
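When only summarised data is given, the formulae have to be applied directly. The sketch below assumes the standard results Sxx = Σx² − (Σx)²/n, Sxy = Σxy − ΣxΣy/n, b = Sxy/Sxx and a = ȳ − b x̄. The function name regression_y_on_x is hypothetical, and the totals passed in are those of the 12 students' test scores.

```python
def regression_y_on_x(n, sum_x, sum_y, sum_x_sq, sum_xy):
    """Return (a, b) for the line y = a + b*x, using summarised data only."""
    s_xx = sum_x_sq - sum_x ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    b = s_xy / s_xx
    a = sum_y / n - b * (sum_x / n)  # a = y_bar - b * x_bar
    return a, b

# Totals for the 12 students' probability (x) and D.R.V (y) scores.
a, b = regression_y_on_x(n=12, sum_x=985, sum_y=849, sum_x_sq=83465, sum_xy=72266)

print(f"y on x: y = {a:.2f} + {b:.3f}x")
```

The x on y line follows the same pattern with the roles of x and y swapped, so that Σy² is needed in place of Σx².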