Scatterplot, Correlation, and Regression on TI-89

Summary:
When you have a set of (x,y) data points and want to
find the best equation to describe them, you are
performing a regression.
You will learn how to find
the strength of the association between your two variables
(correlation coefficient), and how to find the
line of best fit
(least squares regression line).

Usually you have some idea that your x variable
can help predict your y variable, so you
call x the explanatory variable and y the
response variable. (Other names are
independent variable and dependent variable.)

Step 0. Setup

The calculator will remember this setting when you turn it
off: next time you can start with Step 1.

Step 1. Make the Scatterplot

Before you even run a regression, you should first plot the points
and see whether they seem to lie along a straight
line. If the
distribution is obviously not a straight line, don’t do a linear
regression. (Some other form of regression might still be appropriate,
but that is outside the scope of this course.)

Let’s use this example from
Sullivan 2011 [full citation at https://BrownMath.com/swt/sources.htm#so_Sullivan2011], page 179: the
distance a golf ball travels versus the speed with which the club head
hit it.

Club-head speed, mph (x):   100   102   103   101   105   100    99   105
Distance, yards (y):        257   264   274   266   277   263   258   275

Turn off other plots.

[◆] [APPS] and select
Stats/List Editor.

[F2] [3] [F2] [4] turns off all plots and functions.

Enter the numbers in two statistics lists.

You will use two named lists for the x’s and
y’s. Any names are possible, but I’ll use
lx and ly because they’re short.
If those lists already exist, highlight the lx name
and press [CLEAR] [ENTER] to erase previous entries.
If lx isn’t there yet, move to an empty list
heading and press [L] [X]. (L is above the 4 key. When you
press 4 while naming a list, it will change to L automatically.)

Note: You can hide an unwanted list by cursoring to the
list name and pressing [◆] [←] (DEL). The list
remains in memory until you use [2nd] [−] (VAR-LINK) to delete
it.

Set up the scatterplot.

[F2] [1] [F1] opens a dialog box. You want these
settings:

Plot type: Scatter

Mark: anything except dot (because a data dot looks just like
a dot on the grid)

X: [alpha] [L] [X]

Y: [alpha] [L] [Y]

Use Freq and categories: NO

Press [ENTER] to complete the definition.

Plot the points.

[F5] automatically adjusts the window
frame to fit the data.

(optional) You can adjust the grid to look better.

[◆] [F2] (WINDOW), set Xscl=1
and Yscl=5, then [◆] [F3] (GRAPH) to
redisplay it.

Appropriate values of Xscl and Yscl may
be different for other problems. Pick the values that make the
graph look best to you.

Check your data entry by tracing the points.

[F3] shows you the first (x,y) pair, and then
[►] shows you the others. They’re shown
in the order you entered them, not necessarily from left to right.

A scatterplot on paper needs labels (numbers)
and titles on both axes; the x and y axes typically won’t start
at 0. Here’s the plot for this data set. (The horizontal lines
aren’t needed when you plot on graph paper.)

When the same (x,y) pair occurs multiple times,
plot the second one slightly offset. This is called jitter.
An example will be shown in class.
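As a rough illustration of jitter, here is a minimal Python sketch. The function name jitter, the offset of 0.15, and the choice to shift repeats to the right are all assumptions made for this example; any small, visible offset serves the same purpose.

```python
from collections import Counter

def jitter(points, offset=0.15):
    """Return the points with repeats of the same (x, y) pair shifted slightly.

    The first occurrence stays where it is; the k-th repeat is moved
    k * offset to the right so that it stays visible on the plot.
    """
    seen = Counter()
    plotted = []
    for x, y in points:
        k = seen[(x, y)]              # how many times this pair was already plotted
        plotted.append((x + k * offset, y))
        seen[(x, y)] += 1
    return plotted

# Hypothetical data containing one repeated pair, (100, 257).
sample = [(100, 257), (102, 264), (100, 257)]
print(jitter(sample))
```

Only the plotted position changes. The data used in any calculation stay exactly as measured.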

If the data points don’t seem to follow a straight line
reasonably well, STOP! Your calculator will obey you if you
tell it to perform a linear regression, but if the points don’t
actually fit a straight line then it’s a case of “garbage
in, garbage out.”

For instance, consider this example from
DeVeaux, Velleman and Bock 2009 [full citation at https://BrownMath.com/swt/sources.htm#so_DeVeaux2009], page 179. This is a table of recommended f/stops for
various shutter speeds for a digital camera:

Shutter speed (x):   1/1000   1/500   1/250   1/125   1/60   1/30   1/15   1/8
f/stop (y):             2.8       4     5.6       8     11     16     22    32

If you try plotting these numbers yourself, enter the shutter speeds
as fractions for accuracy: don’t convert them to decimals
yourself. The calculator will show you only a few decimal places, but
it maintains much greater precision internally.

You can see from the plot at right that these data don’t
fit a straight line. There is a distinct bend near the left. When you
have anything with a curve or bend, linear regression is wrong.
You can try other forms of regression in your calculator’s menu,
or you can transform the data as described in DeVeaux 2009 [full citation at https://BrownMath.com/swt/sources.htm#so_DeVeaux2009],
Chapter 10, and other textbooks.
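One way to see the bend in numbers is to look at the slope between each pair of neighboring points: along a straight line those slopes would all be about the same. The Python sketch below does that for the raw data and again after taking logarithms of both variables. The log-log transformation is only an assumption chosen for illustration, not necessarily the one the textbook recommends, and consecutive_slopes is a helper name made up for this example.

```python
import math
from fractions import Fraction

# Shutter speeds entered as exact fractions, then converted for arithmetic.
speeds = [float(Fraction(1, d)) for d in (1000, 500, 250, 125, 60, 30, 15, 8)]
fstops = [2.8, 4, 5.6, 8, 11, 16, 22, 32]

def consecutive_slopes(xs, ys):
    """Slope of the segment joining each point to the next one."""
    return [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
            for i in range(len(xs) - 1)]

raw = consecutive_slopes(speeds, fstops)
logged = consecutive_slopes([math.log(s) for s in speeds],
                            [math.log(f) for f in fstops])

print([round(s) for s in raw])        # slopes between neighbors, raw data
print([round(s, 2) for s in logged])  # slopes between neighbors, log-log data
```

In the raw data the slopes shrink steadily from left to right (from roughly 1200 down to roughly 170), which is the bend. After the transformation they all stay near 0.5, so a straight line is far more plausible for the transformed data.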

Step 2. Perform the Regression

Set up to calculate statistics.

[◆] [APPS] and select Stats/List
Editor.

[F4] [3] [2] brings up the LinReg(ax+b) dialog box. You
want these settings:

X list: [alpha] [L] [X]

Y list: [alpha] [L] [Y]

Store RegEqn to: [►] and select
y1(x)

Freq: 1

Category List: (leave blank)

Include Categories: (leave blank)

Press [ENTER] to perform the regression and paste
the regression equation into Y1.

Show your work! Write
LinReg(ax+b) plus the two lists and the y-variable that
you’re using. Just “LinReg” isn’t enough.

Write down a (slope), b (y intercept), R² (coefficient of
determination), and r (correlation coefficient).
(Four decimal places for slope and intercept, and two for r and
R², is a decent rule of thumb.)

a = 3.1661, b = −55.7966

R² = 0.88, r = 0.94
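If you want to check the calculator's output independently, the same four numbers can be reproduced in a few lines of Python. This is a sketch using the ordinary least squares formulas and only the standard library; the variable names are my own, and it is not a description of how the TI-89 computes its results internally.

```python
import math

speed = [100, 102, 103, 101, 105, 100, 99, 105]        # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # y, yards

n = len(speed)
x_bar = sum(speed) / n
y_bar = sum(distance) / n

# Sums of squares and cross products about the means.
sxx = sum((x - x_bar) ** 2 for x in speed)
syy = sum((y - y_bar) ** 2 for y in distance)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))

a = sxy / sxx                       # slope
b = y_bar - a * x_bar               # y intercept
r = sxy / math.sqrt(sxx * syy)      # correlation coefficient

print(f"a = {a:.4f}, b = {b:.4f}")        # a = 3.1661, b = -55.7966
print(f"R² = {r * r:.2f}, r = {r:.2f}")   # R² = 0.88, r = 0.94
```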

Correlation Coefficient, r

“Several sets of (x,y) [pairs], with the correlation coefficient
for each set. Note that correlation reflects the noisiness and
direction of a linear relationship (top row), but not the slope of
that relationship (middle), nor many aspects of nonlinear
relationships (bottom).”
source:
Wikipedia article

Look first at r, the coefficient of linear correlation.
r can range from −1 to +1
and measures the strength of the association between x and y.
A positive correlation or
positive association means that y tends to increase as x
increases, and a negative correlation or negative association
means that y tends to decrease as x increases.
The closer r is to 1 or −1, the stronger the association.
We usually round r to two decimal places.

For real-world data, the 0.94 that we got is a pretty strong
correlation. But
you might wonder whether there’s actually an association between
club-head speed and distance traveled, as opposed to just an
apparent correlation in this sample.
Decision Points for Correlation Coefficient
shows you how to answer that question.

Be careful in your interpretation!
No matter how strong your r might be, say that changes in the y variable are
associated with changes in the x variable, not
“caused by” it.
“Correlation is not causation” is your mantra.

It’s easy to think of associations where there is no cause.
For example, if you make a scatterplot of US cities with x as number of books
in the public library and y as number of murders, you’ll see
a positive association: number of murders tends to be higher in cities
with more library books. Does that mean that reading causes people to
commit murder, or that murderers read more than other people? Of
course not! There is a lurking variable here: population of
the city.

When you have a positive or negative association, there are
four possibilities: x might cause changes in y, y might cause changes
in x, lurking variables might cause changes in both, or it could just
be coincidence, a random sample that happens to show a strong
association even though the population does not.

Though nobody ever computes r by hand any more, the formula explains
the properties of r. To compute r, find the z
scores of all the x’s and y’s, multiply z_x
times z_y for each data point, add up all the products, and
divide the total by n−1. The second formula is
equivalent but a little easier: Find the means and standard deviations
of the set of x’s and the set of y’s. For each data point,
multiply x−x̅ by y−y̅.
Add up those products and divide by n−1 times the
standard deviations.
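Written out in standard notation, the two computations just described are as follows. This is a reconstruction from the verbal description, with s_x and s_y standing for the sample standard deviations of the x's and y's.

```latex
r \;=\; \frac{1}{n-1}\sum z_x z_y
  \;=\; \frac{\sum (x-\bar{x})(y-\bar{y})}{(n-1)\, s_x\, s_y},
\qquad
z_x = \frac{x-\bar{x}}{s_x},\quad z_y = \frac{y-\bar{y}}{s_y}
```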

z-scores are pure
numbers without units, and therefore r also has no units.
You can interchange the x’s and y’s in the formula without
changing the result, and therefore r is the same regardless of
which variable is x and which is y.
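Both properties are easy to confirm numerically. The sketch below computes r from z-scores for the golf data, then swaps the variables, then converts the distances from yards to meters. The helper names z_scores and corr are made up for this example.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize a list: subtract the mean, divide by the sample SD."""
    m, s = mean(values), stdev(values)    # stdev uses the n-1 divisor
    return [(v - m) / s for v in values]

def corr(xs, ys):
    """Correlation coefficient as the sum of z-score products over n-1."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(p * q for p, q in zip(zx, zy)) / (len(xs) - 1)

speed = [100, 102, 103, 101, 105, 100, 99, 105]        # mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # yards
meters = [0.9144 * d for d in distance]                # same distances in meters

r_xy = corr(speed, distance)
r_yx = corr(distance, speed)     # x and y interchanged
r_m = corr(speed, meters)        # different units for y

print(round(r_xy, 2), round(r_yx, 2), round(r_m, 2))   # 0.94 0.94 0.94
```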

Why is r positive when data points trend up to the
right and negative when they trend down to the right? The
product (x−x̅)(y−y̅) explains this.
When points trend up to the right, most are in the lower left and
upper right quadrants of the plot. In the lower left,
x and y
are both below average, x−x̅ and
y−y̅ are both negative, and the product is
positive. In the upper right, x and y are both above average,
x−x̅ and y−y̅ are both
positive, and the product is positive. The product is positive for
most points, and therefore r is positive when the trend is up to
the right.

On the other hand, if the data trend down to the right, most
points are in the upper left (where x is below average
and y is above average, x−x̅ is negative,
y−y̅ is positive, and the product is
negative) and the lower right (where
x−x̅ is positive, y−y̅
is negative, and the product is negative.) Since the product is negative
for most points, r is negative when data trend down to the
right.
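You can watch this happen with the golf data, which trend up to the right. The short Python sketch below (variable names are my own) computes the product for each point and counts the signs.

```python
speed = [100, 102, 103, 101, 105, 100, 99, 105]        # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # y, yards

x_bar = sum(speed) / len(speed)
y_bar = sum(distance) / len(distance)

# One product (x - x-bar)(y - y-bar) per data point.
products = [(x - x_bar) * (y - y_bar) for x, y in zip(speed, distance)]

positive = sum(p > 0 for p in products)
negative = sum(p < 0 for p in products)

print(positive, negative, sum(products))   # 7 1 116.75
```

Seven of the eight products are positive, and the single negative one is small, so the total is positive and so is r.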

Regression Line, ŷ = ax+b

Write the equation of the line using ŷ, not y, to
indicate that this is a prediction. b is the
y intercept, and a is the slope.
We’ll round both of them to four decimal places, so
write the equation of the line as

ŷ = 3.1661x − 55.7966

(Don’t write 3.1661x + −55.7966.)

These numbers can be interpreted pretty easily. Business
majors will recognize them as intercept = fixed cost and
slope = variable cost, but you can interpret them in non-business
contexts just as well.

The slope, a, tells
how much ŷ increases or decreases for a one-unit increase in x.
In this
case, your interpretation is
“the ball travels about an extra 3.17 yards when the club speed
is 1 mph greater.” The sign of a is always the same as the sign of
r. (A negative slope would mean
that y decreases that many units for every one unit increase
in x.)

The intercept, b, says where the regression line crosses the
y axis: it’s the value of ŷ when x is 0. Be
careful!
The y intercept may or may not be meaningful.
In this case, a club-head speed of zero is
not meaningful. In general, when the measured x values don’t
include 0 or don’t at least come pretty close to it, you
can’t assign a real-world interpretation to the intercept.
In this case you’d say something like
“the intercept of −55.7966 has no physical interpretation
because a club-head speed of zero is meaningless for striking
a golf ball.”
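The two interpretations can be illustrated with a small Python sketch that uses the rounded coefficients from the regression equation. The names predict and within_data are made up for this example, and treating the range of measured speeds (99 to 105 mph) as the trustworthy range is a simplifying assumption.

```python
A, B = 3.1661, -55.7966      # slope and intercept, rounded as in the equation
X_MIN, X_MAX = 99, 105       # smallest and largest measured club-head speeds

def predict(speed_mph):
    """Predicted distance in yards (y-hat) for a given club-head speed."""
    return A * speed_mph + B

def within_data(speed_mph):
    """True if the speed lies inside the range that was actually measured."""
    return X_MIN <= speed_mph <= X_MAX

# Slope: one extra mph adds about 3.17 yards to the prediction.
print(round(predict(101) - predict(100), 4))

# Intercept: the line can be evaluated at x = 0, but the result is meaningless.
print(predict(0), within_data(0))
```

The arithmetic works for any x, which is exactly why you have to decide for yourself whether the x you plug in makes sense.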

Here’s an example where the y intercept does
have a physical meaning. Suppose you measure the gross weight of a
UPS truck (y) with various numbers of packages (x) in it, and you get
the regression equation ŷ = 2.17x+2463. The slope,
2.17, is the average weight per package, and the y intercept,
2463, is the weight of the empty truck.

The slope (a or m or b₁) and
y intercept (b or b₀) of the regression line can be
calculated from formulas, if you have a lot of time on your hands:
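In standard notation the least squares formulas can be written as below. This is a reconstruction, and other references may arrange them differently; s_x and s_y are the sample standard deviations and r is the correlation coefficient.

```latex
a \;=\; \frac{n\sum xy \;-\; \sum x \sum y}{n\sum x^{2} \;-\; \left(\sum x\right)^{2}}
  \;=\; r\,\frac{s_y}{s_x},
\qquad
b \;=\; \bar{y} - a\,\bar{x}
```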

Traditionally, calculus is used to come up with those equations,
but all that’s really necessary is some algebra. See
Least Squares — the Gory Details
if you’d like to know more.

The second formula for the slope is kind of
neat because it connects the slope, the correlation coefficient, and
the SD of the two variables.
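That connection, slope = r times (SD of y) divided by (SD of x), can be checked numerically. The Python sketch below computes the slope two ways for the golf data: once from raw sums, using the usual textbook computational formula (assumed here), and once from r and the two standard deviations.

```python
from statistics import mean, stdev

speed = [100, 102, 103, 101, 105, 100, 99, 105]        # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # y, yards
n = len(speed)

# Slope from raw sums.
sum_x, sum_y = sum(speed), sum(distance)
sum_xy = sum(x * y for x, y in zip(speed, distance))
sum_x2 = sum(x * x for x in speed)
a_sums = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Slope from r and the standard deviations.
x_bar, y_bar = mean(speed), mean(distance)
s_x, s_y = stdev(speed), stdev(distance)
r = sum((x - x_bar) * (y - y_bar)
        for x, y in zip(speed, distance)) / ((n - 1) * s_x * s_y)
a_sd = r * s_y / s_x

b = y_bar - a_sums * x_bar   # intercept: the line passes through (x-bar, y-bar)

print(round(a_sums, 4), round(a_sd, 4), round(b, 4))   # 3.1661 3.1661 -55.7966
```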

Coefficient of Determination, R²

The last number we look at (third on the screen) is
R², the coefficient of determination.
(The calculator displays r², but
the capital letter is standard notation.)
R² measures the quality of the regression line as a means of
predicting ŷ from x:
the closer R² is to 1, the better the line.
Another way to look at it is that R² measures
how much of the total variation in y is predicted by the line.

In this case R² is about 0.88, so your
interpretation is “about 88% of the variation in distance
traveled is associated with variation in club-head speed.”
Statisticians say that R² tells you how much
of the variation in y is “explained” by variation in x, but
if you use that word remember that it means a numerical association,
not necessarily a cause-and-effect explanation. It’s best to
stick with “associated” unless you have done an experiment
to show that there is cause and effect.

There’s a subtle difference between r and R², so
keep your interpretations straight. r talks about the
strength of the association between the variables; R² talks about
what part of the variation in the y variable is associated with
variation in the x variable. Your interpretation of R² should not
use any form of the word “correlated”.

Only linear regression will have a correlation coefficient r,
but any type of regression — fitting any line or curve to
a set of data points — will have a coefficient of determination
R² that tells you how well the regression equation predicts y
from the independent variable(s). Steve Simon gives an example for
non-linear regression in
R-squared.
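Because R² needs only the measured y values and the predicted values, it can be computed the same way whatever kind of equation produced the predictions. Here is a minimal Python sketch, assuming the common definition of R² as 1 minus the ratio of the residual sum of squares to the total sum of squares; the function name r_squared is my own.

```python
def r_squared(actual, predicted):
    """1 - (sum of squared residuals) / (total sum of squares about the mean)."""
    mean_y = sum(actual) / len(actual)
    ss_total = sum((y - mean_y) ** 2 for y in actual)
    ss_resid = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_resid / ss_total

speed = [100, 102, 103, 101, 105, 100, 99, 105]        # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # y, yards

# Predictions from the linear regression equation found in Step 2.
# Predictions from a curve would be passed in exactly the same way.
line = [3.1661 * x - 55.7966 for x in speed]

print(round(r_squared(distance, line), 2))   # 0.88
```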

Step 3. Display the Regression Line

Show line with original data points.

[◆] [F3] (GRAPH)

What is this line, exactly? It’s the one
unique line that fits the plotted points best. But what does
“best” mean?

The same four points on left and right. The vertical distance
from each measured data point to the line, y−ŷ, is
called the residual for that x value. The line on the right is better
because the residuals are smaller.
source: Dabes & Janik [full citation at https://BrownMath.com/swt/sources.htm#so_Dabes1999]

For each plotted point, there is a
residual equal to y−ŷ, the difference between
the actual measured y for that x and the value predicted by the line.
Residuals are positive if the data
point is above the line, or negative if the data point is below the
line.

You can think of the residuals as measures of how bad the
line is at prediction, so you want them small. For any possible
line, there’s a “total badness” equal to taking all
the residuals, squaring them, and adding them up. The
least squares regression line means the line that is best
because it has less of this “total badness” than any other
possible line. Obviously you’re not going to try different lines
and make those calculations, because the formulas built into your
calculator guarantee that there’s one best line and this is
it.
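Just to make the idea concrete, here is a Python sketch that does try a few other lines on the golf data and compares their “total badness” with that of the least squares line. The three alternative lines are arbitrary choices for illustration, and total_badness is a name made up for this example.

```python
speed = [100, 102, 103, 101, 105, 100, 99, 105]        # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # y, yards

def total_badness(a, b):
    """Sum of squared residuals for the line y-hat = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(speed, distance))

# Least squares slope and intercept from the standard formulas.
n = len(speed)
x_bar, y_bar = sum(speed) / n, sum(distance) / n
a = (sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))
     / sum((x - x_bar) ** 2 for x in speed))
b = y_bar - a * x_bar

best = total_badness(a, b)
others = {
    "steeper line": total_badness(3.5, -90.0),
    "flatter line": total_badness(3.0, -39.0),
    "same slope, shifted up 1 yard": total_badness(a, b + 1),
}

print(round(best, 2))                               # 49.86
for name, value in others.items():
    print(name, round(value, 2), value > best)
```

Every alternative comes out worse, as it must.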

optional appendix: Display the Residuals

I would like you to know the material in
this section, but it’s not part of the MATH200 syllabus so I
don’t
require it. No homework or quiz problems will draw from this
section. You will, however, need to calculate individual residuals;
see the last section of Finding ŷ from a Regression on TI-89.

“No regression analysis is complete without a display of the
residuals to check that the linear model is reasonable.”

The residuals are automatically calculated during the
regression, and stored in a resid list in your Stats/List Editor.
All you have to do is plot them on the y axis against your existing x
data. This is an important final check
on your model of the straight-line relationship.

Return to the editor; notice that a resid list
has appeared and contains the residuals.

Mark: anything except dot (because a data dot looks just like
a dot on the grid)

X: [alpha] [L] [X]

Y: To get statvars\resid, press [2nd] [−] (VAR-LINK) and
scroll down to STATVARS. Press [►]
to expand it if necessary. Scroll down to resid and
press [ENTER].

Use Freq and categories: NO

Press [ENTER] to complete the definition.

Display the plot.

[F5] displays the plot.

You want the plot of residuals versus x to be “the most
boring scatterplot you’ve ever seen”, in
De Veaux’s words (page 203). “It shouldn’t have
any interesting features, like a direction or shape. It should stretch
horizontally, with about the
same amount of scatter throughout. It should show
no bends, and it should have no outliers. If you see
any of these features, find out what the regression model
missed.”

Don’t worry about the size of the residuals,
because [F5] adjusts the vertical scale so that they
take up the full screen.

If the residuals are more or less evenly distributed above and
below the axis and show no particular trend, you were probably right
to choose linear regression. But if there is a trend, you have probably
forced a linear regression on non-linear data. If your data points
looked like they fit a straight line but the residuals show a trend,
it probably means that you took data along a small part of a
curve.

Here there is no bend and there are no outliers. The scatter
is pretty consistent from left to right, so you conclude that
distance traveled versus club-head speed really does fit the straight-line model.

Residual Plot Showing Problems

Refer back to the scatterplot of f/stop
against shutter speed.
I said then that it was not a straight
line, so you could not do a linear regression. If you missed the bend
in the scatterplot and did a regression anyway, you’d get a correlation
coefficient of r = 0.98, which would encourage you to rely on the
bad regression. But plotting the residuals (at right) makes it
crystal clear that linear regression is the wrong type for this data
set.

This is a textbook case (which is why it was in a textbook):
there’s a clear curve with a bend, variation on both sides of
the x axis is not consistent, and there’s even a likely
outlier.
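A crude numerical stand-in for eyeballing the residual plot is to read off the signs of the residuals from left to right. The Python sketch below does this for the f/stop data and, for comparison, for the golf data. The helper names are made up for this example, and a sign pattern is only a rough summary, not a substitute for looking at the plot.

```python
from fractions import Fraction

def residuals(xs, ys):
    """Residuals y - y-hat from the least squares line through (xs, ys)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    a = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    b = y_bar - a * x_bar
    return [y - (a * x + b) for x, y in zip(xs, ys)]

def sign_pattern(xs, ys):
    """Signs of the residuals, read from left to right along the x axis."""
    ordered = sorted(zip(xs, residuals(xs, ys)))
    return "".join("+" if e > 0 else "-" for _, e in ordered)

shutter = [float(Fraction(1, d)) for d in (1000, 500, 250, 125, 60, 30, 15, 8)]
fstop = [2.8, 4, 5.6, 8, 11, 16, 22, 32]

speed = [100, 102, 103, 101, 105, 100, 99, 105]
distance = [257, 264, 274, 266, 277, 263, 258, 275]

print(sign_pattern(shutter, fstop))    # ---++++-
print(sign_pattern(speed, distance))   # +-++-+-+
```

For the f/stop data the residuals run negative, then positive, then negative again: the line cuts across a curve. For the golf data the signs alternate with no pattern.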

advanced: Residuals and R²

I said in Step 2 that the coefficient of
determination measures the variation in the measured y associated with
the measured x. Now that we have the residuals, we can make that
statement more precise and perhaps a little easier to understand.

The set of measured y values has a spread, which can be
measured by the standard deviation or the variance. It turns out to be
useful to consider the variation in y’s as their variance. (You
remember that the variance is the square of the standard
deviation.)

The total variance of the measured y’s has two
components: the so-called “explained” variation, which is
the variation along the regression line, and the
“unexplained” variation, which is the
variation away from the regression line.
The “explained” variation is simply
the variance of the ŷ’s, computing ŷ for every x,
and the “unexplained” variation is the variance of the
residuals. Those two must add up to the total variance of the measured
y’s, which means that if we express them as percentages of the
variation in y then the percentages must add to 100%. So
R² is the percent of “explained” variation in the regression,
and
100%−R² is the percent of “unexplained” variation.

R² = s²ŷ / s²y   and   1 − R² = s²e / s²y

Now I can restate what you learned
in Step 2. R² is 88%
because 88% of the variance in y is associated with the regression
line, and the other 12% must therefore be the variance in the
residuals. This isn’t hard to verify: do a 1-VarStats on the
list of measured y’s and square the standard deviation to get
the total variance in y, s²y = 59.93. Then do
1-VarStats on the residuals list and square the standard deviation to
get the “unexplained” variance, s²e =
7.12. The ratio of those is 7.12/59.93 = 0.12, which is
1−R². Expressing it as a percentage gives
100%−R² = 12% so 12% of the variation in measured
y’s is “unexplained” (due to lurking variables,
measurement error, etc.).
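The same verification can be sketched in Python, with statistics.variance playing the role of squaring the standard deviation from 1-VarStats. The variable names are my own; fitted holds ŷ for every x.

```python
from statistics import variance

speed = [100, 102, 103, 101, 105, 100, 99, 105]        # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # y, yards

# Least squares line from the standard formulas.
n = len(speed)
x_bar, y_bar = sum(speed) / n, sum(distance) / n
a = (sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))
     / sum((x - x_bar) ** 2 for x in speed))
b = y_bar - a * x_bar

fitted = [a * x + b for x in speed]                    # y-hat for every x
resid = [y - f for y, f in zip(distance, fitted)]      # y - y-hat

var_y = variance(distance)    # total variance of the measured y's
var_hat = variance(fitted)    # "explained" variance
var_e = variance(resid)       # "unexplained" variance

print(round(var_y, 2), round(var_e, 2), round(var_e / var_y, 2))   # 59.93 7.12 0.12
print(round(var_hat + var_e, 2))                                   # 59.93
```

The two components add back up to the total variance, and the unexplained share is 12%, leaving 88% for R².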

3 Jan 2015: “Scatterplot” is now spelled
consistently as a single word, following
Upton (2008) [full citation at https://BrownMath.com/swt/sources.htm#so_Upton2008] and
DeVeaux (2009) [full citation at https://BrownMath.com/swt/sources.htm#so_DeVeaux2009].