In this section we will describe how to use the Excel
2010 for Window to carry out some of the procedures we have
described in the book. While this approach uses Excel 2010, I don't
think that the regression procedure has changed (much) between
different
versions.

To do regression in Excel, you need the Analysis Toolpak add-in to be
installed in Excel.

Check that the add-in is installed, and added-in, by clicking the File
menu, and choosing Options.

Click the button that says "Go" at the bottow of the window.

Make sure that "Analysis ToolPak" is selected.

You can now use the data analysis functions in Excel, which include
multiple regression.
The example that we will work through is taken from dataset 6.1b in the
book "Applying regression and correlation" (if you jumped straight in
here,
that is what these web pages is about. Don't worry about that though,
you
can download the data from here
.)

To get to the data analysis functions in Excel, you select the Data
menu, and then choose Data Analysis (it's on the
right)..

This gives the following Dialog, click on Regression
and then click
OK.

The following window appears:

In here, we tell Excel about the data that we would like to analyse.
The first box is the input Y range. Here, we tell Excel about
our outcome
variable. The ou variable must be a column, 1 cell
wide, and
N cells long (where N is the number of individuals that we are
analysing).
The the dataset we are using, the dependent variable is Anx, which is
the
column which goes from cell D1 to Cell D41. You can either
type this
information in directly as D1:D41, or you can select the appropriate
data
from the spreadsheet with the mouse.

Because we have included row 1, which includes the variable name, we
are
going to have to tell Excel this, by clicking on the "Labels" checkbox.

The next stage is to input the predictor variables. The
predictor
variables must be a block of data, of k columns (where k is the number
of predictor variables) and N rows (where N is still the number of
people).
In the dataset we are using we have three independent
variables: hassles,
hassles2 and hassles3. (These represent the linear, quadratic
and cubic
effects of hassles - we are analysing a non-linear relationship here,)
These
are held in rows 1 - 41 of columns A, B and C. Again, we can
type in
A1:C41 or select the data from the spreadsheet with the mouse- it will
have the same
effect.

Next we tell Excel where we want the results to be written.
It is best
to ask for a new sheet - you don't want to accidentally overwrite some
of
your precious data, and have to go to all of the effort of restoring it
from
a backup,. (You do have a backup, don't you?)

We can ask fro residuals and standardised residuals to be saved - these
will
be new columns of numbers created in the the new spreadsheet.

Two types of graphs will be drawn automatically if you ask for them.

A residual plot will draw scatter plots of each
independent variable
on the x-axis, and the residual on the y-axis.

A line fit plot will draw scatterplots of each
independent variable
on the x-axis, and the predicted and actual values of the dependent
variable
on the y axis.

You cannot, as far as I have been able to determine, automatically have

A scatterplot with the predicted values on the
x-axis, and the
residuals on the y-axis (although you can calculate these values and
save
them.)

You can also request a normal probability plot. This appears to be a
plot
of the dependent variable, which is a curious thing to plot -
regression
analysis does not assume normal distribution of the dependent
variable.
The usual plot of this type would be the residuals, but this is not
possible
in Excel.

The dialog box now looks like this:

.
So, finally, we click OK.

And we get a lot of output, written to a new sheet. A note
about this
output - output from analysis in Excel is usually "live" that is to
say,
the data are linked to the output. If you change the data,
you will
change the output. This is not the case for this type of
output in
Excel. The results of the analysis are "dead" and will not
change.

The Results

The results presented below have been copied from Excel to HTML, so
they
don't look identical. You can download the Excel file exactly
as Excel
wrote it for me here
.

Regression Statistics

Multiple R

0.807582

R Square

0.652189

Adjusted R Square

0.623205

Standard Error

6.686129

Observations

40

The first part of the output is the
regression statistics.
These are standard statistics which are given by most
programs.

ANOVA

ANOVA

df

SS

MS

F

Significance F

Regression

3

3017.745

1005.915

22.50151

2.22E-08

Residual

36

1609.355

44.70432

Total

39

4627.1

The ANOVA table comes next.
This gives a test
of significance of the R2. Note that
Excel uses scientific
notation, by default, so when it says 2.22E-08 it means, 2.22 * 10-8
. (i.e. 0.0000000222).

Coefficients

Coefficients

Standard Error

t Stat

P-value

Lower 95%

Upper 95%

Lower 95.0%

Upper 95.0%

Intercept

12.39

3.90

3.18

0.00

4.48

20.29

4.48

20.29

HASSLES

-0.06

0.34

-0.19

0.85

-0.75

0.62

-0.75

0.62

HASSLES2

0.00

0.01

0.06

0.95

-0.02

0.02

-0.02

0.02

HASSLES3

0.00

0.00

0.51

0.61

0.00

0.00

0.00

0.00

The next stage is the coefficients.
Note that
here I have converted the numbers to 2 decimal places to save space).
It gives the coefficient for each parameter, including the intercept
(the
constant). The standard errors, and the t-values follow (the
t-value
is the coefficient divided by the standard error). Next comes
the p-value
associated with the variable, and the confidence intervals of the
parameter
estimates (Excel gave these to me twice, even though I didn't ask for
them.)

Residuals

Observation

Predicted ANX

Residuals

Standard Residuals

1.00

12.08

-2.08

-0.32

2.00

11.82

0.18

0.03

3.00

15.98

5.02

0.78

4.00

29.86

-13.86

-2.16

5.00

28.56

-1.56

-0.24

6.00

34.16

-4.16

-0.65

7.00

12.32

-3.32

-0.52

8.00

12.37

-5.37

-0.84

9.00

27.31

4.69

0.73

10.00

15.64

-4.64

-0.72

The final part of the output is the
residual information.
The observation in the left had column is the case number -
although
Excel never told us about this, it has labelled the first person
Observation
1, the second Observation 2, etc. (Note
that this is NOT
the original row number - Observation 1 was row 2).
The predicted anxiety score is the score that was predicted from the
regression
equation. The residual is the raw residual - that is the
difference
between the predicted score and the actual score on the dependent
variable.
The final value is the standardised residual (the residuals adjusted to
ensure
that they have a standard deviation of 1; they have a mean of zero
already).

Graphs

Finally we will have a quick look at the graphs.

We asked for (and we got) residual plots - but what we really wanted
was the plot of the residuals against the predicted values.
In linear regression, this would be fine. In multiple
regression, it's not what we want.

The second graphs shows the predicted and actual anxiety scores plotted
against
hassles. This graph is useful, as it shows us the non-linear nature of
the relationship.

A Note to End On
I have written this file which shows you how to do regression in Excel,
but
this does not mean that I think that you should be doing regression in
Excel.
Regression in Excel has a number of shortcomings, which
include:

No standardised coefficients.
It can be very
difficult to interpret unstandardised coefficients. You could
calculate
the standardised coefficients using the unstandardised coefficients, if
you
really wanted to. But you could have done the regression on
your own,
if you really wanted to.

Lack of diagnostic graphs.
The standard diagnostic
graphs are not available in Excel, e.g. the normality plot of the
residuals,
the scatterplot or residuals against predicted values. Again,
you can work them out, but it ain't easy.

Lack of Diagnostic statistics.
There are no collinearity
diagnostics, which would help you to understand what was happening in
the
data that we analysed above (highly significant R2,
but no significant
parameters in the model). You could calculate the skew
statistics,
but it's a pain.

Lack of features. There
is no hierarchical regression,
no weighting cases, etc, etc, etc...

Inflexibility. If you
want to run a slightly different
analysis, it is hard work, because you have to move your data around, a
process
which is prone to errors.