Multiple regression, from the program in the Multiple Regression Program Page, is one of the most flexible
and powerful statistical tools available to the researcher, as it allows the modelling of multiple influences on an outcome,
correcting for the overlapping influences of the independent variables. For those who are familiar with the concepts, the algorithm
of multiple regression can be used to calculate a large number of other parametric statistical procedures.

Most professional statistical packages provide large numbers of complex statistical procedures based on multiple
regression, under the broad heading of the General Linear Model. StatTools provides the following algorithms
based on multiple regression.

Partial Correlation Coefficient (PCor) is the correlation between an independent variable (x) and the dependent variable (y),
having corrected for inter-correlations between all the independent variables.

Partial Standardised Regression Coefficient (PSReg) is the regression coefficient between an independent variable (x) and the
dependent variable (y), having corrected for inter-correlations between all the independent variables, after rescaling both
to a mean of 0 and a Standard Deviation of 1. This is free of measurement units, and is used for comparing the relative
scale of influence from different independent variables.

Partial Regression Coefficient (PReg or b) is the regression coefficient between an independent variable (x) and the
dependent variable (y), having corrected for inter-correlations between all the independent variables. This is the b used in
the regression formula y = a + b1x1 + b2x2 + b3x3 ...

Standard Error of the Partial Regression Coefficient (SE)

t = b/SE, and α is the Probability of Type I Error (two tail) of t with the residual degrees of freedom

Constant (a) is the a in the formula y = a + b1x1 + b2x2 + b3x3 ...
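As an illustration of what these quantities come from, the constant a and the coefficients b can be recovered by ordinary least squares. This is a minimal sketch, assuming Python with numpy (not the StatTools code itself), on noise-free synthetic data where the true formula is known:

```python
import numpy as np

# Synthetic data built from a known formula: y = 5 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 5 + 2 * x1 - 3 * x2          # no noise, so the fit is exact

# Design matrix: a leading column of 1s carries the constant (a)
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef                 # a = 5, b1 = 2, b2 = -3, up to floating point error
```

With real (noisy) data, the same calculation yields estimated coefficients rather than the exact generating values.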

Please note : in the table of analysis of variance, although the model Degrees of Freedom is the sum of the
regression Degrees of Freedom, the model Sums of Squares is greater than the sum of the Sums of Squares from all the regression
coefficients.
This is because each individual Sums of Squares describes the pure influence on y from that x variable, while the model sums
all of them, and adds on top those Sums of Squares that overlap between the independent x variables. It is this difference
that provides the very powerful analysis of variance in complex models, where multiple measurements often have various
degrees of correlation with each other, and their pure influences and overlapping influences need to be separately accounted for.

Multiple Regression as Entered, and with Stepwise deletion
The program in the Multiple Regression Program Page
provides two options for conducting multiple regression

The as Entered model calculates multiple regression once, using all the entered data. This is the preferred model if the
intention is to provide a description of the relationship between the variables, or if the calculation is used to obtain
parameters for other complex statistical purposes.

The Stepwise Deletion model carries out repeated multiple regression analyses on the data entered, deleting the weakest
independent variable after each cycle. This is the preferred model when developing a predictive algorithm, where the
researcher starts with a large number of plausible predictors, and eliminates the weaker ones serially to obtain the most
powerful yet most parsimonious (fewest predictors) formula.

The algorithm in the program continues until only 1 independent variable is left, allowing the user to determine the number of
independent variables to retain in the final formula. This can be done arbitrarily by judgement, but in most cases
the decision is to retain only those independent variables where the Partial Regression Coefficient (b) is
statistically significant (α<0.05)
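The deletion loop can be sketched in a few lines. This is a hypothetical illustration assuming Python with numpy (the StatTools program uses its own implementation); the t statistic of each coefficient decides which predictor is weakest:

```python
import numpy as np

def t_values(X, y):
    """t statistics for each OLS coefficient (X includes the constant column)."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - k)                 # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef / se

rng = np.random.default_rng(1)
n = 100
strong = rng.normal(size=n)                          # genuine predictor
weak = rng.normal(size=n)                            # pure noise predictor
y = 1 + 3 * strong + rng.normal(scale=0.5, size=n)

names = ["strong", "weak"]
cols = [strong, weak]
# Each cycle: refit, then delete the predictor with the smallest |t|,
# continuing until only one predictor is left
while len(cols) > 1:
    X = np.column_stack([np.ones(n)] + cols)
    t = t_values(X, y)[1:]                           # skip the constant's t
    drop = int(np.argmin(np.abs(t)))
    names.pop(drop)
    cols.pop(drop)
```

On this synthetic data the noise predictor is deleted first, leaving only the genuine one; in practice the user would inspect the α values at each cycle rather than run the loop to exhaustion.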

Sample size for Multiple Regression
The sample size program for multiple regression in the Multiple Regression Program Page
uses a modified version of that
for comparing multiple groups of measurements in the Sample Size for Unpaired Differences Tables Page,
but uses the number of
independent variables and the Multiple Correlation Coefficient R to represent the number of groups and the residual variance.
The calculations require multiple iterative approximations, so computation time increases exponentially with the number of
independent variables, and with decreasing values of R. Users are encouraged to consult the tables in the
Sample Size for Multiple Regression Explained and Tables Page
for their sample size needs.

Example 1 : Sample Size

We wish to study whether we can predict birthweight from maternal age, height
and weight, as well as gestational age and the sex of the baby: 5 independent
variables or predictors. We want this model to be clinically useful, so we
require a moderate effect size of R=0.5

Example 2 : Multiple Regression as Entered

We use the default example data from the Multiple Regression Program Page
for this exercise. The data
were computer generated to demonstrate the procedure and are not real.

We wish to explain the factors that may influence the birth weight of babies, these being maternal age (years) and height (cms),
the gestational age at birth (weeks), and whether the baby is a girl (1) or a boy (0). We collected 22 subjects, with the data shown on the left.

Var      mean    SD
1.age    27.0    4.3
2.Ht     165.2   3.2
3.Gest   37.8    2.1
4.Sex    0.5     0.5
5.BWt    3114    533

Please note : The data are in columns separated by spaces or tabs, and
the dependent variable (BWt) is in the last column.

We first produce the means and standard deviations of all the variables, as shown to the right;
the last variable (5.BWt) is the dependent variable.

         1.age   2.Ht    3.Gest  4.Sex   5.BWt
1.age    1       0.26    -0.25   -0.38   -0.10
2.Ht     0.26    1       0.08    -0.13   0.24
3.Gest   -0.25   0.07    1       -0.11   0.92
4.Sex    -0.38   -0.13   -0.11   1       -0.32
5.BWt    -0.10   0.24    0.92    -0.32   1

The correlation matrix is produced next, as shown on the right.

The multiple regression analysis now takes place. Please note that the abbreviations
for the coefficients table are as follows.

PCor = Partial Correlation Coefficient. This is the correlation between
the variable and the dependent variable after correction for inter-correlation
between the independent variables.
PSReg = Partial Standardised Regression Coefficient. This measures the
influence of each independent variable on the dependent variable, using z or
standardised units. For example, for 1 SD of change in maternal age, 0.01 SD of
change occurs in birthweight. For 1 SD of change in gestation, 0.9 SD of
change occurs in birthweight.
PReg = Partial Regression Coefficient. This measures the change in
the dependent variable for each unit of change in the independent variable.
For example, for an increase of 1 year in age, the baby weighs 1.7g more. For
each week of maturing, the baby weighs 223g more. Girls are 209g lighter.
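A quick arithmetic sketch of the rescaling behind PSReg, using the rounded summary statistics from the tables above (the result, about 0.88, differs slightly from the tabulated 0.8952 because of rounding in the SDs):

```python
# PSReg = PReg * SD(x) / SD(y), which removes the measurement units
b_gest = 223.19    # g of birth weight per week of gestation (from the coefficients table)
sd_gest = 2.1      # SD of gestation in weeks (from the summary table)
sd_bwt = 533.0     # SD of birth weight in g (from the summary table)

psreg_gest = b_gest * sd_gest / sd_bwt   # roughly 0.88, close to the tabulated 0.8952
```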

var      PCor      PSReg     PReg       SE        t         α
1.age    0.0418    0.0137    1.701      9.8641    0.1724    0.8653
2.Ht     0.4395    0.1417    23.6492    11.7243   2.0171    0.0608
3.Gest   0.9493    0.8952    223.1943   17.9205   12.4547   <0.0001
4.Sex    -0.5476   -0.2009   -209.15    77.5107   -2.6983   0.0158
Const = -9165.48 R = 0.961 R2 = 0.9236

SE = standard error of the Partial Regression Coefficient.
t = t test for that Partial Regression Coefficient.
α (p) = the probability of Type I Error (α) for that Partial Regression Coefficient.
Const = the constant of the equation. In this case, BWt in g = -9165 + 1.7(age in years)
+ 23.6(height in cms) + 223.2(gestation in weeks), minus 209.2 if the baby is a girl.
R = the Multiple Correlation Coefficient. This is the effect size of the equation.
R Sq is R2, the proportion of the total variance that is explained by the regression.

It should be noted that, although the sum of the degrees of freedom from all the independent variables equals that of the
model as a whole (in this example both = 4), this is not so for the Sums of Squares unless the independent variables are all uncorrelated with each other.
Otherwise the sum of all the individual Sums of Squares is usually less than that of the model as a whole
(in this example 4460683 and 5503642). This is because,
for each variable, the Sum of Squares tabulated is that unique to itself, excluding
the part it shares by correlation with other independent variables. The missing value, the difference between the model ssq and
the sum of those from the individual variables (5503642-4460683=1042959), is that attributable to the overlaps and
correlations between the independent variables.
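This decomposition can be demonstrated numerically: fit the model with and without each variable, take the drop in model Sums of Squares as that variable's unique contribution, and compare the sum of the unique contributions with the model Sums of Squares. A sketch with synthetic correlated predictors, assuming Python with numpy:

```python
import numpy as np

def model_ss(X, y):
    """Regression (model) Sums of Squares for an OLS fit with a constant column."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    return np.sum((fitted - y.mean()) ** 2)

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # x2 deliberately correlated with x1
y = x1 + x2 + rng.normal(size=n)

ones = np.ones(n)
full = model_ss(np.column_stack([ones, x1, x2]), y)
# A variable's unique SS is what the model loses when that variable is dropped
unique_x1 = full - model_ss(np.column_stack([ones, x2]), y)
unique_x2 = full - model_ss(np.column_stack([ones, x1]), y)
overlap = full - (unique_x1 + unique_x2)
```

Because x1 and x2 are correlated and act in the same direction here, the overlap is positive and the model SS exceeds the sum of the unique contributions, mirroring the example above.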

Example 3 : Multiple Regression with Stepwise Deletion

Instead of aiming to understand the relationship between independent and dependent variables, we wish to establish
the most efficient formula to predict birthweight. Efficiency is defined as the most accurate prediction with the
fewest independent variables. We decided to use α(p)>0.05 to delete those variables that are
inefficient predictors.

var      PCor    PSReg   PReg      SE      t         α
2.Ht     0.46    0.14    24.18     11.0    2.1985    0.042
3.Gest   0.95    0.89    222.14    16.38   13.5577   <0.0001
4.Sex    -0.59   -0.21   -214.61   68.83   -3.118    0.0063

constant (a) = -9165.37

From the first cycle of calculation in the previous example, we determined that maternal age (PSReg=0.01, t=0.17, α=0.87)
can be deleted. In the second cycle, we found the results as shown to the right.
All 3 remaining predictors now have statistically significant Partial Regression Coefficients (α<0.05), so
no further deletion is necessary, and the final prediction formula is
BWt (g) = -9165 + 24.2(height in cms) + 222.1(gestation in weeks) - 214.6 (if the baby is a girl)

Please note : the program in the Multiple Regression Program Page
progressively deletes
the least significant variable at each cycle of calculation until only one variable is left in the equation. The user, however,
should examine the results at the end of each cycle, and decide when the stepwise deletion should stop. In this example,
stepwise deletion is stopped after the first cycle, and only maternal age had been deleted, because the decision
to delete was based on α>0.05

Concepts and Background : One Way Analysis of Variance and Covariance, and Factorial Analysis of Variance and Covariance

Introduction and Theoretical Considerations / Technical Considerations

This section explains the relationship between multiple regression and the general model of analysis of variance and covariance.
This is done for the following reasons.

To demonstrate the underlying principles of the least squares statistical approach to the analysis of variance

To provide an understanding of One Way Analysis of Variance, the Factorial model of Analysis of Variance, and the Analysis
of Covariance

To provide a guideline on how to conduct complex Analysis of Covariance, step by step, using the algorithm of
multiple regression. Although this may still be of interest to some, it is mostly superseded by commercially
available statistical packages, which will perform the procedures with check boxes for options and a click of a button.

For those who do not have a clear understanding of Analysis of Covariance, the following minimal and very basic
terms and descriptions may be useful.

Variance is the square of the Standard Deviation, and it measures the variation in a measurement

The Analysis of Variance partitions the variance of the dependent variable according to those factors that influence it.

In the simplest model, the analysis of variance is summarised by the t test. For example, how is the variance in
birth weight influenced by the sex (male or female) of the baby, a single comparison of the two sexes

When there are more than two groups, the general model of One Way Analysis of Variance is used. For example, how do
three different ethnic origins (say Greeks, Germans, and Slavs) influence the birth weight of the baby? With three
groups there are 3 comparisons: Greeks vs Germans, Greeks vs Slavs, and Germans vs Slavs.

When two sets of influences (Factors) are involved (say sex and ethnicity), then a Two Way Analysis of Variance is used;
with more, a Multiway Analysis of Variance. However, there may be systematic or accidental correlations between factors
(say Greeks have more girls than Germans), and these are called Interactions between Factors. The Analysis of Variance
which separates those variances unique to each factor from those that overlap between factors is known as the
Factorial Model of Analysis of Variance.

If, on top of all of this, as is usually the case, there are other influences to be taken into consideration, such as
differences in birth weight needing to be corrected for gestational age, then each of these corrections is
termed a covariate, and the combined analysis becomes Covariance Analysis.

Things now start to become a bit complicated, because each covariate may act differently in different factors; say
German babies grow faster than Slav babies near term. This is called an Interaction between a factor and a covariate.

The total number of interactions is therefore the product of the numbers of covariates and factors. As these increase, the model becomes
complex and confusing.

To be correct, the results of a covariance analysis are only valid if all possible interactions are tested and found to be
trivial (not statistically significant). In a review of the literature, however, most do not bother, and assume that
interactions are either irrelevant or do not exist.

This panel describes to the reader the organisation of the explanations, and the example data used, in the rest of this section.

The rest of the sections are divided as follows

One Way Analysis of Variance and covariance, with the following examples

Analysis using two groups (sex of the baby) and a covariate (gestation)

Analysis using three groups (ethnicity of the mother) and a covariate (gestation)

Factorial Analysis of Variance and Covariance, with two factors (sex and ethnicity) and a covariate (gestation).

Sex    Ethnicity  Gest  BWt
Girl   Greek      37    3048
Boy    German     36    2813
Girl   French     41    3622
Girl   Greek      36    2706
Boy    German     35    2581
Boy    French     39    3442
Girl   Greek      40    3453
Boy    German     37    3172
Girl   French     35    2386
Boy    Greek      39    3555
Girl   German     37    3029
Boy    French     37    3185
Girl   Greek      36    2670
Boy    German     38    3314
Girl   French     41    3596
Boy    Greek      38    3312
Girl   German     39    3200
Boy    French     41    3667
Boy    Greek      40    3643
Girl   German     38    3212
Girl   French     38    3135
Girl   Greek      39    3366

The algorithm used to obtain the results will be multiple regression (the as Entered model), as calculated in the
Multiple Regression Program Page. Out of all the results produced, the parameters used for
Analysis of Variance and Covariance are

The constant (a) and regression coefficients (b) of the regression equation

The degrees of freedom (df) and Sums of Squares (ssq) from the Analysis of Variance table

The dataset used for this exercise, as tabulated to the right and plotted to the left, was artificially generated by the computer to demonstrate the procedures, and it does not represent reality.
Users should also understand that real analysis requires a much larger number of cases than presented here.

There are 4 German boys (red) and 3 German girls (maroon), 3 Greek boys (light green) and 5 Greek girls (dark green),
3 French boys (blue) and 4 French girls (navy). All sex and ethnicity in subsequent plots will be identified by these colors.

Two Groups / Three Groups

Sex

Gest

BWt

Boy

36

2813

Boy

35

2581

Boy

37

3172

Boy

38

3314

Boy

39

3555

Boy

38

3312

Boy

40

3643

Boy

39

3442

Boy

37

3185

Boy

41

3667

Girl

37

3029

Girl

39

3200

Girl

38

3212

Girl

37

3048

Girl

36

2706

Girl

40

3453

Girl

36

2670

Girl

39

3366

Girl

41

3622

Girl

35

2386

Girl

41

3596

Girl

38

3135

We will use the data set and analyse the difference in birth weight between boys and girls, for the moment ignoring
ethnicity. The re-arranged data table is as shown to the right, and the plot as shown to the left.

However, if we use the regression model in the Multiple Regression Program Page,
with x=0 for boys and x=1
for girls, and y = birth weight, we obtain the formula birth weight (y) = 3268 - 149(girls). This means that the birth weight
is 3268g when x=0 (boys), and is reduced by 149g when sex is 1 (girl). The t for the regression coefficient, -0.95, is also the same
as that obtained using the algorithm to compare the two groups.

In other words, the regression algorithm produces the same results as the analysis of variance for two groups.
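This equivalence can be sketched numerically. A minimal illustration, assuming Python with numpy and made-up birth weights (not the page's dataset): with 0/1 coding, the fitted constant is the boys' mean and the coefficient is the girl-boy difference in means.

```python
import numpy as np

# Hypothetical birth weights in grams
boys = np.array([3000.0, 3200.0, 3400.0])
girls = np.array([2800.0, 3000.0, 3100.0])
y = np.concatenate([boys, girls])
sex = np.array([0, 0, 0, 1, 1, 1], dtype=float)  # 0 = boy, 1 = girl

# Regression on the 0/1 dummy: constant = mean(boys), slope = mean(girls) - mean(boys)
X = np.column_stack([np.ones_like(sex), sex])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
```

The t test for b is then algebraically the same as the unpaired t test between the two groups, which is why the regression reproduces the two-group comparison.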

One Way Analysis of Variance with a covariate
The One Way Analysis of Variance showed that there was no significant difference between the birth weights of boys and girls. This is because a much greater influence, gestational age, obfuscated the difference, as can be seen in the diagram.
One method of correcting for the influence
of gestational age is to draw two regression lines and compare them, using the program in the
Compare Two Regression Lines (Covariance Analysis) Program Page
. Submitting the data to that program will produce the following results.

In other words, the growth rates between boys and girls are not significantly different, at 186g/week. Having
corrected for growth rates, girls are 186g lighter than boys, which is statistically significant.

Sex  Gestation  Ia  BWt
0    36         0   2813
0    35         0   2581
0    37         0   3172
0    38         0   3314
0    39         0   3555
0    38         0   3312
0    40         0   3643
0    39         0   3442
0    37         0   3185
0    41         0   3667
1    37         37  3029
1    39         39  3200
1    38         38  3212
1    37         37  3048
1    36         36  2706
1    40         40  3453
1    36         36  2670
1    39         39  3366
1    41         41  3622
1    35         35  2386
1    41         41  3596
1    38         38  3135

We will now use the multiple regression model, and introduce the concept of interaction. Before we combine the influences
of gestational age and sex on birth weight, we must first assure ourselves that the influence of gestation is not different in
the two sexes, i.e. that boys do not grow faster or slower than girls near term.

We therefore create a new variable, the interaction (Ia), where Ia = sex * Gestation, so that the data to be used are as shown to the right. We then analyse this set of data using multiple regression and obtain the following results (rounded to the nearest whole number).

The interaction coefficient = 2, t = 0.07, not statistically significant, and is the same as the difference between the two slopes in the previous calculation

Had there been a significant interaction, we would not be able to proceed, as the adjustment for gestation would need to be different in the two sexes. As there is no significant interaction, the multiple regression analysis can now be repeated without the interaction term, and the result is Birth Weight (g) = -3808 - 165(girls) + 186(Gestation in weeks). In other words, having corrected for the
influence of gestation, girls are 165g lighter than boys.

The whole point of this exercise, analysing the same data using both a comparison of two regression lines and multiple regression,
is to demonstrate the principle underlying covariance analysis, and to demonstrate what an interaction in a multivariate set
of calculations is all about. To summarise:

Multiple regression can be used to analyse multivariate statistical data

In the multivariate situation, there is a need to check for interaction, i.e. that the influence of one variable on the outcome
is not affected by another influence.
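The construction of Ia and the check that its coefficient vanishes when no interaction exists can be sketched as follows, assuming Python with numpy, and using the example's fitted surface (-3808, -165, 186) only to generate exact synthetic data:

```python
import numpy as np

sex = np.tile([0.0, 1.0], 20)        # alternating boys (0) and girls (1)
gest = np.linspace(35, 41, 40)       # gestation in weeks
# Exact, interaction-free surface borrowed from the example's formula
y = -3808 - 165 * sex + 186 * gest

ia = sex * gest                      # the constructed interaction variable
X = np.column_stack([np.ones(40), sex, gest, ia])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef[3], the Ia coefficient, is essentially zero because the sexes share one slope
```

If the two sexes had different slopes, coef[3] would instead estimate the difference between those slopes, which is exactly what the interaction test examines.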

Ethnicity  Gest  BWt
German     36    2813
German     35    2581
German     37    3172
German     38    3314
German     37    3029
German     39    3200
German     38    3212
Greek      39    3555
Greek      38    3312
Greek      40    3643
Greek      37    3048
Greek      36    2706
Greek      40    3453
Greek      36    2670
Greek      39    3366
French     39    3442
French     37    3185
French     41    3667
French     41    3622
French     35    2386
French     41    3596
French     38    3135

We will use the data set and analyse the difference in birth weight between ethnic origins, for the moment ignoring the sex of the
baby. The re-arranged data table is as shown to the right, and the plot as shown to the left.

In the analysis of variance, F=0.81, α=0.46; the groups are not significantly different from each other.

Multiple Regression : Introducing the dummy variable

Multiple regression requires the independent variables to be at least ordinal (3>2>1). When there are multiple
groups which are not ordered, there is a need to create ordered dummy variables to represent them, using the following procedures.

The number of dummy variables is 1 less than the number of groups. For the current data of 3 ethnic groups, we will create
2 dummy variables, EthnicDummy1 (ED1) and EthnicDummy2 (ED2)

For each group, we assign one of the dummy variables as 1 and the remaining ones as 0; for the last group,
we assign 0 to all the dummy variables. It does not matter which group is assigned to which coding, providing they are identified
when the results are interpreted.

For German babies, where ED1=1 and ED2=0, the birth weight is 3290 - 245 = 3045g

For Greek babies, where ED1=0 and ED2=1, the birth weight is 3290 - 71 = 3219g

For French babies, where ED1=0 and ED2=0, the birth weight is the constant, 3290g

F for the model is 0.81, which is not statistically significant.

Except for the rounding error of 1g for German babies, these are the same results as those from the One Way Analysis of Variance

Analysis of Covariance for multiple groups.

ED1(German)  ED2(Greek)  Gestation  ED1G  ED2G  BirthWeight
1            0           36         36    0     2813
1            0           35         35    0     2581
1            0           37         37    0     3172
1            0           38         38    0     3314
1            0           37         37    0     3029
1            0           39         39    0     3200
1            0           38         38    0     3212
0            1           39         0     39    3555
0            1           38         0     38    3312
0            1           40         0     40    3643
0            1           37         0     37    3048
0            1           36         0     36    2706
0            1           40         0     40    3453
0            1           36         0     36    2670
0            1           39         0     39    3366
0            0           39         0     0     3442
0            0           37         0     0     3185
0            0           41         0     0     3667
0            0           41         0     0     3622
0            0           35         0     0     2386
0            0           41         0     0     3596
0            0           38         0     0     3135

The differences between ethnic groups have been found to be not statistically significant, but this may be caused by the
much greater influence of gestational age on birth weight, as can be seen in the plot above. The inclusion of gestational age
as a covariate is therefore necessary.

As the three ethnic groups have been converted into two dummy variables ED1 and ED2, the interactions between gestation and both
ED variables now need to be constructed. These are ED1G=ED1*Gest and ED2G=ED2*Gest. The data are now as shown to the right,
and the analysis follows these steps.

Step 1 : All 5 independent variables, ED1, ED2, Gestation,
ED1G, ED2G, plus the dependent variable BWt, are subjected to multiple regression analysis. Although the full data output
is produced by the program, we are only interested in the model degrees of freedom (5) and Sums of Squares (2544655).

Step 2 : The exercise is repeated, excluding the two interaction terms ED1G and ED2G. The 3 independent variables
ED1, ED2, and Gestation, plus the dependent variable BWt, are subjected to multiple regression analysis. Again, we are interested in
the degrees of freedom (3) and Sums of Squares (2527306)

Step 3 : Analysis of Interaction. Using the combined information from the two steps, we can now reconstruct the
Analysis of Variance table obtained initially in Step 1, as shown in the table to the right. The Probability of Type I Error
for F=0.49, with 2 and 16 degrees of freedom, is α=0.63, and we can conclude at this point that
no significant interaction exists between gestation and the ethnic origin of the babies. In other words, the growth rates
of babies near term are not different in the three ethnic groups.

Birth weight increases by 192g per week near term (t=11.8, α<0.001, statistically significant)

A French baby at term (40 weeks) averages 40*192 - 4166 = 3514g

German babies (ED1) are 84g heavier than French babies (t=1.13, α=0.27, not statistically significant)

Greek babies are 69g heavier than French babies (t=1.02, α=0.32, not statistically significant)

Comments : These simple steps demonstrate the mathematical sequence used to handle complex data using the multiple regression algorithm.

The creation of binary dummy variables to replace variables with multiple groups

The creation of interaction variables between different factors, where Interaction value = Factor1 value multiplied by Factor 2 value

The double analysis of variance, with and without the interaction variables, to isolate the interaction effect. This is necessary
because some correlation (and therefore overlapping effect) exists between different factors, and the double procedure allows
the overlap to remain with the main effects, so that the uncorrelated interactions can be isolated.

Only when there is no significant interaction can the covariance analysis be interpreted.

Two very important concepts involved when handling multivariate data are also demonstrated in this model.

Interaction, where the influence of one factor on the dependent variable is altered by another factor. Interactions can be
helpful or unhelpful, but they need to be defined, isolated, and interpreted. An example is that an interaction between sex and
gestation means boys and girls have different growth rates

Confounding, caused by correlations between factors, so that it is difficult or even impossible to identify how much each
factor affects the outcome. Confounding is always bad, as it results in misleading interpretations, and the greatest virtue of
multiple regression analysis is its ability to separate the unique and overlapping parts of the effects from multiple factors.
An example of correlation and confounding would be if girls were born earlier than boys, so that it is unclear
whether it is the sex or the gestation that affects birth weight.

Factorial Analysis of Variance / Factorial Analysis of Covariance

The Factorial model of Analysis of Variance was initially used in agriculture and animal laboratories, where subjects
(plants or animals) are randomly allocated to groups, each given a combination of two or more treatments. Such a model has
many advantages

The same subjects are used in a number of experiments simultaneously, thus greatly reducing the cost of research

In many cases, the combination of two treatments may have a greater (synergism) or lesser (antagonism) effect than the sum
of the individual treatments. These are called interactions, and they provide additional useful information

Mathematically, the Analysis of Variance calculates the effect of each treatment (single factors), then of groups of
combined treatments (combined factors). The difference between the combined effect and the sum of the single effects
then represents the interaction, which can be numerically presented and statistically tested.

The two important underlying assumptions in this model are, firstly, that the treatments must be randomly and
independently allocated, so there is no correlation between treatments, and secondly, that all groups and subgroups
at different levels have the same sample size.

The Factorial model is a powerful and efficient model of investigation, so it has gradually been adopted in all aspects of
psychosocial research, in the clinical area, and from the controlled experiment to the epidemiological model. In these settings,
the important assumptions of Factorial models often cannot be met, as the independent variables are often not randomly allocated treatments,
but characteristics of the natural environment, and the sample sizes available in subgroups are seldom the same.

The sample sizes in the groups can only be controlled to an extent. For example, the numbers of boys and girls born are
never exactly the same, and artificially creating equal numbers would require removing some cases arbitrarily, a
process which itself introduces bias.

The difference in birth weight between boys and girls amongst Germans may be different from that amongst Greeks (interaction).
Although interactions can be useful information, in clinical investigations they often represent an unwanted distraction, making
interpretation of the data difficult.

We cannot allocate sex at random to different groups, so the possibility of correlation arises. For example, the sex ratio may
differ between ethnic groups, so that the influences of ethnicity and sex cannot be separated (confounding).

When the assumptions of the Factorial model are violated, the results produced become misleading, and sometimes the numbers do not add up. When there is extensive correlation between the independent variables, the overlapping influences are counted repeatedly and thus inflate the single effects, so that the combined effect is less than the sum of the single effects, resulting in a conceptually unacceptable negative interaction.

The mathematics of multiple regression is able to resolve this difficulty, because it separates those influences (in terms of Sums of Squares) that are unique to each independent variable from those that overlap between the correlated variables. In short, it treats every factor both as an independent variable and as a covariate. In most modern statistical packages, therefore, the multiple regression algorithm is used for the calculations, even though the user interface retains the Analysis of Variance format.

Sex   Ethnicity  BWt
Boy   German     2813
Boy   German     2581
Boy   German     3172
Boy   German     3314
Boy   Greek      3555
Boy   Greek      3312
Boy   Greek      3643
Boy   French     3442
Boy   French     3185
Boy   French     3667
Girl  German     3029
Girl  German     3200
Girl  German     3212
Girl  Greek      3048
Girl  Greek      2706
Girl  Greek      3453
Girl  Greek      2670
Girl  Greek      3366
Girl  French     3622
Girl  French     2386
Girl  French     3596
Girl  French     3135

Factorial Model for Birth Weight
The data, as plotted, are shown in the diagram to the left, but for this analysis we will ignore gestational age, and only
examine how the two factors, sex and ethnic origin, affect birth weight. The data are as shown in the table to the right.

To allow multiple regression, the 3 groups of the ethnicity factor are converted into two binary variables, as follows

For Germans, ED1=1, ED2 = 0 (German and not Greek)

For Greeks, ED1=0, ED2 = 1; (Greek and not German)

For French, ED1=0, ED2=0; (Not German and not Greek)

To allow the estimation of interaction, two additional interaction variables are created

Interaction between ED1 and sex ED1S = ED1 * sex

Interaction between ED2 and sex ED2S = ED2 * sex.

sex  ED1  ED2  ED1S  ED2S  BWt
0    1    0    0     0     2813
0    1    0    0     0     2581
0    1    0    0     0     3172
0    1    0    0     0     3314
0    0    1    0     0     3555
0    0    1    0     0     3312
0    0    1    0     0     3643
0    0    0    0     0     3442
0    0    0    0     0     3185
0    0    0    0     0     3667
1    1    0    1     0     3029
1    1    0    1     0     3200
1    1    0    1     0     3212
1    0    1    0     1     3048
1    0    1    0     1     2706
1    0    1    0     1     3453
1    0    1    0     1     2670
1    0    1    0     1     3366
1    0    0    0     0     3622
1    0    0    0     0     2386
1    0    0    0     0     3596
1    0    0    0     0     3135

The data are then subjected to analysis using similar steps to those of the covariance analysis.

Step 1 : A two stage Analysis of Variance using the multiple regression algorithm, with and without the interaction
variables, is carried out. In these analyses, only the degrees of freedom and Sums of Squares for the model are of interest.

In the first calculation, the 5 independent variables Sex, ED1, ED2, ED1S, ED2S, and the outcome variable BWt
are used. The degrees of freedom = 5, and the Sums of Squares = 768244

The second calculation excludes the two interaction variables (ED1S and ED2S). The three independent variables Sex, ED1, ED2,
and the outcome variable BWt are analysed. The degrees of freedom = 3, and the Sums of Squares = 400694

The Table of Analysis of Variance can now be restructured accordingly, as shown in the table to the right.

The Probability of Type I Error for F=1.43, with 2 and 16 degrees of freedom, is α=0.27, not statistically significant.

                              df      SSq                     MSq                F
Inclusive of Interaction      5       768244
Exclusive of Interaction      3       400694
Attributable to Interaction   5-3=2   768244-400694=367550    367550/2=183775    183775/128561=1.43
Residual                      16      2056971                 2056971/16=128561
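The arithmetic in this table is the standard extra-Sums-of-Squares (incremental) F test. A minimal Python sketch using the figures above (the function name incremental_f is illustrative); when the numerator degrees of freedom is 2, the F tail probability also has a simple closed form, so the quoted α can be checked without tables:

```python
# Extra-Sums-of-Squares F test for the interaction terms, a sketch
# using the figures from the table above.
def incremental_f(ss_full, df_full, ss_reduced, df_reduced, ss_resid, df_resid):
    ms_gain = (ss_full - ss_reduced) / (df_full - df_reduced)   # 367550 / 2
    ms_resid = ss_resid / df_resid                              # 2056971 / 16
    return ms_gain / ms_resid

f = incremental_f(768244, 5, 400694, 3, 2056971, 16)
# For a numerator df of exactly 2, P(F > f) = (1 + 2f/df2) ** (-df2/2).
p = (1 + 2 * f / 16) ** -8
print(round(f, 2), round(p, 2))  # -> 1.43 0.27
```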

At this point, therefore, we can conclude that no significant interaction exists between sex and ethnicity. In other words, the difference in birth weight between boys and girls is similar across the ethnic groups.

Step 2 : The regression formula obtained without the interaction variables can now be used to interpret the data.

German babies are 270g lighter than French babies of the same sex (t = 1.37, α=0.19)

Greek babies are 61g lighter than French babies of the same sex (t = 0.32, α=0.75)

None of these differences are statistically significant.
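As a check, the no-interaction model can be refitted from the data table with NumPy's least squares. This is an illustrative sketch, not the StatTools output, but it reproduces the quoted partial regression coefficients of approximately -270g (German) and -61g (Greek) relative to French babies:

```python
# Refit BWt = a + b1*sex + b2*ED1 + b3*ED2 from the 22 rows of the table.
import numpy as np

rows = [  # (sex, ED1, ED2, BWt)
    (0, 1, 0, 2813), (0, 1, 0, 2581), (0, 1, 0, 3172), (0, 1, 0, 3314),
    (0, 0, 1, 3555), (0, 0, 1, 3312), (0, 0, 1, 3643),
    (0, 0, 0, 3442), (0, 0, 0, 3185), (0, 0, 0, 3667),
    (1, 1, 0, 3029), (1, 1, 0, 3200), (1, 1, 0, 3212),
    (1, 0, 1, 3048), (1, 0, 1, 2706), (1, 0, 1, 3453),
    (1, 0, 1, 2670), (1, 0, 1, 3366),
    (1, 0, 0, 3622), (1, 0, 0, 2386), (1, 0, 0, 3596), (1, 0, 0, 3135),
]
data = np.array(rows, dtype=float)
X = np.column_stack([np.ones(len(data)), data[:, :3]])  # constant, sex, ED1, ED2
y = data[:, 3]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta = [a, b_sex, b_ED1, b_ED2]; b_ED1 is the German-vs-French difference,
# b_ED2 the Greek-vs-French difference, each corrected for the other variables.
print(np.round(beta, 1))  # approximately [3395.1 -183.3 -270.7 -61.5]
```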

Sex   Ethnicity  Gest  BWt
Boy   German     36    2813
Boy   German     35    2581
Boy   German     37    3172
Boy   German     38    3314
Boy   Greek      39    3555
Boy   Greek      38    3312
Boy   Greek      40    3643
Boy   French     39    3442
Boy   French     37    3185
Boy   French     41    3667
Girl  German     37    3029
Girl  German     39    3200
Girl  German     38    3212
Girl  Greek      37    3048
Girl  Greek      36    2706
Girl  Greek      40    3453
Girl  Greek      36    2670
Girl  Greek      39    3366
Girl  French     41    3622
Girl  French     35    2386
Girl  French     41    3596
Girl  French     38    3135

All previous discussions of Factorial Analysis of Variance and of Covariance Analysis are special cases of the full
Factorial Covariance Model, which is discussed in this section. The data is as presented in the table to the left,
and plotted to the right.

The aim is to analyse the influence of two factors, sex and ethnicity, on the birth weight of a baby, corrected for a
single covariate, the gestational age in weeks. The algorithm to be used is multiple regression.

As the reasons for the various procedures have already been covered in previous sections, only the various stages of computation will be listed here.

Sex  ED1  ED2  ED1S  ED2S  Gest  ED1G  ED2G  ED1SG  ED2SG  BWt
0    1    0    0     0     36    36    0     0      0      2813
0    1    0    0     0     35    35    0     0      0      2581
0    1    0    0     0     37    37    0     0      0      3172
0    1    0    0     0     38    38    0     0      0      3314
0    0    1    0     0     39    0     39    0      0      3555
0    0    1    0     0     38    0     38    0      0      3312
0    0    1    0     0     40    0     40    0      0      3643
0    0    0    0     0     39    0     0     0      0      3442
0    0    0    0     0     37    0     0     0      0      3185
0    0    0    0     0     41    0     0     0      0      3667
1    1    0    1     0     37    37    0     37     0      3029
1    1    0    1     0     39    39    0     39     0      3200
1    1    0    1     0     38    38    0     38     0      3212
1    0    1    0     1     37    0     37    0      37     3048
1    0    1    0     1     36    0     36    0      36     2706
1    0    1    0     1     40    0     40    0      40     3453
1    0    1    0     1     36    0     36    0      36     2670
1    0    1    0     1     39    0     39    0      39     3366
1    0    0    0     0     41    0     0     0      0      3622
1    0    0    0     0     35    0     0     0      0      2386
1    0    0    0     0     41    0     0     0      0      3596
1    0    0    0     0     38    0     0     0      0      3135

Step 1. Preparation of the data

Sex : Boy=0, Girl=1

Creation of two dummy variables : ED1=0 for non-German and 1 for German, and ED2=0 for non-Greek and 1 for Greek

Creation of the interaction variables : ED1S = ED1 * Sex and ED2S = ED2 * Sex, and the gestation interaction variables ED1G = ED1 * Gest, ED2G = ED2 * Gest, ED1SG = ED1S * Gest, and ED2SG = ED2S * Gest, as shown in the data table above
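A sketch of this preparation in Python (the helper name add_gest_interactions is illustrative): each covariate-interaction column in the data table is simply the product of an interaction dummy and gestation.

```python
# Build the gestation-interaction columns from a record that already
# carries the dummy and sex-interaction variables plus Gest.
def add_gest_interactions(rec):
    rec = dict(rec)
    rec["ED1G"] = rec["ED1"] * rec["Gest"]
    rec["ED2G"] = rec["ED2"] * rec["Gest"]
    rec["ED1SG"] = rec["ED1S"] * rec["Gest"]
    rec["ED2SG"] = rec["ED2S"] * rec["Gest"]
    return rec

# A German girl with a 37-week gestation (row 11 of the table).
row = add_gest_interactions(
    {"Sex": 1, "ED1": 1, "ED2": 0, "ED1S": 1, "ED2S": 0, "Gest": 37})
print(row["ED1G"], row["ED1SG"])  # -> 37 37
```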

Step 2 : Evaluating interaction involving gestation

The first analysis includes all 10 independent variables (Sex, ED1, ED2, ED1S, ED2S, Gest, ED1G, ED2G, ED1SG, ED2SG)
and the dependent variable BWt. The model degrees of freedom is 10, and the Sums of Squares = 2729846.

The second analysis excludes the 4 gestation related interaction variables (ED1G, ED2G, ED1SG, ED2SG). Six independent variables,
Sex, ED1, ED2, ED1S, ED2S, Gest, and the dependent variable BWt are analysed. The model degrees of freedom is now 6,
and the Sums of Squares = 2682429.

The Analysis of Variance Table can now be constructed, as shown below and to the right. Probability of Type I Error for F=1.37,
with 4 and 11 Degrees of Freedom α = 0.31, not statistically significant.

                              df       SSq                      MSq              F
Inclusive of Interaction      10       2729845.964
Exclusive of Interaction      6        2682429
Attributable to Interaction   10-6=4   2729846-2682429=47417    47417/4=11854    11854/8670=1.37
Residual                      11       95369                    95369/11=8670

At this point, we can conclude that there is no significant interaction involving gestation. In other words, growth rates in all groups are similar.

Step 3 : Evaluating Interaction between sex and ethnicity

As with gestation, consideration of interaction between sex and ethnicity also involves two analyses.

The first analysis includes the 6 independent variables Sex, ED1, ED2, ED1S, ED2S, Gest, and the dependent variable BWt,
and these are subjected to analysis of variance using the multiple regression algorithm. The model degrees of freedom is 6,
and the Sums of Squares = 2682429.

The second analysis excludes the two interaction variables between sex and ethnicity (ED1S, ED2S). Four independent variables,
Sex, ED1, ED2, Gest, and the dependent variable BWt, are subjected to analysis of variance using the multiple
regression algorithm. The model degrees of freedom is 4, and the Sums of Squares = 2673221.

The Analysis of Variance Table can now be constructed, as shown below and to the right. Probability of Type I Error for F=0.48,
with 2 and 15 Degrees of Freedom α = 0.63, not statistically significant.

                              df      SSq                     MSq             F
Inclusive of Interaction      6       2682429
Exclusive of Interaction      4       2673221
Attributable to Interaction   6-4=2   2682429-2673221=9208    9208/2=4604     4604/9519=0.48
Residual                      15      142785                  142785/15=9519
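Both interaction F ratios (the gestation-related one, then the sex-by-ethnicity one) follow the same extra-Sums-of-Squares arithmetic shown in the tables; a brief Python check using the tabulated figures:

```python
# Check the two interaction F ratios from the ANOVA tables above.
def incremental_f(ss_full, df_full, ss_reduced, df_reduced, ss_resid, df_resid):
    """Mean square gained by the extra terms over the residual mean square."""
    return ((ss_full - ss_reduced) / (df_full - df_reduced)) / (ss_resid / df_resid)

f_gest = incremental_f(2729846, 10, 2682429, 6, 95369, 11)   # 4 and 11 df
f_eth = incremental_f(2682429, 6, 2673221, 4, 142785, 15)    # 2 and 15 df
print(round(f_gest, 2), round(f_eth, 2))  # -> 1.37 0.48
```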

At this point, we can conclude that there is no significant interaction between sex and ethnicity. In other words, once corrected for
gestation, the difference in birth weight between boys and girls is similar in all ethnic groups.

Step 5 : Final Analysis
The regression formula from the last analysis, free of any interaction terms, can now be interpreted.
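The original text breaks off before quoting the final coefficients, but the interaction-free model can be fitted directly from the data table. The NumPy sketch below is illustrative; its estimates are not quoted in the original, so no particular values are asserted here beyond the expected positive gestation effect:

```python
# Fit the final interaction-free model:
# BWt = a + b1*Sex + b2*ED1 + b3*ED2 + b4*Gest
import numpy as np

rows = [  # (Sex, ED1, ED2, Gest, BWt) from the data table
    (0, 1, 0, 36, 2813), (0, 1, 0, 35, 2581), (0, 1, 0, 37, 3172), (0, 1, 0, 38, 3314),
    (0, 0, 1, 39, 3555), (0, 0, 1, 38, 3312), (0, 0, 1, 40, 3643),
    (0, 0, 0, 39, 3442), (0, 0, 0, 37, 3185), (0, 0, 0, 41, 3667),
    (1, 1, 0, 37, 3029), (1, 1, 0, 39, 3200), (1, 1, 0, 38, 3212),
    (1, 0, 1, 37, 3048), (1, 0, 1, 36, 2706), (1, 0, 1, 40, 3453),
    (1, 0, 1, 36, 2670), (1, 0, 1, 39, 3366),
    (1, 0, 0, 41, 3622), (1, 0, 0, 35, 2386), (1, 0, 0, 41, 3596), (1, 0, 0, 38, 3135),
]
data = np.array(rows, dtype=float)
X = np.column_stack([np.ones(len(data)), data[:, :4]])  # constant + 4 predictors
y = data[:, 4]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 1))  # [a, b_Sex, b_ED1, b_ED2, b_Gest]
```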

This is a very detailed textbook dealing with multiple regression and the many ways it can be used, and a very useful reference book.
It is included here, however, because it provides an excellent discussion of dummy variables.