Comparing More Than Two Means: One-Way ANOVA

When you have several means to compare, it’s
not valid just to compare all possible pairs with t tests.
Instead, you follow a two-stage process:

1. Are all the means equal? A computation called ANOVA
(analysis of variance) answers this question.

2. If ANOVA shows that the means aren’t all equal, then
which means are unequal, and by how much? There are many ways
to answer this question (and they give different answers), but
we’ll use a process called Tukey’s HSD (Honestly
Significant Difference).

Terminology

The characteristic that varies between samples is called the
factor. (Every once in a while things are easy.) The r different
values or levels of the factor are called the
treatments. In Example 1 below, the factor is the choice of fat and the
treatments are the four fats, so r = 4.

The computations to test the means for equality are called a
1-way ANOVA or 1-factor ANOVA.

Example 1: Fat for Frying Donuts

Hoping to produce a donut that could be marketed to
health-conscious consumers, a company tried four different fats to see
which one was least absorbed by the donuts during the deep frying
process. Each fat was used for six batches of two dozen donuts each,
and this table shows the grams of fat absorbed by each batch of
donuts:

  Fat 1:  64  72  68  77  56  95    (mean 72)
  Fat 2:  78  91  97  82  85  77    (mean 85)
  Fat 3:  75  93  78  71  63  76    (mean 76)
  Fat 4:  55  66  49  64  70  68    (mean 62)

It looks like donuts absorb the most of Fat 2 and the
least of Fat 4, with intermediate amounts of Fat 1 and
Fat 3. But there’s a lot of overlap, too: for instance,
even though the mean for Fat 2 is much higher than for
Fat 1, one sample of Fat 1, 95 g, is higher than five of
the six samples of Fat 2.

Nevertheless, the sample means do look different. But what
about the population means? In other words, would the four fats be
absorbed in different amounts if you made a whole lot of batches of
donuts — do the statistics justify choosing one fat over
another? This is the basic question of a hypothesis test or
significance test:
is the difference great enough that you can rule out chance?

If Fats 2 and 4 were the only ones you had data for, you’d
do a good old 2-sample t test. So why can’t you
do that anyway? Because that would greatly increase your chances of a
Type I error. The reasons
are given in the Appendix.

By the way, though usually you are interested in the
differences between population means with various treatments, you can
also estimate the individual means. If you’re interested, see
Estimating Individual Treatment Means
in the Appendix.

Step 1: ANOVA Test for Equality of All Means

The ANOVA procedure tests these hypotheses:

H0: μ1 = μ2 = ... = μr,
all the means are the same

H1: two or more means are different from the others

Let’s test these hypotheses at the
α = 0.05 significance level.

You might wonder why you do analysis of variance to
test means, but this actually makes sense. The question,
remember, is
whether the observed difference in means is too large to be the result of random selection.
How do you decide
whether the difference is too large? You look at the absolute
difference of means between treatments (samples), but you also
consider the variability within each treatment. Intuitively,
if the difference between treatments is a lot bigger than the difference within treatments, you conclude that it’s not due to random chance
and there is a real effect.

And this is just how ANOVA works: comparing the variation
between groups to the variation within groups. Hence, analysis of
variance.

Requirements for ANOVA

You need r simple random samples for the r
treatments, and they need to be independent samples. The
sample sizes need not be the same, though it’s best if
they’re not very different.

The underlying populations should be normally distributed.
However, the ANOVA test is robust and moderate departures from normality
aren’t a problem, especially if sample sizes are large and equal
or nearly equal (Kuzma & Bohnenblust 2005 [full citation at https://BrownMath.com/swt/sources.htm#so_Kuzma2005] page 180).

The samples should all have the same standard deviation,
theoretically. Because the ANOVA test is robust,
Sullivan 2011 [full citation at https://BrownMath.com/swt/sources.htm#so_Sullivan2011] page C–21 (on CD) says
it’s good enough if
the largest standard deviation is less than double the smallest standard deviation.

Miller 1986 [full citation in “References”, below] (pages
90–91) is more
cautious. When sample sizes are equal but standard deviations are
not, the actual p-value will be slightly larger than what you find in
the tables. But when sample sizes are unequal, and the smaller
samples have the larger standard deviations, the actual p-value
“can increase dramatically above” what the tables say, even
“without too much disparity” in the standard deviations.
“Falsely reporting significant results when the small samples
have the larger variances is a serious worry. The lesson to be learned
is to
balance the experiment [equal sample sizes] if at all possible.”

Perform a 1-Way ANOVA Test

A 1-way ANOVA tests whether the means of all groups are equal
for different levels of one factor, using some fairly lengthy
calculations.
You could do all the computations
by hand as shown in the Appendix, but no one ever does. Here are
some alternatives:

Excel’s Anova: Single Factor
command is in the
Tools » Data Analysis menu in
Excel 2003 and below,
or the Data » Analysis »
Data Analysis menu in
Excel 2007. If you don’t see it there, follow instructions in
Excel help to load the Analysis Toolpak.

On a TI-83 or TI-84,
enter each sample in a statistics list, then press
[STAT] [◄] [▲] to select ANOVA,
and enter the list names separated by commas.

There are even
Web-based ANOVA calculators, such as
Lowry 2001b [full citation in “References”, below].

There are many software packages for mathematics and statistics
that include ANOVA calculations. One of them,
R, is highly regarded and is
open source.
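For instance, here is a minimal sketch in Python using SciPy’s
f_oneway function, applied to the donut data of Example 1. Any of the
tools above gives the same numbers; this is just one way to get them:

  # One-way ANOVA on the donut data from Example 1.
  from scipy import stats

  fat1 = [64, 72, 68, 77, 56, 95]
  fat2 = [78, 91, 97, 82, 85, 77]
  fat3 = [75, 93, 78, 71, 63, 76]
  fat4 = [55, 66, 49, 64, 70, 68]

  F, p = stats.f_oneway(fat1, fat2, fat3, fat4)
  print(f"F = {F:.2f}, p = {p:.4f}")   # F = 5.41, p = 0.0069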

When you use a calculator or computer program to do ANOVA, you
get an ANOVA table that looks something like this:

                                  SS      df      MS       F        p
  Between groups (or “Factor”)  1636.5     3    545.4    5.41    0.0069
  Within groups (or “Error”)    2018.0    20    100.9
  Total                         3654.5    23

Note that the mean square between treatments,
545.4, is much larger than the mean square within treatments, 100.9.
That ratio, between-groups mean square over within-groups mean square,
is called an F statistic (F = MSB/MSW = 5.41 in this
example).
It tells you how much more variability there is between
treatment groups than within treatment groups. The larger that ratio,
the more confident you feel in rejecting the null
hypothesis, which was that all means are equal and there is no
treatment effect.

But what you care about is the p-value of 0.0069, obtained
from the F distribution. The p-value has the usual interpretation: the
probability of the between-treatments MS being ≥5.41 times the
within-treatments MS, if the null
hypothesis is true, is p = 0.0069.

The p-value is below
your significance level of 0.05: it would be quite unlikely to have
MSB/MSW this large if there were no real difference among the
means. Therefore you
reject H0 and accept H1, concluding that
the mean absorptions of the fats are not all the same.

Now that you know that it does make a difference which fat is
used, you naturally want to know which fats are significantly
different. This is post-hoc analysis.
There are several different post-hoc analyses, and none is superior
on all points, but the most common choice is the Tukey HSD.

Step 2: Tukey HSD for Post-Hoc Analysis

If your ANOVA test shows that the means
aren’t all equal, your next step is to determine which
means are different, to your level of significance.
You can’t
just perform a series of t tests, because that would greatly
increase your likelihood of a Type I
error. So what do you do?

John Tukey gave one answer to this question, the HSD (Honestly
Significant Difference) test. You compute something analogous to a
t score for each pair of means, but you don’t compare it to
the Student’s t distribution. Instead, you use a new
distribution called the studentized range or
q distribution.

Caution:
Perform post-hoc analysis only if the
ANOVA test shows a p-value less than your α. If
p>α, you don’t know whether the means are all the
same or not, and you can’t go fishing for unequal means.

You generally want to know not just
which means differ, but by how much they differ (the
effect size).
The easiest thing is to compute the confidence interval first, and
then interpret it for a significant difference in means (or no
significant difference). You’ve already seen this relationship
between a test of significance at the α level and a
1−α confidence interval:

If the endpoints of the CI have the same sign (both
positive or both negative), then 0 is not in the interval
and you can conclude that the means are different.

If the endpoints of the CI have opposite signs, then
0 is in the interval and
you can’t determine whether the means are equal or different.

You compute that confidence interval similarly to the
confidence interval for the difference of two means, but using the q
distribution, which avoids the problem of inflating
α:

  (x̅i − x̅j) ± q(α, r, dfW) · √[(MSW/2)·(1/ni + 1/nj)]
where x̅i and x̅j are the two
sample means, ni and nj are the two sample
sizes, MSW is the within-groups mean square from the
ANOVA table, and q is the
critical value of the studentized range for α, the
number of treatments or samples
r, and the within-groups degrees of freedom dfW.
The square-root term is called the standardized error (as
opposed to standard error).

Using the studentized range, developed by Tukey, overcomes the
problem of inflated significance level that I
talked about earlier. If sample sizes are equal, the risk of a
Type I error is exactly α, and if sample sizes are
unequal it’s less than α: the procedure is
conservative.
In terms of confidence intervals, if
the sample sizes are equal then the confidence level is the stated
1−α, but if the sample sizes are unequal then the
actual confidence level is greater than 1−α
(NIST 2012 [full citation in “References”, below] section 7.4.7.1).

Here is the post-hoc table for the donut experiment:

                    x̅i−x̅j   Critical q    Standardized   95% Conf Interval   Signif
                             q(α,r,dfW)   error          for μi−μj           at 0.05?
  Fat 1 − Fat 2      −13       3.9597       4.1008        −29.2 to 3.2        No
  Fat 1 − Fat 3       −4       3.9597       4.1008        −20.2 to 12.2       No
  Fat 1 − Fat 4       10       3.9597       4.1008         −6.2 to 26.2       No
  Fat 2 − Fat 3        9       3.9597       4.1008         −7.2 to 25.2       No
  Fat 2 − Fat 4       23       3.9597       4.1008          6.8 to 39.2       Yes
  Fat 3 − Fat 4       14       3.9597       4.1008         −2.2 to 30.2       No

How do you read the table, and how was it constructed?
Look first at the rows.
Each row compares one pair of treatments.

If you have r treatments, there will be
r(r−1)/2 pairs of means. The “/2”
part comes because there’s no need to compare Fat 1 to
Fat 2 and then Fat 2 to Fat 1. If Fat 1 is
absorbed less than Fat 2, then Fat 2 is absorbed more than
Fat 1 and by the same amount.

Now look at the columns. I’ll work through all the
columns of the first row with you, and you can interpret the others in
the same way.

The row heading tells you
which treatments are being compared in this row,
and the direction of comparison.

The next column gives the point estimate of difference,
which is nothing more than the difference of the two sample means.
The sample means of Fat 1 and Fat 2 were 72 and 85, so the
difference is −13: the sample average of Fat 1 was
13 g less fat absorbed than the sample average of Fat 2.

Next is critical q, from the confidence
interval formula. q(α,r,dfW) depends on the number of
treatments and total number of data points, not on the individual
treatments, so it’s the same for all rows in any given
experiment.

For this experiment, we had four treatments and dfW from
the ANOVA table was 20, so we need
q(0.05, 4, 20). Your textbook may have a table of critical
values for the studentized range, or you can look up q in an online
table such as the one at the end of
Abdi and Williams 2010 [full citation in “References”, below], or find it with an online calculator
like Lowry 2001a [full citation in “References”, below].
(Most textbooks don’t have a table
of q, though, and the TI calculators can’t compute it.)

Different sources give slightly different critical
values of q, I suspect because q is extremely difficult to
compute. One value I found was q(0.05,4,20) = 3.9597.

In an experiment with unequal sample sizes, the
standardized error would vary for comparing different pairs of
treatments.
But in this experiment, every treatment has six data points, and so the
standardized error is the same for every pair of means:

√[(MSW/2)·(1/6+1/6)] =
√[(100.9/2)·(2/6)] = 4.1008

The endpoints of the confidence interval, as usual, are
the point estimate plus or minus the critical q times the standardized
error. Critical q times the standardized error is
3.9597×4.1008 = 16.2, and the difference of means in the
first row is x̅1−x̅2 = −13, so the
endpoints of the confidence interval are
−13−16.2 = −29.2 and
−13+16.2 = 3.2.
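If you’d rather script this step, SciPy (version 1.7 or later)
includes the studentized range distribution. Here’s a minimal sketch
that reproduces the Fat 1 − Fat 2 interval:

  # Tukey HSD confidence interval for Fat 1 minus Fat 2.
  from math import sqrt
  from scipy.stats import studentized_range

  alpha, r, dfW, MSW = 0.05, 4, 20, 100.9
  q_crit = studentized_range.ppf(1 - alpha, r, dfW)   # about 3.96
  se = sqrt((MSW / 2) * (1/6 + 1/6))                  # 4.1008
  diff = 72 - 85                                      # difference of sample means
  print(diff - q_crit * se, diff + q_crit * se)       # about -29.2 and 3.2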

Interpretation: You’re 95% confident that, on
average, a batch of 24 donuts absorbs between 29.2 g less and
3.2 g more of Fat 1 than Fat 2.

The confidence interval for the difference between Fat 1
and Fat 2 goes from a negative to a positive, so it does include
zero. That means the two fats might have the same or different
absorption, so you can’t say whether there’s a
difference.

Caution: It’s generally best not to say that there is no
significant difference. Even though that’s literally true,
it’s easily misinterpreted to mean that the absorption of the
two fats is the same, and you don’t know that. It might be, and
it might not be. Stick to neutral language.

On the other hand, when the endpoints of the confidence
interval are both positive or both negative, then 0 is not in the
interval and we reject the null hypothesis of equality. In this
table, only Fats 2 and 4 have a significant difference.

Interpretation: Fats 2 and 4 are not equally
absorbed in frying donuts, and we’re 95% confident that a batch
of 24 donuts absorbs 6.8 g to 39.2 g more of Fat 2 than
Fat 4.

Other Comparisons

It’s possible to make more complicated comparisons. For
instance, with a control group and two treatments you might compare the
mean of the control group to the average of the means of the two
treatments. Any kind of linear comparison can be done using a
procedure developed by Henry Scheffé. A good brief
explanation of Scheffé’s method is at NIST 2012 [full citation in “References”, below]
section 7.4.7.2.

Tukey’s method is best when you are simultaneously
comparing all pairs of means. If you have pre-selected a subset of
means to compare, the Bonferroni method (NIST 2012 [full citation in “References”, below] section
7.4.7.3) may be better.

Example 2: Rates of Return

A stock analyst randomly selected eight stocks in each of
three industries and compiled the five-year rate of return for each
stock. The analyst would like to know whether any of the industries
have a different rate of return from the others, at the 0.05
significance level.

Solution: The hypotheses are

H0: μF = μE = μU, all three industries have
the same average rate of return

H1: the industries don’t all have the same average
rate of return

You can use a normal probability plot to assess normality for
each sample; see MATH200A Program part 4. The
standard deviations of the three samples are fairly close together, so
the requirements are met.

Here is the ANOVA table:

                                  SS        df      MS        F        p
  Between groups (or “Factor”)   97.5931     2    48.7965    2.08    0.1502
  Within groups (or “Error”)    493.2577    21    23.4885
  Total                         590.8508    23

The F statistic is only 2.08, so the variation between groups
is only about double the variation within groups. The high p-value
makes you fail to reject H0, and you
cannot reach a conclusion about differences between the
average rates of return for the three industries.

Since you failed to reject H0 in the initial ANOVA test,
you can’t do any sort of post-hoc analysis and look for
differences between any particular pairs of means.
(Well, you can, but you know in advance
that all of the intervals will include zero, meaning that you
don’t know whether any particular sector has a different return
from any other sector or not.)

Example 3: CRT Lifetimes

A company makes three types of high-performance CRTs. A random
sample finds lifetimes shown in the table at right. At the 0.05
level, is there a difference in the average lifetimes of the three
types?

Solution:
Your hypotheses are

H0: μA = μB =
μC, the three types have equal mean lifetime

H1: the three types don’t all have the same mean
lifetime

Excel or the TI-83/84 gives you this ANOVA table:

                                  SS    df    MS     F       p
  Between groups (or “Factor”)    36     2    18    4.50   0.0442
  Within groups (or “Error”)      36     9     4
  Total                           72    11

p<α, so you reject H0 and accept H1,
concluding that the three types don’t all have the same mean
lifetime.

Since you were able to reject the null hypothesis, you can
proceed with post-hoc analysis to determine which
means are different and the size of the difference. Here is the
table:

                    x̅i−x̅j   Critical q    Standardized   95% Conf Interval   Signif
                             q(α,r,dfW)   error          for μi−μj           at 0.05?
  Type A − Type B      4       3.9508       1.0328        −0.1 to 8.1         No
  Type A − Type C      1       3.9508       1.0801        −3.3 to 5.3         No
  Type B − Type C     −3       3.9508       0.9487        −6.7 to 0.7         No
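You can check these intervals by machine from the differences and
standardized errors in the table. Here’s a sketch, again assuming
SciPy 1.7 or later for the studentized range:

  # Recompute the Example 3 Tukey intervals from the table's numbers.
  from scipy.stats import studentized_range

  q_crit = studentized_range.ppf(0.95, 3, 9)   # about 3.95
  pairs = [("A-B", 4, 1.0328), ("A-C", 1, 1.0801), ("B-C", -3, 0.9487)]
  for name, diff, se in pairs:
      print(name, round(diff - q_crit * se, 1), round(diff + q_crit * se, 1))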

This result might surprise you: although the three means aren’t
all equal,
you can’t say that any two of the means are unequal.
But when you look more closely at the numbers, this doesn’t seem
quite so unreasonable.

First, look at the p-value in the ANOVA table: 0.0442 is below
0.05, yes, but it’s not very far below, so the decision to
reject H0 is a close one. Next, look at the confidence interval
for μA−μB.
While the interval
does include 0, it’s extremely lopsided and
almost doesn’t include 0.

Though we’re used to thinking of significance as
“either it is or it isn’t”, there are cases where the
decision is a close one, and this is one of those cases. And the
confidence intervals are computed by a different method than the
significance test, using a different distribution. Here again, the
decision is a close one. So what we have is two close decisions,
based on different computations, one falling slightly on one side of
the line and the other falling slightly on the other side of
the line. It’s a good reminder that
in statistics we’re dealing with probabilities, not certainties.

Appendix (The Hard Stuff)

The following sections are for students who want to know more
than just the bare bones of how to do a 1-way ANOVA test.

Why Not Just Pick Two Means and Do a t Test?

Remember that
you have to set up your hypotheses before you know the data.
Before you’ve actually fried the donuts, you have no
reason to expect any particular outcome. Specifically, until you
have the data you have no reason to think Fats 2 and 4 are any
more different than Fats 1 and 4, or any other pair.

Why can’t you collect the data and then select your hypotheses?
Because that can put significance on a chance event. For
example, a golfer hits a ball and it lands on a particular tuft of
grass. The probability of landing on that particular tuft is extremely
small, so there’s something different about that particular
tuft, right? Obviously not! It’s a logical fallacy to
decide what to test after you already have the data.

So if you want to use 2-sample t tests to look for differences
among four fats, you would have to
test every pair of fats: 1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4,
3 and 4. That’s six hypotheses in all.

Well,
why not do a 0.05 significance test on each pair of means?
Remember what a 0.05 significance level means: you’re willing to
accept a 5% chance of a Type I error, rejecting H0 when
it’s actually true.
But if you test six 0.05 hypotheses on the same set of data,
you’re much more likely to commit a Type I error.
How much more likely?
Well, for each hypothesis there’s a 95% chance of escaping a
Type I error, but the probability of escaping a Type I error
six times in a row is 0.95⁶ = 0.7351.
1−0.7351 = 0.2649, so
if you test all six pairs at the 0.05 level, you’re more likely
than one chance in four to get a false positive, finding
a difference between two fats when there’s actually no
difference.

  Prob. of Type I Error
  groups   pairs   α = 0.05   α = 0.01
     3        3     0.1426     0.0297
     4        6     0.2649     0.0585
     5       10     0.4013     0.0956
     6       15     0.5367     0.1399

In general, if you have r treatments, there are
r(r−1)/2 pairs of means to compare. If you test
each pair at significance level α, the overall probability
of a Type I error is
1 − (1−α)^[r(r−1)/2].
The table above shows the effective α for various numbers of
treatments when the nominal α is 0.05 or 0.01.
You can see that
testing multiple hypotheses increases your α dramatically.
Even with just three
treatments, the effective α is almost three times the nominal
α. This is clearly unacceptable.
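The table is just that formula evaluated repeatedly; a short Python
loop reproduces it:

  # Effective Type I error rate for all-pairs testing at nominal alpha.
  for r in (3, 4, 5, 6):
      pairs = r * (r - 1) // 2
      print(r, pairs, round(1 - 0.95**pairs, 4), round(1 - 0.99**pairs, 4))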

Why not just lower your alpha?
Because as you lower your α you increase your β, the
chance of a Type II error. β represents the probability of
a false negative, failing to find a difference in fats when
there actually is a difference. This, too, is unacceptable.

So you have to find a way to test all the pairs of means at
the same time, in one test. The solution is an extension of the
t test to multiple samples, and it’s called ANOVA.
(If you have only two treatments, ANOVA
computes the same p-value as a two-sample t test, but at the cost
of extra effort.)

How ANOVA Works

How does the ANOVA procedure compute a p-value?
This section shows you the formulas and carries through the
computations for the example with fat for frying
donuts.

Remember, long ago in a galaxy called Descriptive Statistics,
how the variance was defined: find the mean, then for each data point
take the square of its difference from the mean. Add up all those
squares, and you have SS(x), the sum of squared deviations in x.
The variance was SS(x) divided by the degrees of freedom
n−1, so it was a kind of
average or mean squared deviation. You probably learned the shortcut
computational formulas:

SS(x) = ∑x² − (∑x)²/n or
SS(x) = ∑x² − nx̅²

and then

s² = MS(x) = SS(x)/df where df = n−1

In 1-way ANOVA, we extend those concepts a bit.
First you partition SS(x) into between-treatments and within-treatments
parts, SSB and SSW. Then you
compute the mean square deviations:

MSB is called the between-treatments mean square,
between-groups variance, or factor MS. It measures the
variability associated with the different treatment levels or
different values of the factor.

MSW is called the within-treatments mean square, within-group
variance, pooled variance, or
error MS. It measures the variability that is not
associated with the different treatments.

Finally you divide the two to obtain your test
statistic, F = MSB/MSW, and you
look up the p-value in a table of the F distribution.

(The F distribution is named after “the celebrated R.A.
Fisher”
(Kuzma & Bohnenblust 2005 [full citation at https://BrownMath.com/swt/sources.htm#so_Kuzma2005], 176).
You may have already seen the F distribution in computing a different
ratio of variances, as part of testing the variances of two
populations for equality.)

There are several ways to compute the variability, but they
all come up with the same answers and this method in
Spiegel and Stephens 1999 [full citation in “References”, below]
pages 367–368 is as easy as any:

                                  SS                     df            MS              F
  Between groups (or “Factor”)   SSB = ∑njx̅j² − Nx̅²   dfB = r−1     MSB = SSB/dfB   F = MSB/MSW
  Within groups (or “Error”)*    SSW = SStot − SSB      dfW = N−r     MSW = SSW/dfW
  Total*                         SStot = ∑x² − Nx̅²     dftot = N−1

  * Or, if you know the standard deviations of the samples:
    SSW = ∑(nj−1)sj² and SStot = SSB + SSW

where

r is the number of treatments.

nj, x̅j, sj for each treatment are the
sample size, sample mean, and sample standard deviation.

N is the total sample size and x̅ = ∑x/N is
the overall sample mean or “grand mean”.
x̅ can also be computed from the sample means by

x̅ = ∑njx̅j/N

You begin with the treatment means
x̅j={72, 85, 76, 62}
and the overall mean x̅=73.75, then compute

SSB =
(6×72²+6×85²+6×76²+6×62²)
− 24×73.75² = 1636.5

MSB = 1636.5 / 3 = 545.4

The next step depends on whether you know the standard
deviations sj of the samples. If you don’t, then you
jump to the third row of the table to compute the overall sum of
squares:

∑x² = 64² + 72² + 68² + ... +
70² + 68² = 134192

SStot = ∑x² − Nx̅² = 134192
− 24×73.75² = 3654.5

Then you find
SSW by subtracting the “between” sum of squares
SSB from the overall sum of squares SStot:

SSW = SStot−SSB = 3654.5−1636.5 = 2018.0

MSW = 2018.0 / 20 = 100.9

Now you’re almost there. You want to know whether the
variability between treatments, MSB, is greater than the
variability within treatments, MSW. If it’s enough greater,
then you conclude that there is a real difference between at least some
of the treatment means and therefore that the factor has a real
effect. To determine this, divide

F = MSB/MSW = 5.41

This is the F statistic. The F distribution is a one-tailed distribution
that depends on both degrees of freedom, dfB and dfW.

At long last, you look up F=5.41 with 3 and 20 degrees of
freedom, and you find a p-value of 0.0069.
The interpretation is the usual one:
there’s only a 0.0069 chance of getting an F statistic greater
than 5.41 (or higher variability between treatments relative to the
variability within treatments) if there is actually no difference
between treatments.
Since the p-value is less than α, you conclude that there
is a difference.
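Here is the whole computation as a Python sketch, following the
formulas above step by step; SciPy’s F distribution supplies the
p-value at the end:

  # 1-way ANOVA "by hand": partition the sums of squares, then F and p.
  from scipy.stats import f as f_dist

  samples = [[64, 72, 68, 77, 56, 95], [78, 91, 97, 82, 85, 77],
             [75, 93, 78, 71, 63, 76], [55, 66, 49, 64, 70, 68]]
  r = len(samples)
  N = sum(len(s) for s in samples)
  grand = sum(sum(s) for s in samples) / N                   # 73.75

  SSB = sum(len(s) * (sum(s) / len(s))**2 for s in samples) - N * grand**2
  SStot = sum(x * x for s in samples for x in s) - N * grand**2
  SSW = SStot - SSB                                          # 2018.0
  dfB, dfW = r - 1, N - r                                    # 3 and 20
  F = (SSB / dfB) / (SSW / dfW)                              # about 5.41
  print(F, f_dist.sf(F, dfB, dfW))                           # p = 0.0069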

Estimating Individual Treatment Means

Usually you’re interested in the contrast between two
treatments, but you can also estimate the population mean for an
individual treatment. You do use a t interval, as you would when
you have only one sample, but the standard error and degrees of
freedom are different (NIST 2012 [full citation in “References”, below] section 7.4.3.6).

To compute a confidence interval on an individual mean for the
jth treatment, use

df = dfW

standard error = √(MSW/nj)

Therefore the margin of error, which is the half-width of
the confidence interval, is

E = t(α/2,dfW) ·
√(MSW/nj)

Example: Refer back to the fats for
frying donuts. Estimate the population mean for Fat 2 with
95% confidence. In other words, if you fried a great many batches of
donuts in Fat 2, how much fat per batch would be absorbed, on
average?

Computation by Hand

Begin by finding the critical t. Since
1−α = 0.95, α/2 = 0.025.
You therefore need t(0.025,20). You can find this from a table:

t(0.025,20) = 2.0860

Next, find the standard error. This is

standard error = √(MSW/nj) =
√(100.9/6) = 4.1008

Now you’re ready to finish the confidence interval. The
margin of error is

E = t(α/2,df) ·
√(MSW/nj) =
2.0860×4.1008 = 8.5541

Therefore the confidence interval is

μ2 = 85 ± 8.6 g (95% confidence)

or

76.4 g ≤ μ2 ≤ 93.6 g (95% confidence)

Conclusion: You’re 95% confident
that the true mean amount of fat absorbed by a batch of donuts fried
in Fat 2 is between 76.4 g and 93.6 g.
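The same computation in Python, as a quick check of the hand work:

  # 95% CI for the Fat 2 treatment mean, using MSW and dfW from the ANOVA.
  from math import sqrt
  from scipy.stats import t

  MSW, dfW, n_j, xbar = 100.9, 20, 6, 85
  t_crit = t.ppf(1 - 0.025, dfW)       # 2.0860
  E = t_crit * sqrt(MSW / n_j)         # about 8.55
  print(xbar - E, xbar + E)            # about 76.4 and 93.6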

TI-83/84 Procedure

Your TI calculator is set up to do the necessary calculations,
but there’s one glitch because the degrees of freedom is
not based on the size of the individual sample, as it is in a
regular t interval. So you have to “spoof” the
calculator as follows.

Press [STAT] [◄] [8] to bring up the TInterval
screen. First I’ll tell you what to enter; then I’ll
explain why.

x̅: mean of the treatment sample, here 85

Sx: √(MSW*(dfW+1)/nj), here
√(100.9*21/6)

n: dfW+1, here 21

C-Level: as specified in the problem, here .95

Now, what’s up with n and Sx? Well, the calculator uses n
to compute degrees of freedom for critical t as n−1.
You want degrees of freedom to be dfW, so you lie to the
calculator and enter the value of n as dfW+1
(20+1 = 21).

But that creates a new problem. The calculator also divides s
by √n to come up with the standard error. But you want it to
use nj (6) and not your fake n (21). So
you have to multiply MSW by dfW+1 and divide by nj to
trick the calculator into using the value you actually want.

By the way, why is MSW inside the square root sign?
Because the calculator wants a standard deviation, but MSW is a
variance. As you know, standard deviation is the square root of
variance.

All this fakery achieves the desired result: the
confidence interval matches the one that you would have if you
computed it by hand.

Conclusion: You’re 95% confident
that the true mean amount of fat absorbed by a batch of donuts fried
in Fat 2 is between 76.4 g and 93.6 g.

η²: Strength of Association

Lowry 1988 [full citation in “References”, below] chapter 14 part 2 mentions a measure that is
usually neglected in ANOVA: η². (η is the Greek letter
eta, which rhymes with beta.)

η² = SSB/SStot, the ratio of sum of
squares between groups to total sum of squares. For the
donut-frying example,

η² = SSB/SStot = 1636.5 / 3654.5 = 0.45

What does this tell you? η² measures how much of
the total variability in the dependent variable is associated with the
variation in treatments. For the donut example,
η² = 0.45 tells you that 45% of the variability
in fat absorption among the batches is associated with the choice of
fat.