PROPHET StatGuide: Do your data violate one-way ANOVA assumptions?

If the populations from which the data were sampled
violate one or more of the assumptions of the one-way
analysis of variance (ANOVA), the results of the analysis may be
incorrect or misleading. For example, if the assumption of
independence
is violated, then the one-way ANOVA is simply
not appropriate, although another test (perhaps a
blocked one-way ANOVA)
may be appropriate. If the assumption
of normality is violated,
or outliers are present,
then the one-way ANOVA may not be the most
powerful
test available, and this could mean the difference
between detecting a true difference among the population means
and failing to detect one.
A nonparametric test
or a transformation of the data may
result in a more powerful test.
A potentially more damaging assumption violation occurs
when the population variances are unequal,
especially if the
sample sizes are not approximately equal
(unbalanced).
Often, the effect of an assumption violation on the one-way ANOVA result depends
on the extent of the violation (such as how
unequal the population variances are, or how
heavy-tailed
one or another population
distribution
is).
Some small violations may have little practical effect
on the analysis, while other violations may render
the one-way ANOVA result uselessly incorrect or uninterpretable.
In particular, small
or unbalanced
sample sizes can increase vulnerability to assumption violations.

A lack of independence
within a sample is often caused by
the existence of an implicit factor in the data. For example,
values collected over time may be serially
correlated
(here time is the implicit factor). If the data are in a
particular order, consider the possibility of dependence.
(If the row order of the data reflects the order in which
the data were collected, an
index plot of the data [data
value plotted against row number] can reveal patterns in
the plot that could suggest possible time effects.)
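As a numeric companion to the index plot, the lag-1 autocorrelation of the values in collection order offers a quick check for serial correlation. The sketch below (not part of Prophet; the data are hypothetical) uses only NumPy:

```python
import numpy as np

def lag1_autocorrelation(values):
    """Lag-1 autocorrelation of a sequence, taken in collection order.

    Values near 0 are consistent with independence; values far from 0
    suggest serial correlation (an implicit time factor).
    """
    x = np.asarray(values, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[:-1] * x[1:]) / np.sum(x * x))

# Hypothetical sample recorded in collection order: a steady upward
# drift over time produces a strong positive lag-1 autocorrelation.
drifting = [1.0, 1.5, 2.1, 2.4, 3.0, 3.6, 4.1, 4.8, 5.2, 5.9]
print(round(lag1_autocorrelation(drifting), 2))
```

A value well away from zero, as here, would prompt the same follow-up as a patterned index plot: consider a time effect before running the one-way ANOVA.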

Whether the samples are
independent
of each other is generally
determined by the structure of the experiment from which
they arise. Obviously correlated samples, such as a
set of observations over time on the same subjects,
are not independent, and such data would be more appropriately
tested by a one-way blocked ANOVA or a repeated
measures ANOVA. If you are unsure whether
your samples are independent, you may wish to consult
a statistician or someone who is knowledgeable
about the data collection scheme you are using.

Values may not be identically distributed because of the
presence of outliers.
Outliers are anomalous values in the
data. Outliers tend to increase the estimate of sample
variance, thus decreasing the calculated F statistic
for the ANOVA
and lowering the chance of rejecting the
null hypothesis.
They may be due to recording errors, which may be
correctable, or they may be due to the sample not being
entirely from the same population. Apparent outliers
may also be due to the values being from the same, but
nonnormal,
population.
The boxplot
and normal probability plot
(normal Q-Q plot) may suggest the presence of outliers in the data.

The F statistic is based on
the sample means and the sample variances, each of which
is sensitive to outliers.
(In other words, neither the
sample mean nor the sample variance is
resistant
to outliers, and thus, neither is the F statistic.)
In particular, a large outlier can inflate the overall
variance, decreasing the F statistic and thus perhaps eliminating a
significant difference.
A nonparametric test
may be a more powerful test in such a situation.
If you find outliers in your data that
are not due to correctable errors, you may wish to consult
a statistician as to how to proceed.
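The deflating effect of a single outlier on the F statistic is easy to demonstrate numerically. The sketch below uses hypothetical data and SciPy's `f_oneway` (SciPy is an assumption here, not part of Prophet):

```python
from scipy import stats

# Three hypothetical samples; the group means clearly differ.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 3.0, 4.0, 5.0, 6.0]
c = [5.0, 6.0, 7.0, 8.0, 9.0]

f_clean, p_clean = stats.f_oneway(a, b, c)

# The same data with one large outlier added to the first sample.
# The outlier inflates the within-group variance, shrinking F.
f_out, p_out = stats.f_oneway(a + [30.0], b, c)

print(f"F without outlier: {f_clean:.2f} (p = {p_clean:.4f})")
print(f"F with outlier:    {f_out:.2f} (p = {p_out:.4f})")
```

With these particular numbers, the clean data yield a significant F test while the single outlier wipes the significance out entirely, illustrating how one anomalous value can mask a real difference in means.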

The values in a sample may indeed be from the same
population, but not from a normal one. Signs of
nonnormality
are
skewness
(lack of symmetry) or
light-tailedness or
heavy-tailedness.
The
boxplot,
histogram,
and normal probability plot
(normal Q-Q plot), along with the normality test,
can provide information on the normality of the
population distribution. However, if there are only a small number
of data points, nonnormality can be hard to detect.
If there are a great many data points, the
normality test may detect statistically significant
but trivial departures from normality that will
have no real effect on the F statistic.
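A normality test such as Shapiro-Wilk can back up the visual checks. The sketch below draws a hypothetical skewed (lognormal) sample and tests it before and after a log transformation, using SciPy's `shapiro` (again, an assumption, not a Prophet facility):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12345)

# Hypothetical sample drawn from a skewed (lognormal) population.
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=40)

stat, p = stats.shapiro(skewed)
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.4f}")

# For lognormal data, a log transformation restores normality,
# so W moves toward 1 and the p value rises.
stat_log, p_log = stats.shapiro(np.log(skewed))
print(f"After log transform: W = {stat_log:.3f}, p = {p_log:.4f}")
```

As the text cautions, the same test on a very large sample may flag departures from normality too small to matter, so the p value should be read alongside the plots, not instead of them.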

For data sampled from a normal distribution, normal
probability plots should approximate straight lines,
and boxplots should be symmetric (median and mean together,
in the middle of the box) with no
outliers.

The one-way ANOVA's F test will not be much affected even if the population
distributions are skewed,
but the F test can be sensitive to population skewness if
the sample sizes are seriously unbalanced.
If the sample sizes are approximately equal, the F test
will not be seriously affected by
light-tailedness or
heavy-tailedness,
unless the sample sizes are small (less than 5), or the
departure from normality is extreme (kurtosis less than -1
or greater than 2).

Robust
statistical tests operate well across a wide
variety of distributions.
A test can be robust for
validity, meaning that it provides P values close to the true ones
in the presence of (slight) departures from its
assumptions. It may also be robust for efficiency,
meaning that it maintains its statistical
power (the
probability that a true violation of the
null hypothesis
will be detected by the test) in the presence of
those departures. The one-way ANOVA's F test is robust for validity
against nonnormality, but it may not be the most
powerful test available for a given
nonnormal
distribution, although it is the most
powerful
test available when its test assumptions are met.
In the case of nonnormality,
a nonparametric test
or a transformation of the data may
result in a more powerful test.

The inequality of the population variances can be assessed
by examination of the relative size of the sample variances,
either informally (including
graphically),
or by a robust variance test
such as Levene's test.
(Bartlett's test is
even more sensitive to nonnormality than the one-way ANOVA's F test,
and thus should not be used for such testing.)
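To make the comparison of the two variance tests concrete, the sketch below runs both on hypothetical samples with visibly different spreads, using SciPy's `levene` and `bartlett` (SciPy is an assumption; Prophet's own Levene's test would serve the same purpose):

```python
from scipy import stats

# Two hypothetical samples with visibly different spreads.
tight = [4.9, 5.0, 5.1] * 10   # sample variance roughly 0.007
wide  = [0.0, 10.0] * 15       # sample variance roughly 26

stat_lev, p_lev = stats.levene(tight, wide)    # robust to nonnormality
stat_bar, p_bar = stats.bartlett(tight, wide)  # assumes normality

print(f"Levene:   p = {p_lev:.2e}")
print(f"Bartlett: p = {p_bar:.2e}")
```

Both tests reject equality of variances decisively here; the point of preferring Levene's test is that, unlike Bartlett's, its p value remains trustworthy when the populations are nonnormal.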
The effect of inequality of variances is mitigated
when the sample sizes are equal: the F test
is fairly robust
in that case,
although the chance of incorrectly reporting a
significant difference in the means when none exists still increases.
This chance of incorrectly rejecting the null hypothesis
is greater when the population variances are very different
from each other, particularly if there is one sample
variance very much larger than the others.

The effect of inequality of the variances is most severe
when the sample sizes are unequal. If the larger samples
are associated with the populations with the larger
variances, then the F statistic will tend to be smaller
than it should be, reducing the chance that the
test will correctly identify a significant difference
between the means (i.e., making the test conservative).
On the other hand, if the smaller samples are associated
with the populations
with the larger variances, then the F statistic will tend
to be greater than it should be, increasing the risk of
incorrectly reporting a significant difference in the
means when none exists. This chance of incorrectly
rejecting the null hypothesis in the case of unbalanced
sample sizes can be substantial even when the population variances
are not very different from each other.

Although the effect of unbalanced sample sizes and
unequal population variances increases for smaller
sample sizes, it does not decrease substantially
if the sample sizes are increased without changing
the lack of balance in the sample sizes. For this reason,
and because equal sample sizes mitigate the effect
of unequal population variances, the best course is
to keep the sample sizes as equal as possible.

The plot of each sample's values against its mean (or its
sample ID)
will consist of vertical "stacks" of data points,
one stack for each unique sample mean value.
If the assumptions for the samples' population
distributions
are correct,
the stacks should be about the same length.
Outliers
may appear as anomalous points in the graph.
A fan pattern like the profile of a megaphone, with a
noticeable flare either to the right or to the left
as shown in the picture (one or more of the "stacks" of data
points is much longer than the others), suggests that
the variance in the values increases in the direction
the fan pattern widens (usually as the sample mean increases), and this in
turn suggests that a transformation
may be needed.
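When the spread grows in proportion to the mean, a log transformation often removes the fan pattern. The sketch below shows the idea on tiny hypothetical samples where the second group is an exact scaled copy of the first (NumPy is an assumption here):

```python
import numpy as np

# Hypothetical samples whose spread grows with the mean (a "fan"
# pattern): the second group is the first scaled by a factor of 10.
low  = np.array([1.0, 2.0, 4.0])
high = 10.0 * low

# The raw variances differ by a factor of 100.
print(np.var(low), np.var(high))

# Taking logs turns the multiplicative spread into an additive shift,
# so the transformed variances match.
print(np.var(np.log(low)), np.var(np.log(high)))
```

Real data rarely line up this perfectly, but the same mechanism explains why a log (or similar) transformation can stabilize variances that increase with the group means.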

Side-by-side boxplots of the samples can
also reveal lack of homogeneity of variances
if some boxplots are much longer than others, and reveal suspected
outliers.

If one or more of the sample sizes is small, it may be difficult
to detect assumption violations. With small samples, assumption
violations such as nonnormality
or inequality of variances
are difficult to detect even when they are present. Also, with
small sample size(s) the one-way ANOVA's F test offers less protection
against violation of assumptions.

Even if none of the test
assumptions are violated, a one-way ANOVA with small sample
sizes may not have sufficient
power
to detect any significant
difference among the samples, even if the means
are in fact different.
The power depends on the error
variance, the selected significance (alpha-) level of the test,
and the sample size. Power decreases as the
variance increases, decreases as the significance
level is decreased (i.e., as the test is made
more stringent), and increases as the sample size
increases. With very small samples, even samples from
populations with very different means may not produce
a significant one-way ANOVA F test statistic unless the sample
variance is small. If a statistical
significance test with small sample sizes
produces a surprisingly non-significant
P value, then a lack of power may be the reason.
The best time to avoid such problems is in the
design stage of an experiment, when appropriate
minimum sample sizes can be determined, perhaps in consultation
with a statistician, before data collection begins.
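The dependence of power on sample size can be explored by simulation. The sketch below (hypothetical means and variance, SciPy assumed) estimates the power of the F test by repeatedly drawing samples and counting how often the test rejects:

```python
import numpy as np
from scipy import stats

def simulated_power(means, n_per_group, sigma=1.0, alpha=0.05,
                    n_sims=2000, seed=0):
    """Estimate the power of the one-way ANOVA F test by simulation.

    Draws normal samples with the given group means and a common
    standard deviation, and reports the fraction of simulated
    experiments in which the F test rejects at level alpha.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        groups = [rng.normal(m, sigma, n_per_group) for m in means]
        _, p = stats.f_oneway(*groups)
        if p < alpha:
            rejections += 1
    return rejections / n_sims

# With only 5 observations per group, even a one-standard-deviation
# shift in one mean is frequently missed; 30 per group does far better.
print(simulated_power([0.0, 0.0, 1.0], n_per_group=5))
print(simulated_power([0.0, 0.0, 1.0], n_per_group=30))
```

This kind of calculation, done at the design stage, is exactly how appropriate minimum sample sizes can be chosen before data collection begins.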

The one-way ANOVA test is not too sensitive
to inequality of variances
if the sample sizes are equal.
If the sample sizes are not approximately equal,
the calculated F statistic is dominated by the
sample variances of the larger samples. The test is then
less likely to correctly identify
significant differences in the means when the
larger samples come from the populations with the larger
variances, and more likely to report nonexistent
differences in the means when the smaller samples
come from the populations with the larger variances.
Unbalanced sample sizes also increase any effect due
to nonnormality, and require adjustments to be
made in calculating multiple comparisons tests.

In general, the multiple comparisons tests will be robust
in those situations when the one-way ANOVA's F test is robust,
and will be subject to the same potential problems with
unequal variances, particularly when the sample sizes are unequal.
As with the one-way ANOVA itself, the best protection against
the effects of possible assumption violations is to employ
equal sample sizes. Unequal variances may make individual
comparisons of means inaccurate, because the multiple comparison
techniques rely on a pooled estimate for the variance, based
on the assumption that the sample variances are equal.

Ideally, the sample sizes will be equal for all-pairwise multiple
comparison tests. When they are not, an adjustment must be made
to the calculations. The Tukey-Kramer adjustment (based on the
harmonic mean of each pair's sample sizes), which Prophet
uses, may be conservative (that is, it may be less likely to
flag means as different than the nominal significance level
would suggest), but in general performs well. An alternative
procedure is to use the harmonic mean of all the sample sizes
for all the pairwise comparisons. This has the disadvantage
that the actual significance level of the test is more often
different from the nominal significance level than is the
case with the Tukey-Kramer adjustment; worse, the actual
significance level of the test may be greater than the
nominal significance level, meaning that the test is more likely
to incorrectly flag a mean difference as significant.
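The two adjustments differ only in which harmonic mean they use, which the sketch below makes explicit (the function names are illustrative, not Prophet's):

```python
def pairwise_harmonic_n(n1, n2):
    """Tukey-Kramer style effective sample size for one pair of
    groups: the harmonic mean of the two group sizes."""
    return 2.0 / (1.0 / n1 + 1.0 / n2)

def overall_harmonic_n(sizes):
    """Alternative adjustment: the harmonic mean of all the group
    sizes, applied to every pairwise comparison."""
    return len(sizes) / sum(1.0 / n for n in sizes)

# Hypothetical unbalanced design with group sizes 4, 12, and 12.
print(pairwise_harmonic_n(4, 12))      # harmonic mean of 4 and 12 is 6
print(pairwise_harmonic_n(12, 12))     # a balanced pair keeps its own n
print(overall_harmonic_n([4, 12, 12])) # one value reused for every pair
```

The Tukey-Kramer adjustment tailors the effective sample size to each pair, which is why its actual significance level tracks the nominal level more closely than the single overall harmonic mean does.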