Handout: Overcoming bias when multiple comparisons are necessary

We saw in Lab 1.8 that it is important to avoid giving your data an unfair advantage in passing a significance test. In the section of Lab 1.8 titled “ANOVA and groups of 3 or larger”, we simulated 4 groups of data (called A, B, C, and D) that we knew were drawn from the same distribution (normally distributed values with mean 0 and standard deviation 1). When we used 6 t-tests to compare A to B, A to C, A to D, B to C, B to D, and C to D, we observed “significant” differences (p<0.05 for any of the 6 tests) far more often than the expected 5%, because the data essentially got 6 chances to pass the 5% test. For most of you, the rate of “significant” differences for this type of test was about 20%. That is, using the t-test in this manner introduced a bias that made the groups appear more different than they really were. When we used ANOVA, a technique that compares the within-group variance to the between-group (or across-group) variance, there was no unexpected bias, and the groups were reported to be different (p<0.05) at close to the expected rate of 5%.

We applied the ANOVA test in several situations, such as in Question 4 of Problem Set 1.3. In that question, taken from Baldi and Moore¹, we compared the post-surgical outcomes of rats given either a placebo or 1 of 2 homeopathic treatments (Arnica montana and Staphisagria), each administered at 2 levels of dilution.

You performed the ANOVA either by passing the different groups as different columns (as I do here) or by specifying the group identity of each data point (see help anova1):

which produced a P value of approximately 9e-45 and the graphical output below:

This low P value from the ANOVA test indicates that it is very, very unlikely that the “true” means of all of these groups are the same. And that is all it says; it says nothing about comparisons among the individual groups. But many of you were interested in making subsequent comparisons among individual groups. Are the outcomes of the homeopathic remedies similar to one another, as is suggested by the box plots in the figure? Is the placebo really different from the homeopathic remedies?
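A minimal sketch of the 2 ways of calling anova1 described above, assuming the data sit in a matrix named healing_times with one column per group. The values here are simulated placeholders, not the Baldi and Moore data, so the P value will differ from the one quoted above. The group labels, group size, and means are all assumptions made for illustration.

```matlab
% Sketch only: simulated placeholder data, NOT the Baldi and Moore measurements.
rng(0);                                    % make the example reproducible
group_names = {'Placebo','Arnica low','Arnica high','Staph low','Staph high'}; % assumed labels
n_per_group = 10;                          % assumed number of rats per group
true_means  = [12 6 6 6 6];                % assumed mean healing times, in days
healing_times = randn(n_per_group,5) + repmat(true_means,n_per_group,1);

% Form 1: each column of the matrix is one group.
% The third output (stats) is what multcompare needs later.
[p,anovatab,stats] = anova1(healing_times,group_names,'off');

% Form 2: one long vector of values plus a group label for each value.
values = healing_times(:);                   % stack the columns
labels = repmat(group_names,n_per_group,1);  % same shape as healing_times
labels = labels(:);                          % stacked in the same order as values
p2 = anova1(values,labels,'off');

disp([p p2])                                 % the 2 forms give the same P value
```

The 'off' argument suppresses the ANOVA table and box plot windows; leave it out to see the graphical output.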

Examining multiple comparisons with ANOVA in an unbiased way

A secondary test that is performed after an initial statistical test is called a post-hoc test (“post-hoc” is Latin for “after this”). Let’s look at a wrong way and a right way of doing this.

THE WRONG WAY:

You might imagine that, now that we have established that there are differences among the groups, we could go ahead and perform pairwise t-tests without bias. Under this scheme, you could compare the Placebo group to the Arnica low group by performing the following analysis:

[h,p_value] = ttest2(healing_times(:,1),healing_times(:,2));

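This method is less biased than the original method of just running pairwise t-tests, because data sets in which the ANOVA finds no differences among the groups have already been eliminated. But it is still biased. The ANOVA only tells us that at least one group differs; it does nothing to protect the comparisons among the other groups. In the data above, Placebo is so different from the rest that the ANOVA will almost always pass, and the 4 homeopathic groups then get 6 uncorrected t-tests among themselves. If those 4 groups had no underlying differences, we would be back in the situation of Lab 1.8 and would report at least one “significant” difference among them about 20% of the time, far more often than 5%. (The ANOVA step fully protects you only when no group differs at all: in that case it passes just 5% of the time, so the combined procedure reports a difference at most 5% of the time.)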
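A minimal simulation sketch of this scheme, in the spirit of Lab 1.8. It assumes 5 groups in which group 1 truly differs (like Placebo) and groups 2-5 have no underlying differences, and it counts how often at least one t-test among groups 2-5 comes out “significant” after the ANOVA has passed. The group size, the size of the shift, and the number of simulations are arbitrary choices.

```matlab
% Sketch: "ANOVA first, then ordinary t-tests" applied to simulated data.
rng(1);                               % make the example reproducible
n_sims = 2000;                        % number of simulated experiments (assumed)
n_per_group = 10;                     % data points per group (assumed)
alpha = 0.05;
pairs = nchoosek(2:5,2);              % the 6 pairs among the identical groups 2..5
false_alarm = false(n_sims,1);

for s = 1:n_sims
    x = randn(n_per_group,5);         % 5 groups of N(0,1) values ...
    x(:,1) = x(:,1) + 6;              % ... except group 1, shifted by 6 (assumed)
    p_anova = anova1(x,[],'off');     % the ANOVA "gate"
    if p_anova < alpha
        for k = 1:size(pairs,1)
            [~,p] = ttest2(x(:,pairs(k,1)),x(:,pairs(k,2)));
            if p < alpha
                false_alarm(s) = true; % a difference that is not really there
            end
        end
    end
end

rate = mean(false_alarm)              % compare this to the nominal 5%
```

Removing the line that shifts group 1 simulates the case where no group differs at all, so you can compare the 2 situations yourself.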

SOME RIGHT WAYS:

Mathematicians have developed several post-hoc tests that modify the t-test formula used to compare 2 groups, correcting its inherent bias with a factor that depends on the number of groups and the number of data points in each group. The most popular method is the Tukey-Kramer test, which, like the ANOVA test itself, assumes that the data are normally distributed.

Matlab provides a method for performing these multiple comparisons using the function multcompare (see help multcompare to see how to select different post-hoc algorithms; “Tukey-Kramer” is the default). This function takes, as an input, the statistics generated from the anova1 function (the third output argument, that we called stats above).

comparisons = multcompare(stats);

The multcompare function brings up an interactive window that allows you to examine the comparisons among groups.

Figure: 2 screenshots from the interactive multcompare function. Left panel: the user has clicked on group 1 and sees that the mean of group 1 (blue) is considered to be significantly different from groups 2-5 with 95% confidence, also known as an alpha of 5%. Right panel: the user has clicked on group 2, which is different from group 1 (shown in black) but not significantly different from groups 3-5 (shown in gray).

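The output variable comparisons gives an estimate of the “true difference” for every pair-wise comparison of groups. Each row describes one pair: columns 1 and 2 give the 2 group numbers, column 4 gives the estimated difference between their means, and columns 3 and 5 give the lower and upper limits of the confidence interval for that difference.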

>> comparisons = multcompare(stats)

comparisons =

    1.0000    2.0000    5.4251    6.0000    6.5749
    1.0000    3.0000    5.2251    5.8000    6.3749
    1.0000    4.0000    5.4251    6.0000    6.5749
    1.0000    5.0000    5.2917    5.8667    6.4416
    2.0000    3.0000   -0.7749   -0.2000    0.3749
    2.0000    4.0000   -0.5749         0    0.5749
    2.0000    5.0000   -0.7083   -0.1333    0.4416
    3.0000    4.0000   -0.3749    0.2000    0.7749
    3.0000    5.0000   -0.5083    0.0667    0.6416
    4.0000    5.0000   -0.7083   -0.1333    0.4416

The first row indicates that, with an alpha of 5% (or 95% confidence), the “true difference” between group 1 (Placebo) and group 2 (Arnica low) is estimated to be between 5.4251 and 6.5749 days. This interval is far from 0, so these means are very likely to be different. The 5th row shows that the “true difference” between groups 2 (Arnica low) and 3 (Arnica high) is, with 95% confidence, between -0.7749 days and 0.3749 days. This interval includes 0, so it is entirely possible that there is no true difference between these groups. If you are interested in knowing the confidence interval at a different value of alpha, you can read the documentation of multcompare, which describes how to pass your own value of alpha to the function.
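A minimal sketch of passing a different alpha to multcompare and then picking out the pairs whose confidence interval excludes 0. The data are simulated placeholders (one shifted group and 4 identical ones), so the numbers will not match the table above. The 'Alpha' and 'Display' options are the ones described in help multcompare; check the documentation for your version.

```matlab
% Sketch: confidence intervals at alpha = 0.05 and alpha = 0.01.
rng(2);                                   % make the example reproducible
x = randn(10,5);                          % placeholder data: 5 groups of 10 values
x(:,1) = x(:,1) + 6;                      % group 1 shifted by 6 (assumed)
[~,~,stats] = anova1(x,[],'off');

% 'Display','off' suppresses the interactive window
c95 = multcompare(stats,'Alpha',0.05,'Display','off');
c99 = multcompare(stats,'Alpha',0.01,'Display','off');

% Columns 3 and 5 hold the lower and upper confidence limits.
% A pair is called different when its interval does not include 0.
excludes_zero = (c99(:,3) > 0) | (c99(:,5) < 0);
differing_pairs = c99(excludes_zero,1:2)  % group numbers of the differing pairs
```

The 99% intervals in c99 are wider than the 95% intervals in c95, so fewer pairs may be reported as different at the stricter alpha.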

Overcoming bias in multiple comparisons using non-parametric tests like Kruskal-Wallis

In situations where the data are clearly not normal, for example when we have index values that vary between 0 and 1 (and several data points hit the floor or ceiling), we learned in Lab 2.3 that we need to use non-parametric tests like the Kruskal-Wallis test.

But what if we want to perform post-hoc tests following a Kruskal-Wallis test? For example, in Team Project 1, we examined how 4 variables (offset, height, location, and width) varied across wild-type and mutant animals. If we just do several Kruskal-Wallis tests, such as offset_wild-type vs. offset_mutant, height_wild-type vs. height_mutant, etc., won’t we be introducing biases into our tests, just as we did with multiple t-tests above?

Bonferroni correction

The answer is yes, but how do we correct for them? The simplest and most commonly used technique for adjusting pair-wise tests for multiple comparisons is called the Bonferroni correction (because its proof is based on the Bonferroni inequalities that were first demonstrated by the Italian mathematician Carlo Emilio Bonferroni). The Bonferroni correction is the following:

Bonferroni correction for multiple comparisons

Suppose we have several P values, P1, P2, ..., PN, from N pairwise comparisons, and we want to call a comparison “significant” only if the groups are likely to be different with confidence 1-alpha. For example, we might want to call any comparison “significant” if the groups are likely to be different with a confidence of 95% (alpha value of 0.05).

By the Bonferroni correction, we can say that comparison i is significant at a level of alpha if Pi < alpha / N.

Abstract example: suppose we have conducted 2 Kruskal-Wallis tests that compared two index values, A and B, measured from a fly behavior experiment in wild-type and mutant animals. The P value P1 from the A_wild-type to A_mutant comparison was 0.01 and the P value P2 from the B_wild-type to B_mutant comparison was 0.04. We wish to use an alpha of 0.05 for our significance test. N here is 2, so we need each P value to be less than alpha / N, which is 0.025 here, in order for the comparison to be considered significant with 95% confidence. P1 is less than 0.025, so we declare the comparison to be significant. P2 is not less than 0.025, so we cannot conclude that these quantities are different.
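The arithmetic of the abstract example can be sketched in a few lines. The variable names are chosen here for illustration; the P values are the ones from the example, typed in by hand rather than computed from data.

```matlab
% Sketch: Bonferroni correction applied to the 2 P values from the example.
p_values = [0.01 0.04];                  % P1 and P2
alpha = 0.05;                            % desired overall significance level
N = numel(p_values);                     % number of comparisons
threshold = alpha / N;                   % each P value must be below this

is_significant = p_values < threshold    % true for P1, false for P2

% Equivalent view: multiply each P value by N (capping at 1) and compare
% the adjusted values to alpha directly.
p_adjusted = min(p_values * N, 1);
same_answer = isequal(is_significant, p_adjusted < alpha)
```

With real data, p_values would hold the outputs of your own tests, for example the first output of each kruskalwallis call.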

Why not always use non-parametric tests?

You might think that the safest course would be to always use non-parametric tests that make no assumptions about the distribution of the data, rather than using tests like the t-test and ANOVA, which assume the data are normally distributed. In fact, this is true, but there are practical reasons why we like to use tests that make assumptions when these assumptions are valid: they have more power for the same amount of data, so they need less data to detect the same difference. For example, chemist William Sealy Gosset developed the t-test in 1908 as an inexpensive means to monitor the quality of stout at his employer, the Guinness brewery in Ireland. He was able to get a lot of information about the quality of Guinness’s industrial processes using relatively little data. So if there isn’t a compelling reason to use a non-parametric test, there are often advantages in terms of cost and time to using a parametric test like the t-test or ANOVA.
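A minimal simulation sketch of the power argument, assuming 2 normally distributed groups of only 5 data points each whose true means differ by 2 standard deviations. It counts how often each test detects the difference at alpha = 0.05. The function ranksum (the Wilcoxon rank-sum test, a non-parametric test for 2 groups) stands in for the non-parametric side here; that choice, the group size, and the size of the difference are all assumptions of this sketch.

```matlab
% Sketch: power of the t-test vs. a non-parametric test on small normal samples.
rng(3);                            % make the example reproducible
n_sims = 2000;                     % number of simulated experiments (assumed)
n = 5;                             % data points per group (assumed)
shift = 2;                         % true difference in means, in SD units (assumed)
alpha = 0.05;
hit_t  = false(n_sims,1);
hit_rs = false(n_sims,1);

for s = 1:n_sims
    a = randn(n,1);                % group A: N(0,1)
    b = randn(n,1) + shift;        % group B: N(shift,1)
    [~,p_t] = ttest2(a,b);         % parametric test
    p_rs = ranksum(a,b);           % non-parametric test
    hit_t(s)  = p_t  < alpha;
    hit_rs(s) = p_rs < alpha;
end

power_t  = mean(hit_t)             % fraction of experiments where the t-test detects the difference
power_rs = mean(hit_rs)            % same fraction for the rank-sum test
```

Increasing n shrinks the gap between the 2 tests, which is why the advantage of parametric tests matters most when data are scarce or expensive.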

Comparing groups of data that have more than 1 dimension

If each data point in your sample is represented by more than 1 quantity, and if these quantities are all normally distributed, then there is a powerful statistical method, called the MANOVA (multivariate analysis of variance), for determining whether the samples differ in any of these dimensions. The MANOVA is beyond the scope of this course, but you can read about it on Wikipedia and read about Matlab’s implementation in the function manova1 by typing help manova1.