Monday, March 25, 2013

Multiple Comparison and Tukey HSD or why statsmodels is awful

Introduction

Statistical tests are often grouped into one-sample, two-sample and k-sample
tests, depending on how many samples are involved. In k-sample tests the
usual null hypothesis is that a statistic, for example the mean as in a
one-way ANOVA, or the distribution in goodness-of-fit tests, is the same in
all groups or samples. The usual test is the joint test that all samples
have the same value, against the alternative that at least one sample or
group has a different value.

However, often we are not just interested in the joint hypothesis that all
samples are the same; we would also like to know for which pairs of samples
the hypothesis of equal values is rejected. In that case we conduct several
tests at the same time, one for each pair of samples.

This creates a multiple testing problem, and we should correct our test
distribution or p-values to account for it.

I mentioned some of the one- and two-sample tests in statsmodels before.
Today, I just want to look at pairwise comparisons of means: we have k
samples and we want to test, for each pair, whether the means are the same
or not.

Instead of adding more explanations here, I just want to point to an
R tutorial and to the brief description on Wikipedia. A search for
"Tukey HSD" or "multiple comparison" on the internet will find many
tutorials and explanations.

The following are examples in statsmodels and R interspersed with a few explanatory comments.

The Data

To make it simple, I just define the data as a numpy rec.array, and add some
imports.
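Something along these lines; this is only a minimal sketch, assuming a
reasonably recent statsmodels in which pairwise_tukeyhsd returns a printable
results object. The treatment labels match the discussion below, but the
numeric values are made up purely for illustration:

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Three treatment groups with a numeric response; the values are
# invented for illustration only.
dta = np.rec.array([
    ('medical',  2), ('medical',  3), ('medical',  5), ('medical',  4),
    ('mental',   6), ('mental',   7), ('mental',   6), ('mental',   8),
    ('physical', 5), ('physical', 4), ('physical', 6), ('physical', 5),
], dtype=[('Treatment', '|U8'), ('StressReduction', int)])

# Tukey's HSD for all pairwise mean differences.
res = pairwise_tukeyhsd(dta['StressReduction'], dta['Treatment'])
print(res)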

From the result, we can see that we reject the hypothesis that treatments
zero and one, that is medical and mental, have the same mean, but we cannot
reject it for either of the other two pairs.

The following is mostly a comparison with R, and also illustrates where the
implementation in statsmodels is lacking. The multiple comparisons and
tukeyhsd were initially written based on SAS examples from the internet or
the SAS documentation. Some of the unit tests are written against R.

Comparing my confidence intervals with those of R, we can see that there is
a small difference, which most likely comes from the precision of the
distribution of the studentized range statistic; it is not a standard
distribution and is not implemented to very high precision.
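For reference, the critical value behind these intervals comes from the
studentized range distribution. A minimal sketch of obtaining it with
statsmodels' qsturng (the number of groups and the degrees of freedom here
are made up):

from statsmodels.stats.libqsturng import qsturng

# Critical value of the studentized range for k groups and df error
# degrees of freedom at alpha = 0.05 (both numbers are made up).
k, df = 3, 27
q_crit = qsturng(1 - 0.05, k, df)

# The interval for a pair is then meandiff +/- q_crit * std_pairs,
# where std_pairs is the standard error on the studentized range scale.
print(q_crit)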

Statsmodels doesn't provide p-values for tukeyhsd, so let's try to partially
reverse engineer them. The results returned by tukeyhsd include the mean
differences and the standard deviations for the studentized range statistic.
Statsmodels also includes a function, written by Roger Lew, for the p-value
of the studentized range test.
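A sketch of that reverse engineering, with made-up numbers standing in for
the tukeyhsd output:

import numpy as np
from statsmodels.stats.libqsturng import psturng

# Made-up stand-ins for the tukeyhsd output: mean differences and
# standard errors on the studentized range scale, for k groups with
# df error degrees of freedom.
meandiffs = np.array([2.0, 1.0, -1.0])
std_pairs = np.array([0.7, 0.7, 0.7])
k, df = 3, 27

# Studentized range statistic and its p-value for each pair.
st_range = np.abs(meandiffs) / std_pairs
pvalues = psturng(st_range, k, df)
print(pvalues)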

The first confidence interval of the difference in means does not include
zero, so the means are statistically significantly different. For the other
two pairs, we cannot reject the hypothesis that the difference is zero.

Part 2: Pairwise T-tests

Tukey's studentized range test (HSD) is specific to the comparison of all
pairs of k independent samples. Instead, we can run t-tests on all pairs,
calculate the p-values and apply one of the p-value corrections for multiple
testing problems.

Both statsmodels and R have several options to adjust the p-values. Among
others, statsmodels has the false discovery rate corrections fdr_bh
(Benjamini/Hochberg) and fdr_by (Benjamini/Yekutieli). In the following I
will mainly show Holm and Bonferroni adjustments.
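As a sketch, reusing the dta array defined above (so the numbers are still
illustrative):

from scipy import stats
from statsmodels.stats.multicomp import MultiComparison

# All pairwise independent-sample t-tests with Holm-adjusted p-values.
# allpairtest accepts any test function that returns (statistic, pvalue).
mod = MultiComparison(dta['StressReduction'], dta['Treatment'])
rtp = mod.allpairtest(stats.ttest_ind, method='holm')
print(rtp[0])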

Paired Samples

In this case the assumption is that samples are paired, for example when
each individual goes through all three treatments.

Note: I'm using the data just as illustration. I did not check whether
it would actually make sense to do this with this dataset.
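With the same MultiComparison object, the paired version is just a different
test function. Again only a sketch on the made-up data, assuming equal group
sizes and a meaningful pairing:

from scipy import stats

# Related-samples t-tests on all pairs, treating observations as
# matched across treatments; reuses `mod` from the sketch above.
rtp_rel = mod.allpairtest(stats.ttest_rel, method='holm')
print(rtp_rel[0])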

The answer is that the pairwise.t.test for independent samples in R, as
well as the Tukey HSD test in both packages, uses the joint variance across
all samples, while the pairwise t-test calculates the joint variance
estimate for each pair of samples separately: stats.ttest_ind just looks at
one pair at a time.

Now, we could write a pairwise t-test that takes a joint variance as given
and feed it to mod.allpairtest. However, since this is already getting long,
I just take a shortcut.

tukeyhsd returned the mean differences and the standard deviations for the
studentized range statistic. The studentized range statistic is the same as
the t-statistic except for a scaling factor of np.sqrt(2).
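A sketch of that shortcut, again with made-up numbers in place of the
tukeyhsd output:

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Made-up stand-ins for the tukeyhsd output (as above).
meandiffs = np.array([2.0, 1.0, -1.0])
std_pairs = np.array([0.7, 0.7, 0.7])
df = 27

# Rescale the studentized range statistic to a t-statistic, compute
# two-sided p-values from the t distribution, then Holm-adjust them.
tstat = np.abs(meandiffs) / std_pairs / np.sqrt(2)
pvals = 2 * stats.t.sf(tstat, df)
reject, pvals_holm, _, _ = multipletests(pvals, method='holm')
print(pvals_holm, reject)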

Finally

In the time it took to write this blog article, I could have written a pull
request improving this. However, I was in a bit of a grumpy mood, and I
thought I'd join the "statsmodels is awful" theme.

What got me started on this story is that I was browsing the html help for
the current master and saw this. As it turned out, the wrong function is
included in the documentation: tukeyhsd instead of pairwise_tukeyhsd.

4 comments:

This is super interesting. I have the requisite grad school minimum of a stats background, but don't have the understanding to code up my own hsd. I desperately need a tukey hsd in python that I can trust. Matlab has an outstanding implementation (http://www.mathworks.com/help/stats/multcompare.html) that I had gotten accustomed to, but now that my entire workflow lives in python I'd very much like to wean myself off it (but want something similar enough).

From reading your article, it sounds like you're saying that statsmodels is accurate (as compared to R), but just not very user-friendly. Would you agree? Any pending/necessary changes? If statsmodels is mathematically sound, it would be great to modify the source to be a little prettier, and even to have some native plotting functionality a la your plot and Matlab. At the least we could write a nice python wrapper to do the dirty work. I might try the second option.

I was astounded to not find tukey hsd in scipy or other major modules, and to only find this wonky-to-use function in statsmodels, given its popularity in my field. Thanks for the down and dirty explanation.

The main improvement that pairwise_tukeyhsd would need is to return a results instance, where we can attach the results, the string output and the plots, instead of returning a list of heterogeneous results.

I don't have any changes pending. I plan to do some cleanup, but in the meantime I still have multiple testing to finish, and I would like to get some parts done for equivalence testing (before I forget what I read in my latest set of readings).

As a related aside: in the meantime I found https://pypi.python.org/pypi/pyvttbl/0.5.2.2 by Roger Lew, which has some of the same functions and more on repeated measures.

So I've taken a hack at implementing the nifty multcompare figure that Matlab generates (like the ones shown here: http://www.mathworks.com/help/stats/multcompare.html), and I think the best place to put this functionality would be in your statsmodels.stats.multicomp.MultiComparison class.

The tukeyhsd intervals are based on Hochberg's generalized Tukey-Kramer confidence interval calculations. (Hochberg, Y., and A. C. Tamhane. Multiple Comparison Procedures. Hoboken, NJ: John Wiley & Sons, 1987.) Those CI's are unique in that they allow simultaneous comparison between all means with only a single interval per group.

So I have plot_tukeyhsd_intervals currently working on top of the MultiComparison class. Is this something you'd be comfortable adding to the module if I made a pull request? This is really a lot of the functionality I was looking for in the class. Otherwise, tukeyhsd() works splendidly once you figure out the results output.

I opened https://github.com/statsmodels/statsmodels/issues/789 if you want to discuss it there. I also linked there to what I started with pairwise comparisons of proportions. The main question for the design is whether or what to add to the MultiComparison class and what to a results class for reuse by other multiple comparisons (plots?).