Wednesday, October 17, 2012

The Story

Suppose we have two strategies (treatments) for making money.
We want to test whether there is a difference in the payoffs that
we get with the two strategies. Assume that we are confident
enough to rely on t-tests, that is, means are approximately normally distributed.
For some reason, such as transaction costs or other cost differences, we don't care about the
difference in the strategies if the difference is less than 50 cents.
As an example, we can simulate two samples, taking a dime, 0.1,
as the true difference.
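
A minimal sketch of such a simulation is below. The true difference of 0.1 is from the story; the seed, the sample size, and the payoff standard deviation of 0.5 are assumptions made for illustration, not values stated in this post.

```python
import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed, only for reproducibility

nobs = 100        # observations per strategy (assumed)
true_diff = 0.1   # the dime from the story
sigma = 0.5       # assumed standard deviation of the payoffs

# Normally distributed payoffs for the two strategies
payoff1 = rng.normal(loc=0.0, scale=sigma, size=nobs)
payoff2 = rng.normal(loc=true_diff, scale=sigma, size=nobs)

print("nobs:", nobs, "diff in means:", payoff2.mean() - payoff1.mean())
```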

The null hypothesis for the t-test is that the two samples have
the same mean. If the p-value of the t-test is below, say 0.05,
we reject the hypothesis that the two means are the same.
If the p-value is above 0.05, then we don't have enough evidence
to reject the null hypothesis. This can also happen when the
power of the test is not high enough given our sample size.
As the sample size increases, we have more information and the
test becomes more powerful. If the true means are different, then in large samples
we will always reject the null hypothesis of equal means.
(As the number of observations goes to infinity the probability of
rejection goes to one if the means are different.)
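
The claim about power can be checked with a small Monte Carlo experiment. The sketch below estimates how often `scipy.stats.ttest_ind` rejects equal means at the 0.05 level when the true difference is 0.1. The helper `rejection_rate`, the standard deviation of 0.5, and the number of replications are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def rejection_rate(nobs, true_diff=0.1, sigma=0.5, alpha=0.05, nrep=500, seed=0):
    """Fraction of simulated sample pairs in which the t-test rejects equal means."""
    rng = np.random.default_rng(seed)
    # each row is one simulated experiment with nobs observations per strategy
    x1 = rng.normal(0.0, sigma, size=(nrep, nobs))
    x2 = rng.normal(true_diff, sigma, size=(nrep, nobs))
    _, pvalues = stats.ttest_ind(x1, x2, axis=1)
    return np.mean(pvalues < alpha)

for nobs in (10, 100, 1000):
    print("nobs:", nobs, "rejection rate:", rejection_rate(nobs))
```

The estimated rejection rate should grow with the sample size and approach one, even though the true difference stays at a dime.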
The second test, TOST, has as null hypothesis that the difference is outside
an interval. In the symmetric case, this means that the absolute difference is
at least as large as a given threshold.
If the p-value is below 0.05, then we reject the null hypothesis that
the two means differ by at least the threshold. If the p-value is above
0.05, we have insufficient evidence to reject the hypothesis that
the two means differ by that much.
Note that the null hypotheses of the t-test and of TOST are reversed: rejection means
a significant difference in the t-test and significant equivalence in TOST.
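
To make the two one-sided tests concrete, here is a minimal sketch for two independent samples with a pooled variance. The function name `tost_ind_sketch` and the simulated data are illustrative assumptions; this is not the code from the gist or from statsmodels.

```python
import numpy as np
from scipy import stats

def tost_ind_sketch(x1, x2, low, upp):
    """TOST p-value for H0: mean(x1) - mean(x2) <= low or >= upp.

    Uses the pooled-variance t statistic (equal variances assumed).
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    diff = x1.mean() - x2.mean()
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / df
    se = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    # first one-sided test, H0: diff <= low, reject for large t
    p_low = stats.t.sf((diff - low) / se, df)
    # second one-sided test, H0: diff >= upp, reject for small t
    p_upp = stats.t.cdf((diff - upp) / se, df)
    # equivalence is concluded only if both one-sided tests reject
    return max(p_low, p_upp)

rng = np.random.default_rng(0)           # arbitrary seed
x1 = rng.normal(0.0, 0.5, size=100)      # assumed payoff distribution
x2 = rng.normal(0.1, 0.5, size=100)
tost_pvalue = tost_ind_sketch(x1, x2, -0.5, 0.5)
print("ttest:", stats.ttest_ind(x1, x2).pvalue)
print("tost: ", tost_pvalue)
```

Taking the larger of the two one-sided p-values is what makes the null hypothesis "the difference is outside the interval": we only claim equivalence when the difference is significantly above the lower bound and significantly below the upper bound.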

The Results

Looking at the simulated results:

small sample size:

nobs: 10 diff in means: -0.14039151695
ttest: 0.606109617438 not different tost: 0.0977715582206 different

With 10 observations the information is not enough to reject the null hypothesis
in either test. The t-test says we cannot reject that they are the same.
The TOST test says we cannot reject that they are different.

medium sample size:

nobs: 100 diff in means: 0.131634043864
ttest: 0.0757146249227 not different tost: 6.39909387346e-07 not different

The t-test does not reject that they are the same at a significance level of 0.05.
The TOST test now rejects the hypothesis that there is a large (at least 0.5)
difference.

large sample size:

nobs: 1000 diff in means: 0.107020981612
ttest: 1.51161249802e-06 different tost: 1.23092818968e-65 not different

Both tests now reject their null hypotheses. The t-test rejects that the means are
the same. However, the difference in means is only about 0.1, so the statistically
significant difference is not large enough that we really care. Statistical significance
doesn't mean it's also an important difference.
The TOST test strongly rejects that there is a difference of at least 0.5, indicating
that given our threshold of 0.5, the two strategies are the same.

Notes

Implementation:
The t-tests are available in scipy.stats.
I wrote the first version for paired sample TOST just based on a scipy.stats ttest https://gist.github.com/3900314 .
My new versions including tost_ind will soon come to statsmodels.

Editorial note:
I looked at tests for equivalence like TOST a while ago in response to some discussion on the scipy-user
mailing list about statistical significance.
This time I mainly coded, and spent some time looking at how to verify my code against SAS and R.
Finding references and quotes is left to the reader or to another time. There are some controversies
around TOST and some problems with it, but from all I saw, it's still the most widely accepted approach
and is recommended by the US government for bio-equivalence tests.