$\begingroup$That link actually makes a whole lot of sense. I used my power of common sense (because the dataset was really small) to deduce that the model is not significant. As one of the answers below mentions, rejection or non-rejection of the null hypothesis doesn't imply the alternative is true, which again makes me wonder what the point of stating an alternative is (I'm a bit skeptical about the argument though).$\endgroup$
– Joe Jun 13 '13 at 23:42

3 Answers

ISTR there is a form of hypothesis testing where the null hypothesis is the thing you want to assert to be true. IIRC this is based on statistical power, which is the probability (in a frequentist sense) that the null hypothesis will be rejected when it is false. So if the p-value is above the significance level but the test has high statistical power, then we would expect the null to be rejected if it were false; the fact that it wasn't rejected suggests it isn't false. Simple! ;o)

I'll see if I can remember what it is called and look it up; until then, caveat lector!

Update: I think what I had in mind is "accept support" hypothesis testing, rather than "reject support" testing, see e.g. here.

Another (hopefully) illustrative update:

Climate skeptics often claim that there has been no global warming since 1998, frequently citing a BBC interview with Prof. Phil Jones of the Climatic Research Unit at UEA (where I also work). Prof. Jones was asked:

Q: Do you agree that from 1995 to the present there has been no statistically-significant global warming?

and answered:

A: Yes, but only just. I also calculated the trend for the period 1995 to 2009. This trend (0.12C per decade) is positive, but not significant at the 95% significance level. The positive trend is quite close to the significance level. Achieving statistical significance in scientific terms is much more likely for longer periods, and much less likely for shorter periods.

The test Jones is using here is the standard reject-support type of hypothesis test, where the null hypothesis is the opposite of what he would wish to assert:

H0: The rate of warming since 1995 is zero.
H1: The rate of warming since 1995 is greater than zero.

Over the period concerned, the p-value for the trend under the null hypothesis was greater than 0.05, which is why Prof. Jones correctly said that there had not been statistically significant warming since 1995.

However, for a skeptic to use this test to support their view that there has been no global warming would not be a good idea, as they would be arguing FOR the null hypothesis, and reject-support hypothesis testing is biased in favour of the null hypothesis: we start off by assuming that H0 is true and only proceed to H1 if H0 is inconsistent with the observations.

What a climate skeptic should do instead is perform an accept-support test: fix a significance level and then check whether there are enough observations for the test to have enough power that we could be confident of rejecting the null hypothesis if it were actually false. Sadly, computing statistical power is rather tricky (which is presumably why reject-support testing is more popular), and it turns out that in this case the test doesn't have sufficient statistical power. Combining the two hypothesis tests, we find that the observations rule out neither the possibility that it hasn't warmed nor the possibility that it has continued to warm at the original rate (which is easily seen by looking at the confidence interval for the trend, without all this hassle).
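For concreteness, here is a minimal sketch of how such a power calculation can be done by simulation. The assumed trend matches the 0.12C per decade figure quoted above, but the time span and noise level are illustrative assumptions, not the actual temperature record:

```python
# Minimal sketch: simulation-based power of a one-sided trend test.
# The trend matches the 0.12C/decade quoted above; the noise level and
# number of years are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_years = 15        # e.g. 1995-2009
trend = 0.012       # assumed true trend, degrees C per year
sigma = 0.10        # assumed interannual noise (illustrative)
alpha = 0.05
n_sim = 10_000

years = np.arange(n_years)
rejections = 0
for _ in range(n_sim):
    y = trend * years + rng.normal(0.0, sigma, n_years)
    fit = stats.linregress(years, y)
    # one-sided p-value for H1: slope > 0
    p = fit.pvalue / 2 if fit.slope > 0 else 1 - fit.pvalue / 2
    rejections += (p < alpha)

print("estimated power:", rejections / n_sim)
```

If the estimated power is low, a non-significant result tells you very little: the test was unlikely to reject H0 even if H0 were false, which is exactly the point made above.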

Note that Prof. Jones points out that the likelihood of finding statistically significant warming depends on the length of the period you look at, which suggests that he does understand the idea of the power of a test.

Hopefully this example illustrates that you can take H0 to be the thing that you want to show is true, but it is so much more complicated that it is worth avoiding if you can. It is also a nice example of how the general public doesn't really understand statistical significance.

Let me give you one illustration. Testing the significance of all parameters of the model may be viewed as doing a Wald test. Without getting into too much detail, the Wald test statistic $W$ is a quadratic form in the vector of all coefficients and the inverse of their variance-covariance matrix (a kind of "square" of the coefficient vector). The bottom line is that $W$ can only take positive values, and if all the coefficients are really zero it has a chi-square asymptotic distribution (which in finite samples translates into an F distribution).
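In symbols (a standard textbook formulation, not anything specific to this thread): writing $\hat{\beta}$ for the vector of estimated coefficients and $\widehat{V}$ for its estimated variance-covariance matrix,

$$W = \hat{\beta}^{\top}\,\widehat{V}^{-1}\,\hat{\beta},$$

which under $H_0: \beta = 0$ is asymptotically $\chi^2_q$, where $q$ is the number of coefficients being tested.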

So if we assume that the model is not significant, we know the distribution of $W$: we know how this statistic behaves. Thus we are able to say that if all parameters are zero, then usually (in 90%, 95%, etc. of cases) $W$ should be smaller than a given critical value (see here: http://www.itl.nist.gov/div898/handbook/eda/section3/eda3674.htm). If the statistic is very large, we know that the probability of it being that high even though the parameters were zero (i.e. the p-value) is very small, and thus we are willing to say: no, those parameters can't be zero. And the chance that we are making a mistake is only the p-value.

But if the parameters are not zero, then the $W$ statistic has a non-central chi-square asymptotic distribution, with a non-centrality parameter that depends on the real values of the parameters. This distribution is not the same as the ordinary (central) chi-square and has different critical values. If you take the null hypothesis to be "the model is significant", then the Wald statistic you compute behaves in a very specific way, according to some non-central chi-square. But since we don't know the real parameters, we cannot know the non-centrality parameter, and so we cannot know the distribution. Therefore we are not able to say "only in 5% of cases would we obtain such a statistic".
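To make this concrete, here is a small simulation sketch; the sample size, noise level, and the non-zero coefficients used for the alternative are all illustrative assumptions:

```python
# Sketch: the Wald statistic W follows the central chi-square under H0
# (all coefficients zero) but drifts right under an alternative, by an
# amount that depends on the unknown true coefficients.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, q = 200, 3                 # observations, coefficients tested
n_sim = 5_000

def wald_stat(beta_true):
    X = rng.normal(size=(n, q))
    y = X @ beta_true + rng.normal(size=n)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    V = (resid @ resid / (n - q)) * XtX_inv   # estimated cov of beta_hat
    return beta_hat @ np.linalg.solve(V, beta_hat)

W_null = np.array([wald_stat(np.zeros(q)) for _ in range(n_sim)])
W_alt = np.array([wald_stat(np.full(q, 0.1)) for _ in range(n_sim)])

crit = stats.chi2.ppf(0.95, df=q)  # critical value of the *central* chi-square
print("rejection rate under H0:", np.mean(W_null > crit))  # ~0.05
print("rejection rate under H1:", np.mean(W_alt > crit))   # the power
```

The second rejection rate (the power) changes if you change the true coefficients, which is exactly why there is no single distribution, and no single critical value, under "the model is significant".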

In other words, the hypothesis "the model is not significant" is very specific (it means that all parameters are zero), so we know exactly how the computed test statistic should behave. This allows us to make an inference.
On the other hand, the hypothesis "the model is significant" is very general, and there is no unique distribution of the test statistic under it: $W$ can behave in many different ways. Therefore we cannot really make an exact inference.

This is the philosophy behind the fact that we always assume that something is zero and try to reject it.

But referring to your specific question: what does it mean that the p-value is greater than 0.95? It means that the Wald statistic is really close to zero, so it does suggest that the model is probably insignificant, and in that sense it may give you some guidance. But you cannot claim that the test level is 0.05. The test level is the probability of rejecting the null when it is true, and under your new H0 the test statistic does not even have a unique distribution (unless we assume specific values for the parameters we want to test), so we cannot in fact talk about any level of the test.

Disclaimer: I'm new to the field, so I might be wrong on some points :)

$\begingroup$We don't always assume that something is zero and try to reject it, because sometimes we want to assert that something is zero (for example, a drug manufacturer might want to assert that their drug has no side effects). Instead, we normally want H0 to be the hypothesis that we really don't want to be true; this is normally "no effect", but not always.$\endgroup$
– Dikran Marsupial Jun 13 '13 at 20:27

$\begingroup$This is right in general, but in the context of the Wald test you always test whether some linear combination is zero.$\endgroup$
– Michal Jun 13 '13 at 22:28

If you think about how to do hypothesis testing through simulation, it becomes easier to imagine how this might be done, and why it shouldn't be.

I collect my data and get my two means, Y1 and Y2. I make assumptions about their underlying distributions and calculate their standard deviations. Then I simulate what happens if mu1 = mu2 (H0 is true), that is, if Y1 and Y2 come from the same population. What happens is that, if Y1 and Y2 are quite far apart, the probability of that difference or a greater one occurring when H0 is true is very low. So I conclude that Y1 and Y2 come from different populations, but only because that's all that's left in my hypothesis list. I've concluded H0 is false, but my test hasn't actually tested anything about H1 or demonstrated that it is true. That's essentially the kind of hypothesis testing you know.
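Here is that simulation sketched in code. The data are made up, and I've used a permutation (label-shuffling) variant of the idea, which sidesteps the distributional assumptions, but the logic is the same: force H0 (one common population) to be true and see how often a difference at least as large as the observed one turns up:

```python
# Sketch: how often would a difference in means at least as large as the
# observed one arise if H0 (same population) were true? Data are made up.
import numpy as np

rng = np.random.default_rng(2)
y1 = np.array([5.1, 4.8, 5.5, 5.0, 5.3])   # hypothetical sample 1
y2 = np.array([4.2, 4.5, 4.0, 4.4, 4.1])   # hypothetical sample 2
observed = abs(y1.mean() - y2.mean())

pooled = np.concatenate([y1, y2])          # under H0 the labels are exchangeable
n_sim = 10_000
count = 0
for _ in range(n_sim):
    rng.shuffle(pooled)
    diff = abs(pooled[:len(y1)].mean() - pooled[len(y1):].mean())
    count += diff >= observed

print("p-value:", count / n_sim)
# A small p-value says the data are unlikely *if H0 is true*; it does not,
# by itself, demonstrate anything about H1.
```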

NHST only tests against one very specific hypothesis, but that hypothesis doesn't have to be mu1 = mu2; it could be anything. If you come up with a different specific H0, then you can apply the same technique. But it has to be specific.

Let's say you've decided H0 is that mu1 exceeds mu2 by X. Immediately you're stuck, because you have to start by arguing for some true difference that you don't know. How big a difference were you looking for? You might start from a marginal power calculation, or you might have to look at the literature, but either way you can see the principal problem. Working backwards through the simulation is easy once you have that difference: you assume that mu1 exceeds mu2 by X, simulate what happens if that's really true, and find the probability of observing a difference as small as, or smaller than, the one you did. If that probability is very low, then it's unlikely that mu1 exceeds mu2 by that much.
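A sketch of that reversed test, with the difference X, the noise level, and the sample sizes all stated up front as assumptions, which is exactly the part you would have to argue for:

```python
# Sketch: take H0 to be "mu1 exceeds mu2 by X" and ask how often a
# difference as small as (or smaller than) the observed one would arise
# if that were true. X, sigma, and the sample sizes are assumptions.
import numpy as np

rng = np.random.default_rng(3)
X = 1.0               # assumed true difference under this H0
sigma = 1.0           # assumed within-group standard deviation
n1 = n2 = 20
observed_diff = 0.2   # hypothetical observed difference in means

n_sim = 10_000
diffs = (rng.normal(X, sigma, (n_sim, n1)).mean(axis=1)
         - rng.normal(0.0, sigma, (n_sim, n2)).mean(axis=1))
print("P(diff <= observed | mu1 - mu2 = X):", np.mean(diffs <= observed_diff))
# If this probability is small, a true difference as large as X is
# implausible, which supports "no difference of at least X", not
# "no difference at all".
```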

This highlights a major weakness of null hypothesis significance testing (NHST). It's not all that obvious in the original test, but you don't really demonstrate that anything useful is false. You only use logic to discard an exactly-zero difference, because the observed data would be unlikely if that were true; it says nothing about a tiny real difference you don't care about. The same is true of our modified test: it doesn't demonstrate that mu1 = mu2, only that the data are unlikely if mu1 exceeds mu2 by X. The logic applies to any point estimate. A "significant" NHST says the difference is not exactly X (which might be 0), but it hasn't excluded any other difference (I'm conflating one-tailed and two-tailed tests here a bit for simplicity).

The short answer is that NHST cannot be used to demonstrate that H0 is true, just as it can't be used to demonstrate that H1 is true; it can only show that the observed data are unlikely if H0 is true (which is not even the same as showing that H0 is false).

Typically people handle such issues by working out how much of a difference matters and checking whether the difference found is small relative to that (regardless of the test). Furthermore, a difference can be shown to be reliably small independently of significance tests, by reporting not just point estimates but ranges of likely values (confidence intervals). That is far more informative than a significance test.
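For example, a Welch-style 95% confidence interval for the difference in means takes only a few lines (using the same made-up samples as in the first sketch):

```python
# Sketch: a 95% confidence interval for mu1 - mu2 (Welch approximation),
# which is more informative than a bare significance verdict.
import numpy as np
from scipy import stats

y1 = np.array([5.1, 4.8, 5.5, 5.0, 5.3])
y2 = np.array([4.2, 4.5, 4.0, 4.4, 4.1])

diff = y1.mean() - y2.mean()
v1, v2 = y1.var(ddof=1) / len(y1), y2.var(ddof=1) / len(y2)
se = np.sqrt(v1 + v2)
# Welch-Satterthwaite degrees of freedom
df = (v1 + v2) ** 2 / (v1**2 / (len(y1) - 1) + v2**2 / (len(y2) - 1))
t = stats.t.ppf(0.975, df)
print(f"95% CI for mu1 - mu2: ({diff - t*se:.2f}, {diff + t*se:.2f})")
# If the whole interval lies inside the range of differences too small to
# matter, that is direct evidence of "no difference that matters".
```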

$\begingroup$This makes me wonder, then: what is the point of stating the alternative hypothesis if you're saying it can't be demonstrated to be true?$\endgroup$
– Joe Jun 13 '13 at 23:36

$\begingroup$Keep in mind that while H0 is specific, H1 is very vague: the alternative is essentially what's left when H0 is gone. When I say you can't prove H1, consider how convoluted the logic really is here. The statistic you calculate is the probability of the data if H0 is true. If you decide that's too low to believe H0, you discard H0. But wait, now you have a statistic that means nothing, because it was calculated assuming H0 is true and you've just decided H0 is false, so you're left with little to stand on to argue for anything.$\endgroup$
– John Jun 14 '13 at 2:02

$\begingroup$Ok, I see your point now. A lot of the time H1 is indeed very vague, which probably means that even though you are technically accepting H1, it isn't conclusive. Thanks$\endgroup$
– Joe Jun 14 '13 at 16:11