I've been reading a lot lately about the differences between Fisher's method of hypothesis testing and the Neyman-Pearson school of thought.

My question is, setting aside philosophical objections for a moment: when should we use Fisher's approach to significance testing, and when should we use the Neyman-Pearson method of fixed significance levels, power, et cetera? Is there a practical way of deciding which viewpoint to endorse in any given practical problem?

Hello, Stijn! May I suggest that you consider making gung's excellent answer the accepted one? The only reason I write this comment is that perhaps you are unaware that you can change the accepted answer at any time.
– amoeba Feb 17 at 0:04

@amoeba thanks for letting me know, I didn't realize I could do that. I agree that it deserves the checkmark!
– Stijn Feb 18 at 13:03

3 Answers

Let me start by defining the terms of the discussion as I see them. A p-value is the probability of getting a sample statistic (say, a sample mean) as far from some reference value as your observed statistic, or further, if the reference value were the true population parameter. For example, a p-value answers the question: what is the probability of getting a sample mean IQ more than $|\bar x-100|$ points away from 100, if 100 really is the mean of the population from which your sample was drawn? Now the issue is: how should that number be employed in making a statistical inference?
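A minimal numerical sketch of that definition (assuming, purely for illustration, a known population standard deviation of 15 and a z-test rather than a t-test):

```python
import math

def two_sided_p(xbar, mu0=100.0, sigma=15.0, n=25):
    """Two-sided p-value for a z-test: the probability of a sample mean
    at least |xbar - mu0| away from mu0, if mu0 is the true mean."""
    se = sigma / math.sqrt(n)          # standard error of the sample mean
    z = (xbar - mu0) / se              # how many standard errors away we landed
    # normal tail probability via the error function, doubled for two sides
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# A sample mean IQ of 106 from n = 25 gives z = 2, so p is about 0.0455
```

The `mu0`, `sigma`, and `n` values here are invented for the IQ example in the text, not part of any general recipe.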

Fisher thought that the p-value could be interpreted as a continuous measure of evidence against the null hypothesis. There is no particular fixed value at which the results become 'significant'. The way I usually try to get this across to people is to point out that, for all intents and purposes, p=.049 and p=.051 constitute an identical amount of evidence against the null hypothesis (cf. @Henrik's answer here).

On the other hand, Neyman & Pearson thought you could use the p-value as part of a formalized decision-making process. At the end of your investigation, you have to either reject the null hypothesis or fail to reject the null hypothesis. In addition, the null hypothesis could be either true or false. Thus, there are four theoretical possibilities (although in any given situation, there are just two): you could make a correct decision (fail to reject a true--or reject a false--null hypothesis), or you could make a type I or type II error (by rejecting a true null, or failing to reject a false null hypothesis, respectively). (Note that the p-value is not the same thing as the type I error rate, which I discuss here.) The p-value allows the process of deciding whether or not to reject the null hypothesis to be formalized. Within the Neyman-Pearson framework, the process works like this: there is a null hypothesis that people will believe by default in the absence of sufficient evidence to the contrary, and an alternative hypothesis that you believe may be true instead. There are some long-run error rates that you are willing to live with (note that there is no reason these have to be 5% and 20%). Given these, you design your study to differentiate between the two hypotheses while maintaining, at most, those error rates, by conducting a power analysis and running your study accordingly. (Typically, this means having sufficient data.) After your study is completed, you compare your p-value to $\alpha$: if $p<\alpha$, you reject the null hypothesis; otherwise, you fail to reject it. Either way, your study is complete and you have made your decision.
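The Neyman-Pearson workflow described above -- fix the error rates, size the study, then make a binary decision -- can be sketched as follows (assuming a two-sided one-sample z-test; the effect size, alpha, and power values are illustrative, not prescribed):

```python
import math
from statistics import NormalDist

def required_n(effect_size, alpha=0.05, power=0.80):
    """Smallest n for a two-sided one-sample z-test to detect a
    standardized effect of `effect_size` at the chosen error rates."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the Type I rate
    z_beta = z.inv_cdf(power)            # quantile for the Type II rate (1 - power)
    return math.ceil(((z_alpha + z_beta) / effect_size) ** 2)

def decide(p_value, alpha=0.05):
    """The Neyman-Pearson endpoint: a binary decision, then walk away."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# To detect a standardized effect of 0.5 at alpha = .05 with 80% power,
# n = ceil(((1.96 + 0.8416) / 0.5)**2) = 32 observations are needed.
```

Note how the error rates enter twice: once up front in the power analysis, and once at the end in the reject/fail-to-reject rule.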

The Fisherian and Neyman-Pearson approaches are not the same. The central contention of the Neyman-Pearson framework is that at the end of your study, you have to make a decision and walk away. Allegedly, a researcher once approached Fisher with 'non-significant' results, asking him what he should do, and Fisher said, 'go get more data'.

Personally, I find the elegant logic of the Neyman-Pearson approach very appealing. But I don't think it's always appropriate. To my mind, at least two conditions must be met before the Neyman-Pearson framework should be considered:

There should be some specific alternative hypothesis (effect magnitude) that you care about for some reason. (I don't care what the effect size is, what your reason is, whether it's well-founded or coherent, etc., only that you have one.)

There should be some reason to suspect that the effect will be 'significant', if the alternative hypothesis is true. (In practice, this will typically mean that you conducted a power analysis, and have enough data.)

When these conditions aren't met, the p-value can still be interpreted in keeping with Fisher's ideas. Moreover, it seems likely to me that most of the time these conditions are not met. Here are some easy examples that come to mind, where tests are run, but the above conditions are not met:

the omnibus ANOVA for a multiple regression model (it is possible to figure out how all the hypothesized non-zero slope parameters come together to create a non-centrality parameter for the F distribution, but it isn't remotely intuitive, and I doubt anyone does it)

the value of a Shapiro-Wilk test of the normality of your residuals in a regression analysis (what magnitude of $W$ do you care about and why? how much power do you have to reject the null when that magnitude is correct?)

the value of a test of homogeneity of variance (e.g., Levene's test; same comments as above)

any other tests to check assumptions, etc.

t-tests of covariates other than the explanatory variable of primary interest in the study

Fisher's significance testing can be interpreted as a way of deciding whether or not the data suggest any interesting 'signal'. We either reject the null hypothesis (which may be a Type I error) or say nothing at all. For example, in many modern 'omics' applications this interpretation fits: we don't want to make too many Type I errors, and we do want to pull out the most exciting signals, though we may miss some.

Neyman-Pearson hypothesis testing makes sense when there are two disjoint alternatives (e.g. the Higgs boson does or does not exist) between which we must decide. As well as the risk of a Type I error, here we can also make a Type II error - when there is a real signal but we declare it absent, making a 'null' decision. N-P's argument was that, while keeping the Type I error rate controlled, we should minimize the risk of Type II errors.

Often, neither system will seem perfect - for example you may just want a point estimate and corresponding measure of uncertainty. Also, it may not matter which version you use, because you report the p-value and leave test interpretation to the reader. But to choose between the approaches above, identify whether (or not) Type II errors are relevant to your application.
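The point-estimate-plus-uncertainty alternative mentioned above might look like this as a sketch (a normal-approximation interval for a mean; for small samples a t-based interval would be more appropriate):

```python
import math
from statistics import NormalDist, mean, stdev

def ci_mean(sample, level=0.95):
    """Normal-approximation confidence interval for a mean: a point
    estimate plus a measure of uncertainty, with no accept/reject step."""
    n = len(sample)
    m = mean(sample)
    se = stdev(sample) / math.sqrt(n)          # estimated standard error
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. 1.96 for 95%
    return m - z * se, m + z * se
```

The reader can then judge the estimate and its precision without the author committing to either testing framework.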

The whole point is that you cannot ignore the philosophical differences. A mathematical procedure in statistics doesn't just stand alone as something you apply without some underlying hypotheses, assumptions, theory... philosophy.

That said, if you insist on sticking with frequentist philosophies there might be a few very specific kinds of problems where Neyman-Pearson really needs to be considered. They'd all fall in the class of repeated testing like quality control or fMRI. Setting a specific alpha beforehand and considering the whole Type I, Type II, and power framework becomes more important in that setting.
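As a sketch of why a pre-set alpha matters under repeated testing, here is the simplest family-wise correction (Bonferroni); the p-values below are made up for illustration:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Control the family-wise Type I error rate across many tests
    by comparing each p-value against alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# With 4 tests at a family-wise alpha of .05, each p-value is
# compared against .0125 rather than .05.
```

In settings like fMRI, more refined procedures (e.g. false-discovery-rate control) are typically used, but the principle of fixing the error budget in advance is the same.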

I don't insist on sticking to frequentist statistics, but I was just wondering if there are situations where adopting a Fisher or Neyman-Pearson viewpoint might be natural. I know there is a philosophical distinction, but perhaps there's also a practical side to be considered?
– Stijn Feb 20 '12 at 19:00


OK, well pretty much just what I said... Neyman-Pearson really were concerned with situations where you do lots and lots of tests without any real theoretical underpinnings to each one. The Fisher viewpoint doesn't really address that issue.
– John Feb 20 '12 at 19:20