I really liked your article! Indeed, the point about scientific inference being about information COMBINATION is the critical shift in perspective that we need today. It implies that decision criteria should be eschewed, and that multiple evidentiary measures (and original data) should be reported, so that results can be more easily combined.

I’m really chuffed that there’s so much activity from so many great thinkers on this issue now!

By: Shlomo Engelson Argamon
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-590186
Wed, 18 Oct 2017 13:27:37 +0000

Wow – a significant point indeed! This is something that clearly needs to be widely known. The “magical threshold” problem, of course, is much more general.

(Would you mind writing this as a comment to my post as well, for that readership, or may I copy it there? Thanks!)

You write, “Nothing magical happens when going from a p-value of 0.051 to one of 0.049.” But it’s much worse than that! Actually, nothing magical happens when going from a p-value of 0.005 (z=2.8) to one of 0.2 (z=1.3): the difference between these z-values is a mere 1.5, which is nothing remarkable at all, given that the difference between two independent random variables, each with standard error 1, will have standard error 1.4. This was the point that Hal Stern and I made in our paper, “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant.” The problem is not just an arbitrary threshold; the more fundamental issue is that the p-value is a very noisy statistic. Practitioners seem to think that p=0.005 is very strong evidence while p=0.2 is no evidence at all, yet it’s no surprise at all to see both of these from two different measures of the very same effect.
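The arithmetic here is easy to check. Below is a small sketch (my own, using only the Python standard library; the two-sided p-to-z conversion is an assumption about the convention meant):

```python
# Convert two-sided p-values to z-scores, then compare their difference
# to the standard error of a difference of two independent unit-SE estimates.
from statistics import NormalDist

nd = NormalDist()
z1 = nd.inv_cdf(1 - 0.005 / 2)  # two-sided p = 0.005 -> z ~ 2.81
z2 = nd.inv_cdf(1 - 0.2 / 2)    # two-sided p = 0.2   -> z ~ 1.28

diff = z1 - z2        # ~1.53: the gap between "strong evidence" and "no evidence"
se_diff = 2 ** 0.5    # ~1.41: SE of the difference of two independent estimates

# The gap is barely one standard error of a difference -- unremarkable noise.
print(round(z1, 2), round(z2, 2), round(diff, 2), round(se_diff, 2))
```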

The lexicographic gatekeeping function (or as I call it “decision criterion science”) is pernicious in just so many ways, and prevents a clear understanding of the true evidentiary value of each given result, and how it must be considered in combination with others.

By: Martha (Smith)
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-577019
Tue, 03 Oct 2017 21:46:00 +0000

“I do agree that significance testing can be especially dangerous in complex environments insofar as it leads people to think in simple terms and downplay uncertainty.”

+1

By: Ben Prytherch
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-576837
Tue, 03 Oct 2017 19:12:48 +0000

“From what I can tell, the argument is that it is not possible to adequately describe and evaluate a complex system.”

I think this is often the case, but I’m not sure it’s the crux of the problem. Even applied to relatively simple systems, significance testing combined with noisy measurements and small effects and publication bias and flexibility in analyses and hypothesizing will lead to the same problems – people using patterns in noise to fool themselves and others into thinking they’ve discovered underlying truths.

I do agree that significance testing can be especially dangerous in complex environments insofar as it leads people to think in simple terms and downplay uncertainty.

By: Vangel Vesovski
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-575381
Mon, 02 Oct 2017 14:30:27 +0000

I am clearly not as smart or as familiar with the issue as you guys, so please correct me if I misunderstand the point. From what I can tell, the argument is that it is not possible to adequately describe and evaluate a complex system. The crisis in the biomedical and social sciences comes from the fact that true empirical evaluation is impossible, and that creates a replication problem that can be ‘solved’ by ensuring that those who are smart enough to cut through the verbiage will understand that the results are not particularly meaningful in a general way. But that leads me to question the need for all of the supposed ‘scientific’ testing in the first place. Why not just say that the uncertainties are large, so anyone wishing to use a particular product for medical purposes is taking risks that are not easily quantifiable? That would mean we would not have to waste massive amounts of resources on bureaucratic approval processes that deny individuals access to possible cures, and that manufacturers of dangerous products could not hide behind a faulty approval process.

Can any of us imagine what someone like Richard Feynman would say if we talked about 95% confidence intervals being meaningful in any way?

By: Keith O'Rourke
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-575299
Mon, 02 Oct 2017 11:56:07 +0000

> need to ban is science done by people who don’t care about logic.

Interesting – I wonder what the percentage of those who adequately care about logic is in various disciplines.

By: Corey
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-574904
Sun, 01 Oct 2017 20:48:34 +0000

Also not the only extension per se; just the only one that lives in a dense subset of the reals and satisfies some regularity conditions. You can relax those desiderata and get other systems.
By: Martha (Smith)
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-574875
Sun, 01 Oct 2017 19:48:41 +0000

How about “Uncertainty and Evidence”?
By: ojm
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-574865
Sun, 01 Oct 2017 19:40:25 +0000

Figured you just misspoke :-)
By: Daniel Lakeland
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-574857
Sun, 01 Oct 2017 19:19:04 +0000

Whoops, that’s what I meant – just sloppy. Thanks for the catch.
By: ojm
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-574835
Sun, 01 Oct 2017 18:38:20 +0000

> No, this is not true, the Cox theorem says that the only generalization of predicate logic…

Very off topic but Cox’s theorem doesn’t generalise predicate logic, it ‘generalises’ simple _propositional_ logic. Extending probabilistic logics to interact well with quantifiers is to some extent an unsettled question.

” The question really is whether it has anything to do with data collection in scientific enterprises. » That is THE question. I agree, of course, with the frequentist interpretation of probability and recognize that p-value may be useful, but only under specific situation, for instance to check properties of an estimator, a situation that pertains to the statistician, not the lay user of t-test and chi-square.
But, if I understand and see the logic in the computation of a p-value in the framework of NHT, I do not understand the usefulness of “data more distant from the null than the observed one” in decisional problem. This is the specific context I wanted to talk about. From a practical (at least medical) point of view, I don’t see how they could be justified (hence my remark on the likelihood principle). It is as if a physician would consider a body temperature of 40° when in fact the patient is 39°. Non-observed data are not relevant in the process. Do you have a real, practical, exemple of a situation in which using a p-value would be really useful ?
A ban on p-value may be too strong an answer but I stick to the fact that the message to the community of users must be very clear. Maybe the real or major problem is the fact that the majority of consumers and producer of statistics are not statisticians themselves. As I always say to my colleagues, I do not do surgery in my kitchen, as an amateur, please do not do statistics on your own. They will go using a recipe, and the recipe must be as clear as possible.

By: Daniel Lakeland
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-574662
Sun, 01 Oct 2017 13:20:57 +0000

“Moreover, the Cox theorem indicates that the only valid interpretation of probability must be done based on the Bayesian paradigm”

No, this is not true, the Cox theorem says that the only generalization of predicate logic that agrees with binary logic in the limit is probability calculus.

The frequency interpretation of probability is a totally valid mathematical construct. It corresponds to the behavior of certain types of infinite sequences of numbers. The question really is whether it has anything to do with data collection in scientific enterprises.

p-values do exactly one thing: they tell you how probable a certain range of test statistics is if the data come out of an algorithmically random mathematical sequence. There is nothing wrong with this logic. What’s wrong is *applying this logic where it’s inappropriate*.
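That claim can be checked by simulation. A minimal sketch (my own illustration, using a z-test of a sample mean as a stand-in for “a certain range of test statistics”): when the null model really did generate the data, the p-value is uniformly distributed, so a small p is exactly as probable as any equally wide range of large ones.

```python
# Simulate p-values when the null hypothesis is exactly true:
# a z-test of a sample mean against 0, with data truly drawn from N(0, 1).
import random
from statistics import NormalDist, mean

random.seed(1)
nd = NormalDist()
pvals = []
for _ in range(20000):
    xs = [random.gauss(0, 1) for _ in range(25)]
    z = mean(xs) * 25 ** 0.5          # z = xbar / (sigma / sqrt(n)), sigma = 1
    pvals.append(2 * (1 - nd.cdf(abs(z))))

# Under the null, p-values are uniform: ~5% fall below 0.05, ~20% below 0.20.
frac_05 = sum(p < 0.05 for p in pvals) / len(pvals)
frac_20 = sum(p < 0.20 for p in pvals) / len(pvals)
print(round(frac_05, 3), round(frac_20, 3))
```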

Although I agree with your diagnosis of what is wrong with people who use p-values and how they cut first and ask questions later, I think it goes too far to say that we should somehow “ban” the p-value. What we need to ban is science done by people who don’t care about logic.

I read your paper and this post with great interest. It’s brilliant and thought-provoking, as usual.
But, nevertheless, I must say that I am a bit flabbergasted by one point: while you remind us of the different flaws of the p-value, your proposals for the future nevertheless still include p-values in the general process of statistical learning and decision. In my view, this may considerably blur the message.

Maybe I am totally mistaken and the p-value does have some interest – but which one? I really don’t understand, and if it has an interest, what is wrong with all the papers that have for so long exposed its flaws?

My understanding of the issue is that the problem is not (only) in the .05 threshold used, or whatever value it may take, but in the definition and nature of the p-value itself, and its inherent defects: non-respect of the likelihood principle, lack of incorporation of prior information, etc. Moreover, the Cox theorem indicates that the only valid interpretation of probability must be done based on the Bayesian paradigm. Then, whether or not they are blended into a more global view of the statistical analysis of a study, p-values still have those defects. What is the use of a defective tool, and thus what is the use of including it in the statistical reflection?
So, I fear that keeping p-values, even when incorporated in a more flexible and fluid reflection on the data and after getting rid of any magic threshold, will not be accompanied by better reflection. I work as a biostatistician in a French medical university and I can assure you that almost all my clinician colleagues cut information first (more or less than 0.05) and only then (sometimes) discuss the methodology of the study, while we should do the reverse and publish all results, be they positive or negative. They have been wrongly, empirically, but heavily taught by the literature for so long to cut anything that comes up at the 5% level. As long as there is something to cut, they will cut first and reflect afterwards. It will take decades to get rid of p-values if physicians remain “authorized” to use them, even without a threshold and even alongside other statistical tools.
The only way out of this mess is to completely get rid of p-values. The ASA meeting on “the world beyond p < 0.05” will take place shortly. I think a very clear message must be given; otherwise it will be difficult to advocate a p-value-less world. Letting p-values make their way into a paper will only muddle a study’s message. My colleagues judge almost solely by the p-value, and habits are such that if p-values continue to appear in the results, they will still be the only piece of evidence relied on. I would bet on that: I am certain to win. The reproducibility issues and the selection of only positive results will then go on.

And things could always be worse:
5%
0.05
To p<0.05, or not to p<0.05?

By: Kyle C
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-574324
Sun, 01 Oct 2017 00:24:22 +0000

Mayo’s prose is so impenetrable that she can always claim to have been misunderstood. Perhaps even “impenetrable, so that….”
By: Ben Prytherch
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-574273
Sat, 30 Sep 2017 22:27:23 +0000

Whoops, the URL didn’t need to be in there twice
By: Ben Prytherch
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-574268
Sat, 30 Sep 2017 22:22:31 +0000

Kyle, I looked back over the comments thread in question and I put some words in her mouth in my post above. Here it is:

She does state that the classical “error probabilistic” method is “the most natural way in which humans regularly reason in the face of error prone inferences”, and that “you could say that when people assign the probability to the result they are only reflecting an intrinsic use of probability as attaching to the method”. This was in response to my claim that when people interpret the p-value as “the probability the results occurred due to chance”, they are mistakenly interpreting the p-value as the probability of the null. I take her response to mean that she thinks people who interpret the p-value as “the probability the results occurred due to chance” are aware of the fact that the probability statement refers to the error rate of the procedure and not the null itself.

By: Kyle MacDonald
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-574088
Sat, 30 Sep 2017 16:10:54 +0000

If Mayo in fact does believe that most people think in a way approximately captured by your quotation marks (“there is some fixed truth … random procedure”), then I would be fascinated to read her defence of this belief.
By: Noah Motion
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-574056
Sat, 30 Sep 2017 15:22:10 +0000

“With several tests: Prob(test T < d0; H0) = very high, yet reliably produce d ≥ d0, thus not-H0.”

Modus Tollens applies to one-off situations. If we have “If A, then B,” then a single observation of not-B suffices to conclude not-A. In addition to this not working in general when probability is added to the mix, this (appropriate) mention of reliable production of particular results makes it clear that we’re not talking about one-off situations. Rather, we’re talking about exactly the kind of thing that McShane et al discuss (among many others, including Mayo in other places), namely how design, measurement, and statistical tools relate to each other, and how these jointly license inferences.

This quote also makes it look an awful lot like Mayo is granting that a single statistically significant result carries little, if any, evidential weight, since by itself it doesn’t indicate that we can, in fact, reliably produce a particular effect. This bolsters the argument from McShane et al. that statistical significance should not be used as a publication filter.

I agree; I didn’t mean to imply otherwise. I don’t like dichotomization of evidence at all. I tend to take the stance that embracing uncertainty is useful, and if a decision is to be made, it’s with respect to some posterior uncertainty (or that + a utility function). I’d rather describe a posterior distribution, or at the least some estimate + some error, and either let the reader decide for themselves or formalize a decision in the context of theory or prediction. E.g., we have had posteriors on interactions that were essentially N(0, .1), and I more or less said “if there’s an effect, we can’t tell whether it’s + or –, but regardless it appears to be tiny and negligible.”

But with that said, IF one makes a decision about some substantive hypothesis, I just meant that with the logic of probabilistic modus tollens, posterior quantities or basically non-NHST derived quantities can tell you the same thing. IF you’re going to make a decision about some substantive hypothesis, you can do so via the posterior. “If H0, then negative; effect probably positive, therefore not H0.”

I posted this elsewhere recently, which is relevant to this whole discussion.

I think one can say that the VAST majority of nil-nulls are false, to the point where nil-nulls should be the exception, not the rule.

And exactly: If nil-nulls are false, then there really isn’t a point in using a nil-null hypothesis test.

True nulls can exist, in very very few cases. E.g., when a physical law is in question. Like… ESP would actually require some rethinking about foundational physical laws if it were true; given our physical constraints, there truly can be a ZERO effect, as in it truly cannot occur.

But in the social sciences, we rarely have these sorts of questions, because most of our questions don’t require rewriting a constraint due to a physical law. Otherwise, everything is connected in some way, and out to SOME decimal place, there is an effect of everything on everything else.

The only realistic cases in psych where some statistic may be truly zero are when the measurement sucks too badly to even detect a difference in the population. Like, imagine that on some latent continuum the statistic is .000001, but our measure only permits .1 differences at its smallest. Then even if we had every single person in the population, we would see zero difference – but only because our measure sucks, not because there’s literally zero difference. In this case, conditional on our measure, the true effect could be “zero,” in the sense that asymptotically, with everyone in the population, the true value is zero; but that is not true unconditional of our measure. With a more precise measure, the true value would NOT be zero, but perhaps .000001.
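A toy calculation makes this resolution point concrete (the 1e-6 latent difference and the 0.1-step measure are the commenter's hypothetical numbers; the rounding model is my own sketch):

```python
# A measure that only resolves steps of 0.1 erases a latent difference of 1e-6.
def measure(latent, resolution=0.1):
    """Round a latent value to the nearest step the instrument can record."""
    return round(latent / resolution) * resolution

group_a = [0.3 for _ in range(1000)]          # latent mean 0.300000
group_b = [0.3 + 1e-6 for _ in range(1000)]   # latent mean 0.300001

obs_a = [measure(x) for x in group_a]
obs_b = [measure(x) for x in group_b]

# Conditional on this measure, the observed difference is exactly zero,
# even with the "whole population" in hand; the latent difference is not.
diff = sum(obs_b) / len(obs_b) - sum(obs_a) / len(obs_a)
print(diff)  # 0.0
```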

By: Martha (Smith)
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-573547
Sat, 30 Sep 2017 01:48:18 +0000

I heartily agree with your last sentence. (In particular, I’ve seen a lot of textbooks that get it wrong, or at least explain significance tests in such a watered-down version that they’re subject to misinterpretation.)
By: Ben Prytherch
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-573473
Sat, 30 Sep 2017 00:17:55 +0000

I love this last sentence:

Yes – and I think the generalization of modus tollens being used in practice is more like “if A, then probably not B; B, therefore probably not A”:

If H0, t(x) probably not greater than t*; t(x) greater than t*, therefore probably not H0.

Now, I know Mayo and other proponents of significance testing don’t actually use this reasoning. They know that, in classical statistics, we don’t get to say “probably not H0”; we can only say that we reject or fail to reject H0 according to a procedure that has a given error rate, and the strength of our inference rests upon the error rate of the procedure.

But I strongly suspect most significance tests are performed and interpreted by people who have “p<0.05, therefore probably not H0” in their heads. And I think this is because the frequency interpretation of probability is not intuitive, except in the canned examples taught in intro stats classes. I think that most people treat probability as quantifying uncertainty, and so probabilities can apply to truth statements, e.g. H0. And since probabilities can apply to truth statements, the p-value can (and does!) tell us the probability of H0.

I brought this up once on Mayo’s blog, and my interpretation of her response is that she doesn’t agree that scientists typically think this way – I think she sees the natural way of applying probabilistic thinking to scientific questions as “there is some fixed truth out there in the world, and probabilistic statements cannot be made about this truth, they can only be made about data gathered by a random procedure.”

And so this whole brouhaha about people misinterpreting p-values and misinterpreting the results of significance tests comes down to the critics of significance testing unfairly assuming that users and consumers of significance tests don’t really understand how they work. Hopefully I’m not misrepresenting her.

Needless to say, I don't think that (most) users and consumers of significance tests really understand how they work.
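One way to see why “p<0.05, therefore probably not H0” fails as reasoning is a short Bayes calculation (a sketch of my own; the base rate of true nulls and the test's power below are illustrative assumptions, not numbers from the thread):

```python
# Posterior probability that H0 is true, given that the test rejected at 0.05.
prior_h0 = 0.9   # assumed: 90% of tested hypotheses are (near-)null
alpha = 0.05     # probability of rejecting when H0 holds
power = 0.5      # assumed: probability of rejecting when H0 is false

p_reject = alpha * prior_h0 + power * (1 - prior_h0)
posterior_h0 = alpha * prior_h0 / p_reject

# ~0.47: after "p < 0.05" it is still roughly a coin flip whether H0 holds,
# so the p-value is not the probability of the null.
print(round(posterior_h0, 3))  # 0.474
```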

If I spill a glass of water on the grass, does that count as wet enough?

These might seem like trolling questions but this sort of thing can matter in the high noise, highly variable and imprecisely measured situations we’re typically talking about.

Even in super precise contexts we need more than basic logic – that’s why we have mathematics! Most of the philosophy of science I’ve read seems to have a real obsession with starting from things like ‘H: Newton’s theory’. I find it hard to take anything that starts from that point too seriously.

In my view scientific theories are not (or almost never) simple Boolean propositions and hence don’t have simple truth values as such. Some people seem to freak out at this and think everything is about to go postmodern but I think that’s more like the argument that you can’t be good without god than anything too serious.

Yes, whether it rains or not is a discrete outcome, and for this, discrete models make sense. Similarly, I think discrete models make sense for some questions of the form, is Trait A associated with Gene B. But I don’t think discrete models make sense for questions of the form, is embodied cognition a real thing.

You quote Mayo as saying, “With several tests: Prob(test T < d0; H0) = very high, yet reliably produce d ≥ d0, thus not-H0.” That’s all fine, but in just about every problem I’ve seen, I know ahead of time that H0 (the null hypothesis of zero effect and zero systematic error) is false. So inferring “not-H0” doesn’t really do anything for me.
As I've also said many times in various blog comments and elsewhere, I do think there are problems where Ho could be approximately true, in examples such as genetics or disease classification, examples of discrete models. But I don't think this reasoning applies in settings such as "the effects of early childhood intervention" or "the effects of power pose" or "different political attitudes at different times of the month" or "the effects of shark attacks on voting." These are settings where we have every reason to believe that there are real effects, but these effects will be highly variable, unpredictable, and difficult to measure accurately. Rejecting Ho tells me nothing in these settings.

By: Corey
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-573262
Fri, 29 Sep 2017 17:31:33 +0000

You’re not wrong; there’s more than one extension of logic to values beyond just the elements of the set {true, false} that can be used for data analysis, each with its own version of extended modus tollens (but we all know which one is the best one, nudge nudge, nudge nudge, know what I mean, say no more).

A quibble: even in the probabilistic version the minor premise is just “not B” — the data are whatever is observed, by definition. That is, the p-value version of modus tollens goes,

Major premise: If H_0 holds then the probability of observing a realized value of X less than x is some value p close to 1.
Minor premise: X = x.
—
Therefore, H_0 is false or else something improbable has happened.

Me:
"But with that logic, all sorts of inferential methods could be used. If H0, then BF 0|y) = .99; therefore not H0.”

Am I wrong here?

Essentially, nil-null hypothesis testing isn’t an expression of modus tollens, because modus tollens doesn’t include a probabilistic statement. “If A, then B. Not B, therefore not A” doesn’t really work when you have “If A, then probably B; probably not B, therefore probably not A.”
But if you generalized modus tollens to just be “If A, then probably B; probably not B, therefore probably not A,” then lots of inferential frameworks would permit modus tollens anyway. “If H0, then theta probably negative; theta probably not negative, therefore probably not H0” – and that’s a posterior statement regardless of NHST or significance testing.

By: Sander Greenland
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-573211
Fri, 29 Sep 2017 15:44:48 +0000

Thanatos: Thanks for the detailed reply. I don’t see where we are disagreeing on matters of logic, and perhaps not even on matters of law. But as a test I’ll add these potentially controversial observations of mine:

1) I can’t fathom why some still defend having testifying (as opposed to consulting) experts hired by the opposing parties, which easily creates situations in which the majority of unbiased expert opinions get excluded because they don’t fall clearly enough on either side. The goal as I see it should be to de-bias expert testimony as much as possible and get a fair representation of what is out in the field, given that the triers of fact will not have the expertise to do so very effectively. Your Hand story seems to support my view of this problem. And court experts could help address the problem of absurd claims getting filed: the availability of court experts early on should help get more of these meritless cases thrown out early, and could discourage such filings if the plaintiff has to pay the cost of this process when complete lack of scientific merit is determined (and make such a determination more sound than a court could achieve). Each party might recommend experts to the court, but again all testifying experts need to be screened by all parties for COIs and other bias sources (such as connections to the parties).

2) Regarding Hill: Like P-values, his list takes a lot of misplaced blame for abuse in the hands of incompetent or biased users. To be sure the list is far from perfect (Mod Epi goes through the list critically) but like a chainsaw it can be used constructively (to remind one of relevant items to check) or for butchery (if taken as a set of necessary conditions, as I often see defense do). There have been many proposals to update it but none I’ve seen result in dramatic shifts, and in my view only support the idea that it was quite a nice summary for 1965 when relative risks of 10 were being contested by industries. This says to me such updates as needed are for extension to our current, far less certain RR of 2 or less era, not for philosophic subtleties. And that means modern causal analysis tools will enter into its application.

3) There is a fundamental asymmetry I’ve observed between defense and plaintiff lawyers in torts: Plaintiff lawyers search for cases they can win, which for the best of them means hiring consulting experts at the outset to judge whether the science could actually support causation, before entering the arduous and expensive litigation process. Defense is more reactive, having to defend in response to plaintiff claims and build a case against causation regardless of what the actual science shows (even if only to minimize final judgments or drag out the case to bankrupt plaintiffs, as I’ve seen happen). This asymmetry is incomplete: again, there are plenty of absurd claims filed, but again the availability of experts for the court should help get more of these thrown out early, and discourage meritless filings if the plaintiff has to pay the cost of this process when that happens.

4) Given the inertia that has greeted proposals and mechanisms for court experts, we need effective if less ideal solutions to the current mess. The Reference Manual on Scientific Evidence is one and my impression is it has helped quite a bit, albeit in a limited sphere. I think it needs updating with coverage of cognitive biases, with guidelines to help courts detect not only bias in expert opinions, but also to help de-bias their own judgments.

That said, I was puzzled by your comment about helping with the Reference Manual on Scientific Evidence, insofar as I’m not one known to be shy about sharing my opinions and criticisms (Andrew even blogged humorously about that very fact) – so if they ask they’ll get plenty. Having been however at the 2003 San Diego conference on science and the law that fed into the latest edition and not contacted since, I won’t hold my breath for that.

Statistics should be about grasping the real uncertainties from what we observe, and you seem to be requesting some certainty about that uncertainty (e.g. uniform physical rules, natural laws, something about general principles?) – and that is very likely just not possible.

Unlike commonly accepted accounting practices, Newtonian physics that is adequate for a given purpose, etc., I do not think such things are in the foreseeable future for statistical practice. But, as I said earlier, “That is what I see is being worked through currently in the statistical discipline” – so I might be wrong.

On the other hand, science, as hard as it is, does not need to meet the additional requirements of the courts – so I think it is best to start there. If we can sort out what constitutes minimally sound statistical inference in science, then maybe that can be upgraded to meet those additional requirements. And maybe not.

However, the idea of a court-appointed expert, perhaps to arbitrate between the competing experts, seems a far more promising route. I think you have already convincingly pointed out that whatever sensible stuff a statistician might write will be naively or purposely twisted into something very different that will do a lot of harm!

By: Thanatos Savehn
http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-572544
Thu, 28 Sep 2017 23:57:06 +0000

An excellent point (and one that resonates, given my experience with the FAA and the capture of its employees at a local tower by a flight outfit that likes to perform 200′ flying and acrobatics over my house, a host of regulations notwithstanding). In most cases having dueling experts works fine. One expert says “the wreck was caused by speeding,” talks about skid marks, and there’s no debate about whether friction is a real phenomenon. The other says “given the damage to the impacted vehicle, the speed at the time the defendant applied his brakes was less than the speed limit,” says F = M*A, and A. B. Hill is not summoned via Ouija board to say whether or not force applied to sheet metal can cause it to be dented.

Perhaps it’s this business of thinking about variable phenomena in a language more suited to discussing gravity, the charge on an electron, etc. that’s at the root of the problem. Maybe when Fisher read Student’s paper and saw how well his empirical test of 750 randomized N=4 finger length samples fit his newly discovered curve he thought “well, it ain’t E=Mc^2 but there is an underlying law here!” Whereupon he immediately fell into reification’s trap because he had no solid theory/philosophy of modeling. I’m hoping Andrew’s new book will devote a page or two (maybe it’s already in the old book but I’m not buying it to find out since the iDataAnalysis Model X is rumored to be coming out soon and I’m cheap) to the fundamental issue of “how (to paraphrase Student) may a practical man reach a definite conclusion from all this modeling you’re going on about, Andrew?”

the experiment was designed and controlled sufficiently well to rule out any plausible cause of a signal other than a gravity wave.

They definitely do put a lot of effort into that, to the point of publishing entire separate papers about it (which is good). But shouldn’t each “ruling out” of each alternative have some kind of probability associated with it? Then plug all that into Bayes’ rule and say “assuming nothing we didn’t check for is going on, here is the probability it is a GW”.
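The combination the commenter describes could be sketched roughly like this; every number below (the GW prior, the per-alternative “leak” probabilities) is invented purely for illustration and is not from LIGO’s actual analysis:

```python
# Hedged sketch: attach a probability to each "ruling out" of an alternative
# explanation, then combine everything via Bayes' rule. All numbers invented.

def p_gw_given_signal(p_signal_given_gw, prior_gw, alternatives):
    """Posterior P(GW | signal), where each alternative is a pair
    (prior probability, residual probability of producing the signal
    despite the checks that tried to rule it out)."""
    # Total probability that the signal arises from some non-GW cause.
    p_signal_given_not_gw = sum(prior * leak for prior, leak in alternatives)
    numer = p_signal_given_gw * prior_gw
    return numer / (numer + p_signal_given_not_gw)

# Invented inputs: a small GW prior, and one lumped alternative
# (instrumental glitch / correlated noise) that the checks mostly rule out.
posterior = p_gw_given_signal(
    p_signal_given_gw=0.9,
    prior_gw=1e-3,
    alternatives=[(0.999, 1e-6)],
)
print(round(posterior, 3))
```

The point of the sketch is only that the conclusion depends on how confidently each alternative was ruled out, not just on the background p-value.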

Clearly they do not rely only on predictions of GR regarding inspiraling black holes, or they wouldn’t do this type of “reject background” analysis. Many other things must cause such signals that correlate across sites in their equipment as well, right?

By: Noah Motion (http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-572530)
Thu, 28 Sep 2017 23:29:15 +0000

This may well be a case in which NHST is an appropriate tool. I don’t know for sure, but it wouldn’t surprise me if both (a) the null model in this case is a well-formulated quantitative model of the system in question without a gravity wave signal, and (b) the experiment was designed and controlled sufficiently well to rule out any plausible cause of a signal other than a gravity wave.
By: Daniel Lakeland (http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-572514)
Thu, 28 Sep 2017 23:00:56 +0000

At the same time, imagine what damage could be done by “regulatory capture” of the list of “allowed” experts who can inform the court… I don’t think the solution to the problem of expert witnesses is to stop lawyers from hiring opposing ones, though I could imagine that an additional expert hired by the judge/court might be of help.
By: Martha (Smith) (http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-572456)
Thu, 28 Sep 2017 21:31:54 +0000

+1
By: Thanatos Savehn (http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-572381)
Thu, 28 Sep 2017 19:57:54 +0000

I’ll let my copy of Modern Epidemiology and a 2+ decades-old redweld full of your papers (from which I have learned much and have been disabused of much) stand as testament that I meant no slight.

What I inartfully attempted to convey was frustration with (a) what Daniel wrote: the attempt by courts to turn your measures of a model’s uncertainty into certainty about a competing (and usually untested) conjecture; and (b) the increasingly common claim made by people citing Rothman/Greenland (especially when we get to the execrable – The Court: “Counsel, let’s now go through the Bradford Hill causal criteria analysis” – phase of the proceedings) that either methods don’t matter or that no method is any more likely to ferret out false claims than any other. The result in at least two cases has been courts deciding that causation can be discovered, and justice thereby done, merely from a credentialed scientist’s assessment of biologic plausibility alone.

What I hoped to get from you assembled scientists was (recalling the witch in Monty Python and the Holy Grail) agreement as to whether there exists some minimal level of testing of hypothetical statistical claims made against you that would, if the test(s) were passed, cause you to say “it’s a fair cop”. As for the other points you raised:

Paragraph 3 (yours): Since the ASA statement on p-values was published there have been more than 500 published legal opinions and orders that contain “statistically significant”, “confidence interval”, “statistical power”, etc. Children are being taken from their parents, prisoners executed, fortunes won or lost, all on the basis of reasoning like this from a recent federal appellate court opinion: “The p-value quantifies the statistical significance of a relationship; the smaller the p-value the greater the likelihood that associations determined in a study do not result from chance.” I’m not saying you’re responsible for this mess. I’m saying you have the street cred to help clean it up. To that end, if you’re asked to help on the next version of the Reference Manual on Scientific Evidence, please consider it.

Paragraph 4 (yours): Responsibility in the context of duty (along with the other aspects of wise judgments) was supposed to be parked in the definition of “legal causation” (which was grounded in public policy theory not too distant from that of public health) but courts began to ignore it 40+ years ago when they became convinced that but-for causation was all that was needed and NHST could find it. Thus the spectacle of the administration of justice via NHST. Thus, I’m in complete agreement.

Finally, as to your fifth paragraph, (because witches), you’ll be pleased to know that the famous jurist Learned Hand made your point in a law review article 116 years ago. In his survey of the use of expert witnesses at trials he covered a number of instances of their use before juries including in “The Witches Case” 6 Howell, State Trials, 697 – “Dr. Brown, of Norwich, was desired to state his opinion of the accused persons, and he was clearly of opinion that they were witches, and he elaborated his opinion by a scientific explanation of the fits to which they were subject.” Hand was troubled by the uses to which “science” had been put but more importantly troubled by the role of experts.

Expert witnesses were supposed to testify to “uniform physical rules, natural laws, or general principles, which the jury must apply to the facts.” That meant that the jury was largely displaced from its usual role of supplying the major premise (drawn from common sense and common experience) to which the admitted testimony of witnesses was applied, but only if the jury believed the expert witness. That in turn drew the focus of the lawyers to the credibility/likeability of the expert rather than the soundness of the claim he/she was making. This is how we wound up with (true story) a highly credentialed, bright, quick and deadly expert witness being paid $2 million per year NOT to testify against a certain party and group of lawyers. Anyway, here’s what Hand wrote in 1901 about paid experts:

“The expert becomes a hired champion of one side… Enough has been said elsewhere as to the natural bias of one called in such matters to represent a single side and liberally paid to defend it. Human nature is too weak for that; I can only appeal to my learned brethren of the long robe to answer candidly how often they look impartially at the law of a case they have become thoroughly interested in, and what kind of experts they think they would make, as to foreign law, in their own cases…. It is obvious that my path has led to a board of experts or a single expert, not called by either side, who shall advise the jury of the general propositions applicable to the case which lie within his province.” http://www.jstor.org/stable/pdf/1322532.pdf

By: Anenoeuoid (http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-572221)
Thu, 28 Sep 2017 16:19:41 +0000

Some people really will come up with or do anything to avoid actual science. This p-rep is just yet another layer of obfuscation that misses the point.

Take nobody’s word for it (don’t rely on argument-from-consensus/authority heuristics), which entails replicating each other’s work and requiring each other to make accurate and precise predictions about observations in the future. Those are the basic requirements of science; without them you have no functioning scientific community.

By: Noah Motion (http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-572216)
Thu, 28 Sep 2017 16:08:08 +0000

many psychological science types think that the p-value from a single experiment can inform you about replicability

2) I certainly agree about the issues with political science and econ; as I’ve said before ( https://medium.com/@davidmanheim/the-good-the-bad-and-the-appropriately-under-powered-82c335652930 ) there are questions that can’t be answered with better statistical tools, because the samples involved are limited. That’s a fundamental limitation for questions where new samples cannot be generated. I just had a closely related discussion regarding a re-analysis of data about wars; http://oefresearch.org/blog/debate-continues-peace-numbers – they essentially used p-values to conclude that the evidence they re-analyzed doesn’t reject the null hypothesis that there is no change, thereby “arguing with” Pinker. This was based on previously collected data, and the method of considering p-values for a new, more complex model to ask whether the evidence supports a previous claim seems even more conceptually broken than the usual use of p-values.

3) Agreed.

4) I’m not saying it’s actually addressed to Journal Editors. I managed not to say it explicitly, but I was contrasting this with the Lakens paper (which I was Nth author on, and contributed a very, very little bit towards), which was addressed much more to Researchers. The idea that the choice of alpha should be justified is very closely related to your points about how to choose what to research – but there, we had even more (implicit) focus on how to choose your sample size and study design. The idea that one should abandon NHST has implications for study design, but you more closely focused on what to study based on interpreting previous research, and how to analyze the data. That said, obviously, this is all conceptually related; designing good studies to maximize posterior surprise requires both interpretation of previous results to choose what to study, and implicitly choosing something like an SDT loss function to figure out what to look at, and how hard. What it shouldn’t involve, but currently does, is figuring out what sample size will get you p<0.05.

By: Anenoeuoid (http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-572126)
Thu, 28 Sep 2017 14:26:18 +0000

More LIGO: with each paper that gets published on the topic, the misunderstanding of statistical significance seems to get more brazen:

Both pipelines identified GW170814 with a Hanford-Livingston network SNR of 15, with ranking statistic values from the two pipelines corresponding to a false-alarm rate of 1 in 140 000 years in one search [39, 40] and 1 in 27 000 years in the other search [41–45, 57], clearly identifying GW170814 as a GW signal. The difference in significance is due to the different techniques used to rank candidate events and measure the noise background in these searches, however both report a highly significant event.

Here they come right out and say, “We rejected the background model with p-value = 1/140,000. This number is very small; therefore this event was a GW signal (from two black holes of specific size, etc., colliding).” Given that this is one of (if not the) highest-profile science projects of the last few years, I don’t think any abandonment is coming soon.
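For a rough sense of how a quoted false-alarm rate relates to a p-value, one can multiply the rate by the observing time to get an expected background count and then take the Poisson probability of at least one background event; the observing time used below is an invented placeholder, not LIGO’s actual figure:

```python
# Rough sketch relating a false-alarm rate to a p-value. The observing
# time (0.4 yr of coincident data) is an invented number for illustration.
import math

far_per_year = 1 / 140_000      # quoted false-alarm rate
obs_years = 0.4                 # hypothetical coincident observing time
expected_background = far_per_year * obs_years
# Poisson probability of at least one background event this loud.
p_value = 1 - math.exp(-expected_background)
print(f"{p_value:.2e}")
```

Note that this still only quantifies the background model; it says nothing by itself about the probability that the signal is a GW, which is the commenter’s complaint.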

Regarding your second paragraph: I think philosophical arguments have their place, as there are many ways to try to understand the world. I have spent most of my career keeping quiet about foundations and just doing stuff, and writing books demonstrating how to do stuff. When writing Bayesian Data Analysis and then again while writing Data Analysis Using Regression and Multilevel Models, I consciously avoided arguments about what methods are better or worse, instead focusing on good practice and the derivations of such methods. And had I spent the past thirty years doing nothing but screaming about hypothesis testing, I think I’d be a lot less happy and the world would be a poorer place.

Here’s what happened: I’ve had problems with null hypothesis testing for a long time, but for a long time my way of expressing this view was simply to not use such methods except in the rare cases where they seemed appropriate to me. I also explained this position in some theoretical articles, such as my 1995 paper with Rubin on avoiding model selection, my 2000 paper with Tuerlinckx on type M and S errors, my 2003 paper on exploratory data analysis and goodness of fit testing, and my 2011 paper with Hill and Masanao on multiple comparisons.

Incidentally, one of the cases where I did find some version of null hypothesis significance testing to be useful was in exploring problems with a model used for medical imaging analysis. I felt I got something from measuring the distance of the data to a model that I’d been using. This was in my Ph.D. thesis and it motivated my later work on posterior predictive checking, which started with a focus on Bayesian p-values but has since moved toward graphical visualizations.

Anyway, my crusade against NHST in recent years came about by accident, as I happened to encounter various bad published papers, starting with those of Satoshi Kanazawa and continuing with the well-publicized work of Daryl Bem and all the rest that we’ve been hearing so much about, and I started to realize that the problems with all this work were not just a bunch of individual data-processing mistakes of the Wansink variety, but a larger problem: when classical “p less than 0.05” methods are used in an attempt to extract signal from extremely noisy data, what will be extracted is something close to pure noise. I got involved in some particular disputes and then it seemed that it made sense to think harder about the general issues. Working on all of this has deepened my own understanding of these problems, and I feel that my work in this area has been a contribution, I hope in some part by motivating researchers to think more carefully about data quality rather than thinking that, just because they have a randomized experiment or a regression discontinuity or whatever, they can just push the buttons, grab statistically significant comparisons, and claim victory.
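The “pure noise” point can be illustrated with a quick simulation in the spirit of Gelman and Carlin’s Type M (exaggeration) error; the effect size and standard error below are invented for illustration:

```python
# Quick simulation: with a small true effect and a noisy measurement,
# the estimates that clear the p < 0.05 bar wildly exaggerate the truth.
# Effect and standard-error values are invented for illustration.
import random

random.seed(1)
true_effect, se = 0.2, 1.0   # tiny true effect, large standard error
sims = 100_000
significant = [est for est in
               (random.gauss(true_effect, se) for _ in range(sims))
               if abs(est) / se > 1.96]          # two-sided "p < 0.05"
exaggeration = (sum(abs(e) for e in significant) / len(significant)) / true_effect
print(f"share significant: {len(significant) / sims:.3f}")
print(f"avg |estimate| among significant results: {exaggeration:.1f}x the truth")
```

The significance filter selects exactly the estimates that overshoot, which is why publishing only the “p less than 0.05” results from noisy studies amounts to publishing noise.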

“Circular philosophical arguments” have necessarily played some role in these discussions, in part because classical NHST methods are supported by some theory. The math isn’t wrong but the assumptions don’t really apply—at least, not in many of the sorts of application areas where I see those methods being used—and so it’s kinda necessary to explain why the assumptions don’t apply, to give a sense that, yes, in some settings NHST can be theoretically supported but not in these sets of problems.

By: Björn (http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-571840)
Thu, 28 Sep 2017 08:38:44 +0000

If those were the main problems, then we’d be pretty okay in some (many?*) randomized experiments. If you randomized and did an exact randomization-based rank test (using a parametric analysis honestly does not worry me that much, but just for the sake of the argument…) for your single pre-specified outcome, you can certainly make sure that your p-values go below 0.05 in no more than 5% of such experiments under the null hypothesis.

* I.e., those where no intercurrent events mess up your randomized comparison and make the interpretation harder. The problems are things like when a patient dies in a randomized clinical trial in which you really wanted to compare blood pressure at the end of the trial between treatment groups.
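The calibration-by-construction that Björn describes can be seen in a minimal sketch of an exact randomization test on a single pre-specified outcome (a difference in means rather than a rank statistic, for brevity); the data are invented:

```python
# Minimal sketch of an exact randomization test: under the null, every
# relabeling of treatment/control is as good as the one actually used,
# so the p-value is calibrated by construction. Data are invented.
import itertools

treated = [5.1, 4.9, 6.2, 5.8]
control = [4.7, 5.0, 4.8, 5.2]
pooled = treated + control
observed = sum(treated) / len(treated) - sum(control) / len(control)

# Enumerate every way of assigning 4 of the 8 units to treatment.
count = total = 0
for idx in itertools.combinations(range(len(pooled)), len(treated)):
    t = [pooled[i] for i in idx]
    c = [pooled[i] for i in range(len(pooled)) if i not in idx]
    diff = sum(t) / len(t) - sum(c) / len(c)
    total += 1
    if abs(diff) >= abs(observed) - 1e-12:   # tolerance for float ties
        count += 1
p_value = count / total   # exact two-sided randomization p-value
print(p_value)
```

With 8 units there are only 70 possible assignments, so the reference distribution is enumerated exactly rather than assumed.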

By: Shravan (http://andrewgelman.com/2017/09/26/abandon-statistical-significance/#comment-571799)
Thu, 28 Sep 2017 05:46:22 +0000

Partly because many psychological science types think that the p-value from a single experiment can inform you about replicability in future experiments.

The Science article on reproducibility encouraged that kind of thinking (by saying that the lower the p-value in the original study, the more reproducible the result tended to be), and now I see more and more psychologists writing the above statement as if it were a fact.
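A small simulation shows why a single p-value says so little about replicability: replications of the very same true effect produce wildly different p-values. The effect size used here (a true z of 2) is an invented example:

```python
# Sketch: simulate many replications of one fixed true effect and look at
# the spread of two-sided p-values. Effect size and scale are invented.
import math
import random

def replication_p_value(effect, se):
    """Two-sided p-value for one simulated replication's estimate."""
    z = random.gauss(effect, se) / se
    return math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))

random.seed(7)
effect, se = 2.0, 1.0        # true z of 2.0, i.e. p near 0.05 "on average"
ps = sorted(replication_p_value(effect, se) for _ in range(10_000))
print("10th percentile p:", ps[1000])
print("median p:         ", ps[5000])
print("90th percentile p:", ps[9000])
```

The same underlying effect routinely yields p-values ranging from well below 0.01 to well above 0.2, which is the Gelman-Stern point that the p-value is a very noisy statistic.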