Thursday, July 27, 2017

Overvaluing P-values

Seventy-two "big names in statistics want to shake up [the] much maligned P-value." 1/ These academic researchers give the following one-sentence summary of their proposal to change the way scientific articles are written:

We propose to change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005. 2/

The President’s Council of Advisors on Science and Technology (PCAST) effectively invoked the current “default P-value” of 0.05 as a rule for admitting scientific evidence in court. 3/ In light of this new (but not novel) call for reducing the conventional p-value, one might think that PCAST was being too generous toward forensic-science identifications. After all, if 5% is ten times the value that should be used to declare differences “statistically significant” in scientific research, then it seems way too large as a limit for what PCAST called “scientific reliability” for courtroom use. But that conclusion would rest on a misunderstanding of the objective and nature of the proposal to change the nomenclature for p-values.

The motivation for moving from 0.05 to 0.005 is “growing concern over the credibility of claims of new discoveries based on ‘statistically significant’ findings.” The authors argue that regarding 0.05 as proof of a real difference (a true positive) is “a leading cause of non-reproducibility” of published discoveries in scientific fields that could easily be corrected by referring to findings with p-values between 0.005 and 0.05 as “suggestive” rather than “significant.” The problem with the verbal tag of “statistically significant” (p < 0.05) is that, in comparison to the state of scientific research ninety years ago when Sir Ronald Fisher floated the 0.05 level, “[a] much larger pool of scientists are now asking a much larger number of questions, possibly with much lower prior odds of success,” resulting in too many apparent discoveries that cannot be replicated in later experiments.

Not only is the group of 72 addressing the perils of the p-value in a different context, but their proposal is not intended as a bright-line rule for deciding what to publish. They explain:

We emphasize that this proposal is about standards of evidence, not standards for policy action nor standards for publication. Results that do not reach the threshold for statistical significance (whatever it is) can still be important and merit publication in leading journals if they address important research questions with rigorous methods. This proposal should not be used to reject publications of novel findings with 0.005 < P < 0.05 properly labeled as suggestive evidence.

So too, “[r]esults that do not reach the threshold for statistical significance (whatever it is) can still be important” in litigation, and the desire to shake things up in the research community does not reveal much about appropriate standards for admissibility in court.

However, PCAST is on firm ground in emphasizing the need to present forensic-science findings without overstating their probative value. The 72 researchers focus on probative value when they discuss a “more direct measure of the strength of evidence.” They suggest that a “two-sided P-value of 0.05 [often] corresponds to Bayes factors ... that range from about 2.5 to 3.4.” Such evidence, they note, is weak. In contrast, they defend the “two-sided P-value of 0.005” in part on the ground that it “corresponds to Bayes factors between approximately 14 and 26.” As such, it “represents ‘substantial’ to ‘strong’ evidence according to conventional Bayes factor classifications.”
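The low ends of those ranges (about 2.5 and about 14) can be recovered from a well-known upper bound on the Bayes factor implied by a p-value, the −e·p·ln(p) bound of Sellke, Bayarri, and Berger (2001). Whether this is the exact calculation the 72 authors performed is an assumption here; the sketch below simply shows that the bound reproduces their figures:

```python
import math

def max_bayes_factor(p):
    """Upper bound on the Bayes factor favoring the alternative
    hypothesis, given a two-sided p-value, via the Sellke-Bayarri-
    Berger bound: BF <= 1 / (-e * p * ln(p)). Valid for p < 1/e."""
    if not 0 < p < 1 / math.e:
        raise ValueError("bound holds only for 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))

# The bound at the two thresholds discussed in the proposal:
for p in (0.05, 0.005):
    print(f"p = {p}: Bayes factor at most {max_bayes_factor(p):.1f}")
# p = 0.05 yields roughly 2.5; p = 0.005 yields roughly 13.9
```

The upper ends of the quoted ranges (3.4 and 26) come from a different prior assumption and are not captured by this one-line bound.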

Forensic scientists who advocate describing the strength of evidence, rather than reporting only false-positive rates, are more demanding: they usually classify Bayes factors between 10 and 100 as “moderate” rather than “strong” evidence. 4/