A paper about P-Curve by Simmons, Nelson, and Simonsohn

In this article, authors Joseph Simmons, Leif Nelson, and Uri Simonsohn propose a way to distinguish between truly significant findings and false positives resulting from selective reporting and specification searching, or p-hacking.

A p value indicates “how likely one is to observe an outcome at least as extreme as the one observed if the studied effect were nonexistent.” As a reminder, many academic journals tend to publish only studies with p values below 0.05, the conventional threshold for statistical significance.

Some researchers use p-hacking to “find statistically significant support for nonexistent effects,” allowing them to “get most studies to reveal significant relationships between truly unrelated variables.”

The p-curve can be used to detect p-hacking. The authors define this curve as “the distribution of statistically significant p values for a set of independent findings. Its shape is a diagnostic of the evidential value of that set of findings.”
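To make the definition concrete, here is a minimal Python sketch of how such a distribution could be tabulated. The function name p_curve_counts, the 0.01-wide bins, and the example p values are illustrative assumptions, not something taken from the paper.

```python
def p_curve_counts(p_values, alpha=0.05, n_bins=5):
    """Count significant p values in equal-width bins below alpha.

    With the defaults, the bins are [0, .01), [.01, .02), ..., [.04, .05).
    Nonsignificant p values (p >= alpha) are dropped, because the p-curve
    only describes statistically significant results.
    """
    width = alpha / n_bins
    counts = [0] * n_bins
    for p in p_values:
        if 0 <= p < alpha:
            # min() guards against floating-point rounding at the top edge
            index = min(int(p / width), n_bins - 1)
            counts[index] += 1
    return counts


# Hypothetical p values from a set of independent findings
example_p_values = [0.003, 0.004, 0.012, 0.021, 0.047, 0.20]
example_counts = p_curve_counts(example_p_values)
print(example_counts)
```

The returned list holds one count per bin, ordered from the lowest p values to the highest; the nonsignificant value (0.20) is ignored.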

In order for p-curve inferences to be credible, the p values selected must be:

associated with the hypothesis of interest,

statistically independent from other selected p values, and

distributed uniformly under the null.

It is also important to clarify that the p-curve assesses only the reported data, not the theories being tested. Similarly, it’s important to keep in mind that, if a set of findings is found to have evidential value, this doesn’t automatically imply internal or external validity.

Using the p-curve to detect p-hacking is fairly straightforward. If the curve is right-skewed, as in the chart to the right in the figure below, there are more low (0.01s) than high (0.04s) significant p values, which suggests that the findings have evidential value. When nonexistent effects are studied (i.e., a study’s null hypothesis is true), all p values are equally likely to be observed, producing a uniform curve, or a flat line. In the figure below, each chart includes this uniform curve as a dotted red line for comparison. Curves that are left-skewed, however, as in the chart to the left in the figure below, contain more high p values than low ones, indicating that p-hacking has likely occurred.
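The difference between a flat and a right-skewed curve can be illustrated with a small simulation. This is a rough sketch under stated assumptions: every study is modeled as a two-sided z test, the true effect of 2.5 (in z units) is an arbitrary choice, and p-hacking itself is not simulated. All function names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(z):
    """Two-sided p value for a z statistic."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def simulate_significant_p_values(true_effect_z, n_studies, rng):
    """Simulate studies and keep only the p values below 0.05."""
    kept = []
    for _ in range(n_studies):
        z = rng.gauss(true_effect_z, 1.0)  # observed z = true effect + noise
        p = two_sided_p(z)
        if p < 0.05:
            kept.append(p)
    return kept


def share_below(p_values, cutoff=0.025):
    """Fraction of the significant p values in the lower half of (0, 0.05)."""
    return sum(p < cutoff for p in p_values) / len(p_values)


rng = random.Random(0)
null_share = share_below(simulate_significant_p_values(0.0, 100_000, rng))
real_share = share_below(simulate_significant_p_values(2.5, 100_000, rng))
print(f"no effect:   {null_share:.2f} of significant p values are below 0.025")
print(f"real effect: {real_share:.2f} of significant p values are below 0.025")
```

With no effect, roughly half of the significant p values should land below 0.025, which is the flat curve. With a real effect, clearly more than half should, which is the right skew.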

The above figure displays the results of the authors’ demonstration of the p-curve, in which they analyzed two sets of findings taken from the Journal of Personality and Social Psychology (JPSP). They hypothesized that one set was p-hacked, while the other was not. The set suspected of p-hacking consisted of studies whose authors reported results only with a covariate. While there is nothing wrong with including covariates in a study’s design, many researchers will include one only after their initial analysis (without the covariate) turned out to be nonsignificant.

Simmons, Nelson, and Simonsohn provide guidelines to follow when selecting studies to analyze with the p-curve:

Create a rule. – Those running the p-curve should decide in advance which studies to include.

Disclose the selection rule.

Robustness to resolutions of ambiguity. – If it is unclear whether or not a study should be included, those running the p-curve should report results both with and without that study. This allows readers to see how much these ambiguous cases influence the result.

Replicate single-article p-curves. – Because of the risk of cherry-picking single articles, the authors suggest a direct replication of at least one of the studies in the article to improve the credibility of the p-curve.

In addition to these guidelines, Simmons, Nelson, and Simonsohn also provide five steps to ensure that the selected p values meet the three selection criteria mentioned earlier:

Identify researchers’ stated hypothesis and study design.

Identify the statistical result testing the stated hypothesis.

Report the statistical results of interest.

Recompute precise p values based on reported test statistics – This has been made easy through an online app, which you can find at http://p-curve.com/

Report robustness results of the p values to your selection rules.
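As an illustration of the recomputation step in the list above, the following sketch turns a reported test statistic into a precise p value. It assumes a two-sided z test only; statistics such as t or F need their own distributions, which the online app handles. The function name and the example result are hypothetical.

```python
from statistics import NormalDist


def precise_p_from_z(z):
    """Recompute a precise two-sided p value from a reported z statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical reported result: "z = 2.20, p < .05"
recomputed_p = precise_p_from_z(2.20)
print(f"reported as p < .05, recomputed p = {recomputed_p:.4f}")
```

A result reported only as “p < .05” says nothing about where the p value falls between 0 and 0.05, and that location is exactly what the shape of the p-curve depends on.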

The p-curve is not infallible. The validity of the judgments made from a p-curve may depend on “the number of studies being p-curved, their statistical power, and the intensity of p-hacking.” Cherry-picking studies so that the p-curve shows a lack of evidential value is less of a concern, since it can be prevented simply by disclosing selections, ambiguity, sample size, and other study details.
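The dependence on the number of studies can be illustrated with a simple test for right skew. The sketch below uses Fisher’s method on rescaled p values (p divided by 0.05); this choice of test is an assumption made here for illustration, not something this summary describes, and it is not the online app’s implementation. The function name and inputs are hypothetical.

```python
import math


def right_skew_p(p_values, alpha=0.05):
    """Fisher-style test of right skew among significant p values.

    Under the null of no effect, p / alpha is uniform on (0, 1) for
    significant results, so -2 * sum(log(p / alpha)) follows a chi-square
    distribution with 2k degrees of freedom (k = number of p values).
    A small return value suggests more low p values than a flat curve.
    """
    significant = [p for p in p_values if 0 < p < alpha]
    k = len(significant)
    if k == 0:
        raise ValueError("need at least one significant p value")
    half_stat = -sum(math.log(p / alpha) for p in significant)
    # Chi-square survival function for even df = 2k has a closed form:
    # exp(-x/2) * sum over i < k of (x/2)^i / i!, with x/2 = half_stat
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Same p value in every study, but different numbers of studies
few_studies = right_skew_p([0.01] * 5)
many_studies = right_skew_p([0.01] * 20)
print(few_studies, many_studies)
```

With identical p values, the larger set yields a smaller test result, so the same curve shape supports a stronger conclusion when more studies are included.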

Additionally, the p-curve has a few limitations. First, it “does not yet technically apply to studies analyzed using discrete test statistics.” Second, it is “less likely to conclude data have evidential value when a covariate correlates with the independent variable of interest.” Finally, it has a hard time with mixed cases: if there is a real effect but also mild p-hacking, it usually won’t detect the latter.

Simmons, Nelson, and Simonsohn conclude that examining the distribution of p values makes it possible to identify whether selective reporting was used. What do you think about the p-curve? Would you use this tool?