Friday, February 27, 2015

Throughout the history of psychological
science, there has been a continuing debate about which statistics are used and
how these statistics are reported. I distinguish between reporting statistics, and interpreting
statistics. This is important, because a lot of the criticism of the statistics
researchers use comes from how statistics are interpreted, not how they are
reported.

When it comes to reporting statistics, my
approach is simple: The more, the merrier. At the very minimum, descriptive
statistics (e.g., means and standard deviations) are required to understand the
reported data, preferably complemented with visualizations of the data (for
example in online supplementary material). This should include the sample sizes
(per condition for between subject designs), and correlations between dependent
variables in within subject designs. The number of participants per condition,
and especially the correlation between dependent variables, are often not
reported, but are necessary for future meta-analyses.

If you want to communicate the probability
the alternative hypothesis is true, given the data, you should report Bayesian
statistics, such as Bayes Factors. But in novel lines of research, you might
simply want to know whether your data is surprising, assuming there is no true
effect, and choose to report p-values
for this purpose.

But there is a much more important aspect
to consider when reporting statistics. Given that every study is merely
a data-point in a future meta-analysis, all meta-analytic data should be
presented to be able to include the data in future meta-analyses.

What is meta-analytic data?

What counts as meta-analytical data depends on
the meta-analytical technique that is used. The most widely known meta-analytical
technique is the meta-analysis of effect
sizes (often simply abbreviated as meta-analysis). In a meta-analysis of
effect sizes, researchers typically combine standardized effect sizes across
studies, and provide an estimate of, and confidence intervals around, the
meta-analytic effect size. A number of standardized effect sizes exist to
combine effects across studies that use different measures, such as Cohen’s d, correlations, and odds-ratios (note
that there are often many different ways to calculate these types of effect
sizes).

Recently, novel meta-analytical techniques
have been developed. For example, p-curve
analysis uses the test statistics (e.g., t-values,
F-values, and their degrees of
freedom) as input, and analyses the distribution of p-values. This analysis can indicate the p-value distribution is uniformly distributed (as expected when the
null-hypothesis is true), or that the p-value
distribution is right-skewed (as expected when the alternative hypothesis is
true). P-curve analysis has a number
of benefits, of which the most noteworthy is that it is performed on p-values below 0.05. Due to publication
bias, non-significant effects are often not shared between researchers, which
is a challenge for meta-analyses of effect sizes. P-curve analysis does not require access to non-significant results
to evaluate the evidential value of a set of studies, which makes it an
important complementary meta-analytical technique to meta-analysis of effect
sizes. Similarly, Bayesian meta-analytical techniques often rely on test statistics,
and not on standardized effect sizes.
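The two shapes of the p-value distribution described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the p-curve method itself: it assumes two-sided z-tests, and the names `simulate_p_values` and `p_curve`, the seed, and the noncentrality of 1.96 (roughly 50% power) are chosen here for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def simulate_p_values(n_studies, noncentrality, rng):
    """Two-sided p-values from z-tests whose test statistic is N(noncentrality, 1)."""
    p_values = []
    for _ in range(n_studies):
        z = rng.gauss(noncentrality, 1.0)
        p_values.append(2.0 * (1.0 - STD_NORMAL.cdf(abs(z))))
    return p_values

def p_curve(p_values, alpha=0.05, n_bins=5):
    """Share of the significant p-values in each of n_bins equal-width bins below alpha."""
    significant = [p for p in p_values if p < alpha]
    counts = [0] * n_bins
    for p in significant:
        counts[min(int(p / alpha * n_bins), n_bins - 1)] += 1
    return [c / len(significant) for c in counts]

rng = random.Random(2015)  # fixed seed so the sketch is repeatable
# H0 true: no effect, so the test statistic is centred on zero
null_curve = p_curve(simulate_p_values(100_000, 0.0, rng))
# H1 true: a noncentrality of 1.96 gives roughly 50% power at alpha = 0.05
effect_curve = p_curve(simulate_p_values(100_000, 1.96, rng))

print("H0 true:", [round(share, 3) for share in null_curve])
print("H1 true:", [round(share, 3) for share in effect_curve])
```

Under these assumptions the five bins below 0.05 each hold about a fifth of the significant p-values when H0 is true, while the lowest bin holds the largest share when H1 is true.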

If researchers want to facilitate future meta-analytical
efforts, they should report effect sizes and statistical tests for the comparisons
they are making. Furthermore, since you should not report point estimates
without indicating the uncertainty in those point estimates, researchers need
to provide confidence intervals around effect size estimates. Finally, when the
unstandardized data can clearly communicate the practical relevance of the
effect (for example, when you measured your dependent variable in scales we can
easily interpret, such as time or money) researchers might simply choose to
report the mean difference (and accompanying confidence interval).
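As a rough illustration of reporting an effect size together with its uncertainty, the sketch below computes Cohen's d for two independent groups and an approximate confidence interval. It makes several assumptions: a between-subject design, a large-sample approximation to the standard error of d, and a normal rather than noncentral t interval. The function name `cohens_d_with_ci` and the data are invented for the example.

```python
import math
from statistics import NormalDist, mean, stdev

def cohens_d_with_ci(group_a, group_b, confidence=0.95):
    """Cohen's d for two independent groups with an approximate confidence interval."""
    n_a, n_b = len(group_a), len(group_b)
    mean_diff = mean(group_a) - mean(group_b)
    # Pooled variance across both conditions
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2
                  + (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    d = mean_diff / math.sqrt(pooled_var)
    # Large-sample approximation to the standard error of d (an assumption of this sketch)
    se = math.sqrt((n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b)))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return d, (d - z * se, d + z * se)

# Hypothetical response times in seconds, invented for illustration only
condition_a = [0.92, 1.10, 0.87, 1.05, 0.98, 1.21, 0.95, 1.02]
condition_b = [0.71, 0.83, 0.69, 0.90, 0.77, 0.88, 0.74, 0.80]

d, (lower, upper) = cohens_d_with_ci(condition_a, condition_b)
print(f"n = {len(condition_a)} and {len(condition_b)} per condition")
print(f"Cohen's d = {d:.2f}, approximate 95% CI [{lower:.2f}, {upper:.2f}]")
```

With samples this small, an exact interval based on the noncentral t distribution would be preferable; the normal approximation is used here only to keep the example dependency-free.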

To conclude, the best recommendation I can
currently think of when reporting statistics is to provide means, standard
deviations, sample sizes (per condition in between designs), correlations
between dependent measures in within designs, statistical tests (such as a t-test), p-values, effect sizes and their confidence intervals, and Bayes
Factors. For example, when reporting a Stroop effect, we might write:

Obviously the best way to prevent
discussion about which statistics you report and to facilitate future meta-analyses is to share your raw data –
online repositories have made this so easy that there is no longer a good reason not
to share your data (except for some datasets where there are certain ethical
and privacy related aspects to consider).

Which statistics you interpret is a very different question, which I personally find
much less interesting, given that the interpretation of single studies is just
an interim summary while the field waits for the meta-analysis. A good
approach is to interpret all statistics you report, and to trust your conclusions
most when all statistical inferences provide converging support for your conclusion.

Tuesday, February 17, 2015

An extended version of this blog post is now in press at PeerJ.

TL;DR version: De Winter and Dodou (2015) analyzed
the distribution (and its change over time) of a large number of p-values automatically extracted from abstracts in the scientific literature. They concluded there is a ‘surge of p-values
between 0.041-0.049 in recent decades’ which 'suggests (but does not prove) questionable research
practices have increased over the past 25 years'. I show the changes in the ratios of p-values over the years between 0.041-0.049 are better explained by a model of
p-value distributions that assumes the average power has decreased over time. Furthermore, I propose that their observation that p-values just below 0.05 increase more strongly than p-values above 0.05 can be explained by an increase in publication bias over the years (cf. Fanelli, 2012), which has led to a relative decrease of 'marginally significant' p-values in the literature (instead of an increase in p-values just below 0.05).
I (again, see Lakens, 2014) explain why researchers analyzing large numbers of
p-values in the scientific literature need to develop better models of p-value
distributions before drawing conclusions about questionable research practices. I want to thank De Winter and Dodou for sharing their data,
assisting in the re-analysis, and reading an earlier version of this draft (to
which they replied they were happy to see other researchers used the data to
test alternative explanations, and that they did not see any technical mistakes
in this blog post).

In recent
years researchers have become more aware of how flexibility during the
data-analysis can increase false positive results (e.g., Simmons, Nelson, &
Simonsohn, 2011). If the true Type 1 error rate is substantially inflated
because researchers analyze their data until a p-value smaller than 0.05 is observed, this might substantially
decrease the robustness of scientific knowledge. However, as Stroebe and Strack
(2014, p. 60) have pointed out: “Thus
far, however, no solid data exist on the prevalence of such research practices”.
Some researchers have attempted to provide some indication of the prevalence of
questionable research practices by analyzing the distribution of p-values in the published literature.
The idea is that questionable research practices lead to ‘a peculiar prevalence
of p-values just below 0.05’
(Masicampo & Lalande, 2012) or the observation that ‘”just significant”
results are on the rise’ (Leggett, Loetscher, & Nichols, 2013).

Despite the
attention grabbing titles of these publications, the reported data does not afford the strong conclusions these researchers have drawn. The
observed pattern of a peak of p-values
just below 0.05 in Leggett et al (2013) does not replicate in other collected p-value distributions for the same
journal in later years (Masicampo & Lalande, 2012), in psychology in
general (Kühberger, Fritz, & Scherndl, 2014), or in scientific journals in
general (De Winter & Dodou, 2015). The peak in p-values observed in Masicampo & Lalande (2012) is only
surprising compared to an incorrectly modelled p-value distribution that ignores publication bias and its effect
on the frequency of p-values above
0.05 (Lakens, 2014).

Recently, De
Winter and Dodou (2015) have contributed to this emerging literature on p-value distributions and concluded that
there is a ‘surge of p-values between
0.041-0.049 in recent decades’. They improved upon earlier approaches to
analyze p-value distributions by
comparing the percentage of p-values
over time (from 1990-2013). Two observations in the data they collected could
seduce researchers to draw conclusions about a rise of p-values just below a significance level of 0.05. The first
observation is a much stronger rise in p-values
between 0.041 and 0.049 than in p-values
between 0.051-0.059. The second observation is that the percentage of p-values that falls between 0.041-0.049
has increased more from 1990 to 2013 than the percentage of p-values between 0.001-0.009, 0.011-0.019,
0.021-0.029, and 0.031-0.039 over the same years (the authors also analyze p-values with two digits, e.g., p = 0.04, which reveal similar patterns, but here I focus on the three-digit data, which covers ranges such as 0.041-0.049, because trailing zeroes, e.g., p = 0.040, are rarely reported). The authors (2015, p. 37) conclude
that: “The fact that p-values just
below 0.05 exhibited the fastest increase among all p-value ranges we searched for suggests (but does not prove) that
questionable research practices have increased over the past 25 years.”

I will
explain why the data does not provide any indication of an increase in
questionable research practices. First, I will discuss how the difference in
the increase in p-values just below
0.05 and just above 0.05 is due to publication bias, where (perhaps
surprisingly) p-values just above
0.05 are becoming relatively less likely to appear in the abstracts of
published articles over the years. Second, I will explain why the relatively
high increase in p-values between
0.041-0.049 over the years can easily be accounted for by a decrease in the
average power of studies, but is unlikely to emerge due to an inflated Type 1
error rate due to questionable research practices. I want to explicitly note
that it was possible to provide these alternative interpretations of the data
mainly because the authors shared all data and analysis scripts online (http://dx.doi.org/10.7717/peerj.733/supp-7)
and were furthermore extremely responsive and helpful in answering a number of
questions I had. While I criticize their interpretation of data, I applaud
their adherence to open science principles (their Matlab code is an excellent
example of reproducible statistics), which greatly facilitates cumulative
science.

As I have
discussed before (Lakens, 2014), it is essential to accurately model p-value distributions before drawing
conclusions about p-values extracted
from the scientific literature. Statements about p-value distributions require a definition of four parameters. First,
researchers should specify the number of studies where H0 is true, and the number
of studies where H1 is true. Second, researchers need to estimate the average power
of the studies (or the average power of multiple subsets of studies, if
heterogeneity in power is substantial). Third, the true Type 1 error rate and
any possible mechanisms through which the error rate is inflated should be
specified. And finally, publication bias, and a model of how the p-value distribution is affected by
publication bias, should be proposed. It is important to look beyond simplistic
comparisons between p-values just
below 0.05 and p-values in other
locations in the p-value distribution
outside the scope of an explicit model of the four parameters that determine p-value distributions.
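To make the four parameters concrete, here is one possible way to write them down as a toy model. It is a sketch, not the model used for the re-analysis: it assumes two-sided z-tests, treats Type 1 error inflation as a single multiplier on the null density below alpha, and treats publication bias as a single publication probability for non-significant results. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import NormalDist

STD_NORMAL = NormalDist()

@dataclass
class PValueModel:
    prop_h1: float          # share of studies where H1 is true
    power: float            # average power at the two-sided alpha level
    type1_inflation: float  # multiplier on the H0 density below alpha (1.0 = no p-hacking)
    publish_nonsig: float   # probability that a result with p >= alpha is published
    alpha: float = 0.05

    def _p_below(self, p):
        """P(p-value < p) when H1 is true, for a two-sided z-test (requires 0 < p <= 1)."""
        # Noncentrality implied by the power, ignoring the negligible opposite tail
        delta = STD_NORMAL.inv_cdf(1 - self.alpha / 2) + STD_NORMAL.inv_cdf(self.power)
        z_crit = STD_NORMAL.inv_cdf(1 - p / 2)
        return (1 - STD_NORMAL.cdf(z_crit - delta)) + STD_NORMAL.cdf(-z_crit - delta)

    def published_mass(self, low, high):
        """Unnormalised share of studies published with low <= p < high.

        Bins are assumed not to straddle alpha.
        """
        h1 = self.prop_h1 * (self._p_below(high) - self._p_below(low))
        h0 = (1 - self.prop_h1) * (high - low)  # p-values are uniform under H0
        if high <= self.alpha:
            return h1 + h0 * self.type1_inflation
        return (h1 + h0) * self.publish_nonsig

no_bias = PValueModel(prop_h1=0.5, power=0.5, type1_inflation=1.0, publish_nonsig=1.0)
biased = PValueModel(prop_h1=0.5, power=0.5, type1_inflation=1.0, publish_nonsig=0.5)

no_bias_ratio = no_bias.published_mass(0.041, 0.049) / no_bias.published_mass(0.051, 0.059)
biased_ratio = biased.published_mass(0.041, 0.049) / biased.published_mass(0.051, 0.059)
print(f"just-below / just-above 0.05, no publication bias: {no_bias_ratio:.2f}")
print(f"just-below / just-above 0.05, with publication bias: {biased_ratio:.2f}")
```

In this toy setup, halving the publication probability of non-significant results doubles the ratio of p-values just below 0.05 to p-values just above it, without any change in p-hacking.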

De Winter
and Dodou (2015) show there is a relatively stronger increase in p-values between 0.041-0.049 than
between 0.051-0.059 (see for example Figure 9, reproduced below). The data is
clear, but the reason for this difference is not. Are p-values below 0.05 increasing more, or are p-values above 0.05 increasing less? A direct comparison is
difficult, because the percentage of papers reporting p-values below 0.05 can increase due to an increase in p-hacking, but also due to an increase
in publication bias. If publication bias increases, and people report fewer
non-significant results, the percentage of papers reporting p-values smaller than 0.05 will also
increase, even if there is no increase in p-hacking.
Indeed, Fanelli (2012) has shown negative results have been disappearing from
the literature between 1990-2007, which would explain the relative differences
in p-values between 0.041-0.049 and
0.051-0.059 observed by De Winter and Dodou (2015).

We can
examine the alternative explanation that the relative differences observed are due
to publication bias increasing, instead of due to an increase in p-hacking, by comparing the relative
differences between p-values between
0.031-0.039 and 0.041-0.049 over the years on the one hand, and 0.051-0.059 and
0.061-0.069 on the other hand. If there is an increase in p-hacking, the biggest differences should be observed below 0.05
(in line with the idea of a surge of p-values
between 0.041-0.049).
However, there are reasons to assume the biggest difference might occur in p-values just above 0.05. As Lakens
(2014) noted, there seems to be some tolerance for p-values just above 0.05 to be published, as indicated by a higher
prevalence of p-values between
0.051-0.059 than would be expected based on the power of statistical tests and
an equal reduction of all p-values
above 0.05. If publication bias becomes more severe, we might expect a
reduction in the tolerance for p-values
just above 0.05, which would lead to the largest changes in ratios above 0.05. The spreadsheets and datafiles used to re-analyze and reconstruct the data are available on the OSF.

Across the
three time periods (1990-1997, 1998-2005, and 2006-2013) the ratio of p-values in the 0.03 range to p-values in the 0.04 range is pretty
stable: 1.13, 1.09, and 1.11, respectively. The ratio of p-values in the 0.05 range to p-values
in the 0.06 range is surprisingly large to begin with (given that purely based
on power, p-values between 0.051-0.059 and 0.061-0.069 should occur
approximately equally often in the literature), and shows a surprisingly large
reduction over the years: 2.27, 1.94, and 1.79, respectively. The only larger
reduction in ratios is observed for p-values
between 0.001-0.009 (which is most likely due to differences in power over the
years, as will be explained below). This surprisingly large change in ratios over
time for p-values between 0.051-0.059
indicates that instead of a surge of p-hacking,
publication bias has become more pronounced over the years for p-values just above the 0.05 level,
which causes p-values just above 0.05
to increase relatively less over the years than p-values in all other bins (except for p-values below 0.009).
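The size of the two changes can be compared directly from the ratios quoted above. The snippet is plain arithmetic on those six numbers; the helper name `relative_change` is made up for the example.

```python
# Ratios for 1990-1997, 1998-2005, and 2006-2013, as quoted in the text
ratios = {
    "0.03 range / 0.04 range": [1.13, 1.09, 1.11],
    "0.05 range / 0.06 range": [2.27, 1.94, 1.79],
}

def relative_change(first, last):
    """Change from the first to the last period, as a fraction of the first."""
    return (last - first) / first

for label, values in ratios.items():
    change = relative_change(values[0], values[-1])
    print(f"{label}: {values[0]} -> {values[-1]} ({change:+.1%})")
```

The ratio below 0.05 changes by roughly 2% between the first and last period, while the ratio above 0.05 drops by roughly 21%.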

This might be
explained by the idea that p-values
between 0.051-0.059 (or 'marginally significant' p-values) were more readily interpreted as support for the hypothesis
in 1990-1997 than in 2006-2013. This idea is
speculative, but seems likely given the increase in publication bias over the
years (Fanelli, 2012). It should be noted that p-values just above the 0.05 level are still more frequent than can be explained just by the average power of the
tests and publication bias that is equal for all p-values above 0.05 (cf. Lakens, 2014). In other words, this data
is in line with the idea that publication bias is still slightly less severe
for p-values just above 0.05, even
though this benefit of p-values just
above 0.05 has become smaller over the years.

This seems
to be the driving force for the differences between p-values in the 0.041-0.049 range and p-values
in the 0.051-0.059 range, reported by De Winter and Dodou (2015, e.g., Figures
9 and 10). To conclude, these observed differences provide no indication for a
surge of p-values between 0.041-0.049
over the years due to an increase in questionable research practices.

How changes in average
power over the years affect ratios of p-values
below 0.05

The title
of the article, “A surge of p-values
between 0.041-0.049” is based on the observation that the ratio of p-values between 0.041-0.049 increases
more than the ratio of p-values
between 0.031-0.039, 0.021-0.029, and 0.011-0.019. There are no statistics
reported to indicate whether these differences in ratios are statistically
significant, nor are effect sizes reported to indicate whether the differences
are practically significant (or justify the term ‘surge’), but the ratios do
increase as you move from bins of low p-values
between 0.001-0.009 to bins of high p-values
between 0.041-0.049. Figure 23 reports the ratios of percentages of p-values in 1990 and 2013 for a range of
search terms. Most interesting for the current purpose are the p-values between 0.001 and 0.049.

The first
thing to understand is why these ratios are not close to 1. The reason is that
there is a massive increase in the percentage of papers in which p-values are reported over the years. As
De Winter & Dodou (2015, p. 15) note: “In
1990, 0.019% of papers (106 out of 563,023 papers) reported a p-value between
0.051 and 0.059. This increased 3.6-fold to 0.067% (1,549 out of 2,317,062
papers) in 2013. Positive results increased 10.3-fold in the same period: from
0.030% (171 out of 563,023 papers) in 1990 to 0.314% (7,266 out of 2,317,062
papers) in 2013.” This is not just an increase in the absolute number of
reported p-values in abstracts (in
which case the ratios could still be 1) but a relative 10.3-fold increase in
how often p-values end up in
abstracts. De Winter & Dodou (2015) demonstrate p-values are finding their way into more and more abstracts, which
points to a possible increase in the overreliance on null-hypothesis testing in
empirical articles. This is an important contribution to the literature, even
when other claims about an increase in questionable research practices would
not hold (also, the huge increase in the term 'paradigm shift' in abstracts over time is quite telling).

How can
these differences between the ratios across the 5 bins below 0.05 be explained
by a model of p-value distributions
that consists of the ratio of true to false effects examined, power, the Type 1
error rate, and publication bias? We can only explain the relative differences
between the ratios over the different bins of p-values if we allow at least one of the parameters of the model to
change over time. We can ignore publication bias, assuming all disciplines
that report p-values in abstracts use
α = 0.05 (this is not true, but we can assume it applies to the majority of
articles that are analyzed). The two remaining possibilities are a change in
the average power of studies over time, and an inflated Type 1 error rate over
time, such as an increase in questionable research practices in the
literature.

If we
ignore Type 1 errors, we can relatively easily reconstruct the observed data
purely based on differences in the average power across the years. I’m not
arguing the numbers in this re-construction reflect the truth. However, they show
it is possible to model the ratios observed by De Winter & Dodou (2015) under
the assumption that power differs from 1990 to 2013. For example, if we assume
average power was 55% in 1990, and 42% in 2013, we can expect to observe the p-value distribution across the 5 bins
as detailed in the table below, with 29.855% of the p-values falling between 0.001 and 0.009 in 1990, but only 19.926%
of p-values falling between
0.001-0.009 in 2013 (which most likely explains the large differences in ratios
between 0.001-0.009 discussed earlier). This is just the p-value distribution as a function of the power of the tests.
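How a drop in power shifts p-values towards the higher bins can be sketched as follows. The calculation assumes two-sided z-tests, so the numbers it prints will not match the 29.855% and 19.926% above, which come from the calculation behind the table and depend on the test and on how the lowest bin is defined. The function names are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
BINS = [(0.001, 0.009), (0.011, 0.019), (0.021, 0.029), (0.031, 0.039), (0.041, 0.049)]

def prob_p_below(p, power, alpha=0.05):
    """P(p-value < p) for a two-sided z-test with the given power at alpha."""
    # Noncentrality implied by the power, ignoring the negligible opposite tail
    delta = STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)
    z_crit = STD_NORMAL.inv_cdf(1 - p / 2)
    return (1 - STD_NORMAL.cdf(z_crit - delta)) + STD_NORMAL.cdf(-z_crit - delta)

def bin_probabilities(power):
    """Probability that a study with a true effect lands in each bin."""
    return [prob_p_below(high, power) - prob_p_below(low, power) for low, high in BINS]

p_high_power = bin_probabilities(0.55)  # assumed average power in 1990
p_low_power = bin_probabilities(0.42)   # assumed average power in 2013
ratios = [low / high for high, low in zip(p_high_power, p_low_power)]

for (low, high), a, b, r in zip(BINS, p_high_power, p_low_power, ratios):
    print(f"{low:.3f}-{high:.3f}: power 55% {a:.4f}, power 42% {b:.4f}, ratio {r:.3f}")
```

Under these assumptions the ratio grows from the lowest to the highest bin: lower power removes relatively more p-values from the small bins than from the bins just below 0.05.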

If we
incorporate the fact that the proportion of abstracts reporting p-values
has increased tenfold over the years (from 0.01 to 0.1, columns 2 and 3 in
Table 2 below), and use as total studies in 1990 563023, and as total studies
in 2013 2317062 (taken from De Winter & Dodou, 2015) then we should expect
the total number of observed p-values
in 1990 and 2013 as displayed in columns 4 and 5 below. These numbers mirror
the observed frequencies (columns 6 and 7) by De Winter and Dodou (2015).
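The reconstructed counts appear to follow from a simple multiplication: total papers, times the proportion of abstracts reporting p-values, times the probability of a p-value falling in the bin. The sketch below checks this for the lowest bin, using the bin probabilities given earlier (29.855% and 19.926%); the function name `expected_count` is made up for the example.

```python
def expected_count(total_papers, share_reporting_p, bin_probability):
    """Expected number of abstracts reporting a p-value in a given bin."""
    return total_papers * share_reporting_p * bin_probability

# Lowest bin (0.001-0.009): totals from De Winter & Dodou (2015),
# shares and bin probabilities as stated in the text
count_1990 = expected_count(563_023, 0.01, 0.29855)
count_2013 = expected_count(2_317_062, 0.10, 0.19926)

print(round(count_1990), round(count_2013))  # prints 1681 46170
```

These are the reconstructed values in the first row of Table 2.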

Table 2. Absolute number of reconstructed and observed p-values between 0.001-0.049 from 1990 to 2013.

| p-value bin | % p-values in abstract (1990) | % p-values in abstract (2013) | reconstructed p-values 1990 | reconstructed p-values 2013 | observed p-values 1990 | observed p-values 2013 |
|---|---|---|---|---|---|---|
| p0.001-p0.009 | 0.01 | 0.1 | 1681 | 46170 | 1770 | 44970 |
| p0.011-p0.019 | 0.01 | 0.1 | 481 | 16728 | 462 | 14885 |
| p0.021-p0.029 | 0.01 | 0.1 | 316 | 11725 | 268 | 10630 |
| p0.031-p0.039 | 0.01 | 0.1 | 238 | 9210 | 240 | 9108 |
| p0.041-p0.049 | 0.01 | 0.1 | 191 | 7646 | 178 | 8250 |

When we
calculate the ratios of the reconstructed p-values,
we see in Table 3 they approach the general pattern of the ratios observed by
De Winter and Dodou (2015). The reconstruction is not perfect, for a number of
reasons. First of all, there is very little data from 1990, which will lead to
substantial variation between expected and observed frequencies for any model
(the fit of the model increases for comparisons between years where there is
more data available). For example, the fact that the difference in the
percentage of p-values in the
0.021-0.029 bin from 1990 to 2013 is larger than for p-values in the 0.031-0.039 bin is only true in 1990 and 2008, but
is reversed (as predicted by a model of p-value distributions where power
changes over time) in the remaining 21 comparisons of 2013 with each preceding
year.
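The observed ratios can be recomputed from the observed counts in Table 2 and the total number of papers per year. This is a small sketch; `fold_increase` is a name chosen here, and small differences from the published ratios are rounding.

```python
TOTAL_1990, TOTAL_2013 = 563_023, 2_317_062

# bin: (observed count 1990, observed count 2013), taken from Table 2
observed = {
    "0.001-0.009": (1770, 44970),
    "0.011-0.019": (462, 14885),
    "0.021-0.029": (268, 10630),
    "0.031-0.039": (240, 9108),
    "0.041-0.049": (178, 8250),
}

def fold_increase(count_1990, count_2013):
    """Ratio of the share of papers in 2013 to the share of papers in 1990."""
    return (count_2013 / TOTAL_2013) / (count_1990 / TOTAL_1990)

for label, (n_1990, n_2013) in observed.items():
    share_1990 = 100 * n_1990 / TOTAL_1990
    share_2013 = 100 * n_2013 / TOTAL_2013
    print(f"{label}: {share_1990:.3f}% -> {share_2013:.3f}% "
          f"({fold_increase(n_1990, n_2013):.2f}-fold)")
```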

Table 3. Ratios of reconstructed and observed p-values between 0.001-0.049 from 1990 to 2013.

| p-value bin | reconstructed ratio N/T 1990 | reconstructed ratio N/T 2013 | reconstructed 1990/2013 Ratio | observed ratio N/T 1990 | observed ratio N/T 2013 | observed 1990/2013 Ratio |
|---|---|---|---|---|---|---|
| p0.001-p0.009 | 0.306 | 1.993 | 6.674 | 0.315 | 1.945 | 6.17 |
| p0.011-p0.019 | 0.085 | 0.722 | 8.454 | 0.082 | 0.644 | 7.83 |
| p0.021-p0.029 | 0.056 | 0.506 | 9.017 | 0.048 | 0.460 | 9.63 |
| p0.031-p0.039 | 0.042 | 0.398 | 9.417 | 0.043 | 0.394 | 9.21 |
| p0.041-p0.049 | 0.034 | 0.330 | 9.740 | 0.032 | 0.367 | 11.28 |

Similarly,
when comparing 2013 to each of the 23 preceding years, the ratio is higher for p-values between 0.041-0.049 than for
0.031-0.039 in 12 out of 23 comparisons – only just more than 50% of the time,
which can hardly be called a ‘surge’. The model based on power differences
predicts that ratios for p-values
between 0.031-0.039 should be very similar to those between 0.041-0.049. Given
the small percentages of articles that report p-values and the variation inherent in observed p-value distributions, it is not
surprising the ratios for 0.041-0.049 are only just more than 50% likely to be
higher than those for p-values
between 0.031-0.039. This observation is more difficult to explain based on the
idea that questionable research practices have increased, which typically
assumes p-values between 0.041-0.049
increase more strongly than p-values
between 0.031-0.039 (e.g., Leggett et al., 2013; Masicampo & Lalande,
2012).

Obviously
this model is too simplistic. It does not include any Type 1 errors, and it
assumes homogeneity in the power of the performed tests. We can be certain
power varies substantially across studies and research disciplines, and we can
be certain there are a number of Type 1 errors in the literature. For the
current purpose, which is to demonstrate the observed pattern can be
reconstructed by assuming the average power has changed over time, a more advanced
model is not required, but future attempts to provide support for an increase
in Type 1 errors, or attempts to calculate average effect sizes based on p-value distributions (e.g., Simonsohn,
Nelson, & Simmons, 2014) need to develop more detailed models of p-value distributions.

Let’s assume
the average power has not changed over time, and try to reconstruct the
observed ratios by changing the Type 1 error rates. As long as the Type 1 error
rates are the same for each bin of p-values,
the ratios equal the overall increase in p-values
reported in abstracts over time. To reconstruct the ratios as observed by De
Winter and Dodou (2015), we need to assume p-hacking
leads to a stronger increase in higher p-values
than in lower p-values. Although this
is a reasonable assumption under many types of p-hacking, it turns out that the specific pattern of inflated
Type 1 error rates required to reconstruct the observed ratios is not very
likely to emerge in real life.

To simulate
the impact of questionable research practices, we need to decide upon the ratio
of studies where H0 is true and studies where H1 is true, and the exact
increase in Type 1 error rates for each bin of p-values below 0.05. Type 1 errors come exclusively from analyzing
results of studies where H0 is true (p-hacking
when H1 is true inflates the effect size estimate, and thus can be seen as an
incorrect way to increase the power of a test). In the calculations below,
power is kept constant, but p-hacking
is introduced. This is the equivalent of the true power of studies reducing
over the years, which is exactly compensated by an inflated Type 1 error rate.

The
observed ratios by De Winter & Dodou (2015) show the ratio is the smallest
for p-values between 0.001-0.009, and
substantially higher for p-values
between 0.011 and 0.049, with a relatively small increase in these 4 bins. This
pattern can be reproduced just based on inflated Type 1 errors, but the required
increase in Type 1 error rates over the 5 bins is very unlikely to occur when p-hacking.

The higher
the average power of statistical tests, the more frequently small p-values will be observed if there is a
true effect. This means there are more p-values
between 0.021-0.029 than between 0.041-0.049 whenever the power is larger than
0. Without p-hacking, the number of
Type 1 errors in each bin (e.g., between 0.001 and 0.009) should be 0.8% (it is
1% between 0 and 0.01). If we assume this was the situation in 1990 (which is a
conservative, albeit unlikely, estimate), the Type 1 error rates need to be
increased to higher levels to reproduce the observed ratios, after selecting
the average power of the studies, and the ratio of studies where H0 is true and
H1 is true. It becomes extremely difficult to reconstruct the observed absolute
numbers and ratios.

One attempt
to reconstruct the ratios (but not the absolute values) is presented
in Table 4. The ratio of studies where H0 is true to studies where H1 is true
is set to 1, and the average power is assumed to be 57.5%. The Type 1 error
rate inflation over time is substantial, and the difference in the increase
over the bins is not very typical, with a practically equal increase between
0.021-0.049. To achieve the ratios observed by De Winter & Dodou (2015) for
comparisons between 2013 and years later than 1990, the Type 1 error rate even needs
to be inflated more strongly for p-values
between 0.021-0.029 than for p-values
between 0.041-0.049. Such a pattern of Type 1 error rate inflation is
practically difficult to achieve, because questionable research practices (such
as performing multiple analyses on the same data with different outlier
criteria) produce a p-value
distribution where higher p-values
are observed more frequently than smaller p-values.
Thus, although it is not impossible to achieve the observed ratios purely by p-hacking (although it is very
challenging to reconstruct both ratios and absolute numbers), the required Type
1 error rate inflation over the 5 bins of p-values
is unlikely to occur in real life.
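The reconstructed ratios in Table 4 can be recomputed by adding true effects and Type 1 errors per bin and dividing by the total number of papers in each year. The sketch below does this with the counts from Table 4; the function name `reconstructed_ratio` is illustrative, and small deviations from the tabled ratios reflect rounding of the counts.

```python
TOTAL_1990, TOTAL_2013 = 563_023, 2_317_062

# bin: (1990 true effects, 2013 true effects, 1990 Type 1 errors, 2013 Type 1 errors),
# taken from Table 4
table_4 = {
    "0.001-0.009": (1814, 47784, 90, 4449),
    "0.011-0.019": (492, 12959, 90, 5932),
    "0.021-0.029": (319, 8399, 90, 7415),
    "0.031-0.039": (238, 6260, 90, 7415),
    "0.041-0.049": (189, 4988, 90, 8008),
}

def reconstructed_ratio(true_1990, true_2013, errors_1990, errors_2013):
    """Share of papers in a bin in 2013 divided by the share in 1990."""
    share_1990 = (true_1990 + errors_1990) / TOTAL_1990
    share_2013 = (true_2013 + errors_2013) / TOTAL_2013
    return share_2013 / share_1990

for label, row in table_4.items():
    true_2013, errors_2013 = row[1], row[3]
    error_share = errors_2013 / (true_2013 + errors_2013)
    print(f"{label}: ratio {reconstructed_ratio(*row):.2f}, "
          f"Type 1 errors are {error_share:.0%} of the 2013 p-values in this bin")
```

Printing the share of Type 1 errors per bin shows how much of the 2013 literature in the higher bins would have to consist of false positives for this reconstruction to work.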

Table 4. Absolute number of reconstructed Type 1 errors between 0.001-0.049 from 1990 to 2013.

| p-value bin | 1990 true effects | 2013 true effects | Type 1 error rate 1990 | 1990 Type 1 errors | Type 1 error rate 2013 | 2013 Type 1 errors | Reconstructed 1990/2013 Ratio |
|---|---|---|---|---|---|---|---|
| p0.001-p0.009 | 1814 | 47784 | 0.008 | 90 | 0.015 | 4449 | 6.66 |
| p0.011-p0.019 | 492 | 12959 | 0.008 | 90 | 0.020 | 5932 | 7.89 |
| p0.021-p0.029 | 319 | 8399 | 0.008 | 90 | 0.025 | 7415 | 9.40 |
| p0.031-p0.039 | 238 | 6260 | 0.008 | 90 | 0.025 | 7415 | 10.14 |
| p0.041-p0.049 | 189 | 4988 | 0.008 | 90 | 0.027 | 8008 | 11.30 |

To
summarize, we can easily reconstruct the observed ratios by assuming a
relatively small decrease in power over the years (e.g., from 55% to 42%). Such
an assumption could be reasonable, as long as new research areas, or strongly
growing research areas, have lower power than average. One example of such a
research area is neuroscience, with a median power estimated to be as low as
21% (Button et al., 2013). On the other hand, while increases in Type 1 error
rates can be used to reconstruct the observed ratios, the pattern of inflated
Type 1 errors across the 5 bins of p-values
is unlikely to emerge in real life.

Therefore,
I conclude it is not true that there is a ‘surge of p-values between 0.041-0.049’, nor that these data suggest there is
an increase in questionable research practices over the last 25 years. The search for evidence of an increase in questionable
research practices is starting to mirror the search for the ether. After
repeatedly claiming to observe a rise in p-values
just below 0.05 without providing substantial evidence for such a rise (De
Winter & Dodou, 2015; Leggett et al., 2013; Masicampo & Lalande, 2012),
it is time researchers investigating inflated Type 1 errors use better models,
make better predictions, and collect better data. Analyzing huge numbers of p-values, which come from studies with
huge heterogeneity, will not be able to provide any indication of the
prevalence of questionable research practices, not even when changes of p-value
distributions are analyzed over time. All these papers are evidence of is a
peculiar prevalence of incorrect conclusions about p-value distributions.