Thursday, 9 August 2012

Like a tired boxer at the Olympic Games, the reputation of psychological science has just taken another punch to the gut. After a series of fraud scandals in social psychology and a US survey that revealed the widespread use of questionable research practices, a paper published this month finds that an unusually large number of psychology findings are reported as "just significant" in statistical terms.

The pattern of results could be indicative of dubious research practices, in which researchers nudge their results towards significance, for example by excluding troublesome outliers or adding new participants. Or it could reflect a selective publication bias in the discipline - an obsession with reporting results that have the magic stamp of statistical significance. Most likely it reflects a combination of both these influences. On a positive note, psychology, perhaps more than any other branch of science, is showing an admirable desire and ability to police itself and to raise its own standards.

E. J. Masicampo at Wake Forest University, USA, and David Lalande at Université du Québec à Chicoutimi, analysed 12 months of issues, July 2007 - August 2008, from three highly regarded psychology journals - the Journal of Experimental Psychology: General; Journal of Personality and Social Psychology; and Psychological Science.

In psychology, a common practice is to determine how probable (p) it is that the observed results in a study could have been obtained if the null hypothesis were true (the null hypothesis usually being that the treatment or intervention has no effect). The convention is to consider a probability of less than five per cent (p < .05) as an indication that the treatment or intervention really did have an influence; the null hypothesis can be rejected (this procedure is known as null hypothesis significance testing).

From the 36 journal issues Masicampo and Lalande identified 3,627 reported p values between .01 and .10. Their method was to see how evenly the p values were spread across that range (only studies that reported a precise figure were included). To avoid a bias in their approach, they counted the number of p values falling into "buckets" of different sizes, either .01, .005, .0025 or .00125, across the range.

The spread of p values between .01 and .10 followed an exponential curve - from .10 to .01 the number of p values increased gradually. But here's the key finding - there was a glaring bump in the distribution between .045 and .050. The number of p values falling in this range was "much greater" than you'd expect based on the frequency of p values falling elsewhere in the distribution. In other words, an uncanny abundance of reported results just sneaked into the region of statistical significance.
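The bucket-and-compare idea described above can be sketched in a few lines of Python. This is only an illustration, not the authors' actual procedure: the function names, the half-open bucket convention and the simple log-linear fit are all assumptions made for the example.

```python
import math

def bucket_counts(p_values, width=0.005, lo=0.01, hi=0.10):
    """Count p values in equal-width buckets [lo + i*width, lo + (i+1)*width).

    The half-open convention is an assumption for this sketch.
    """
    scale = 100_000  # work in integer units of .00001 to avoid float edge effects
    w, lo_i, hi_i = round(width * scale), round(lo * scale), round(hi * scale)
    counts = [0] * ((hi_i - lo_i) // w)
    for p in p_values:
        p_i = round(p * scale)
        if lo_i <= p_i < hi_i:
            counts[(p_i - lo_i) // w] += 1
    return counts

def expected_from_exponential(counts, skip):
    """Fit log(count) = a + b*i by least squares, leaving out bucket `skip`,
    and return the fitted (expected) count for the bucket that was left out."""
    pts = [(i, math.log(c)) for i, c in enumerate(counts) if i != skip and c > 0]
    n = len(pts)
    mean_i = sum(i for i, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    slope = (sum((i - mean_i) * (y - mean_y) for i, y in pts)
             / sum((i - mean_i) ** 2 for i, _ in pts))
    intercept = mean_y - slope * mean_i
    return math.exp(intercept + slope * skip)
```

With a bucket width of .005, the bucket starting at .045 has index 7, so comparing `counts[7]` with `expected_from_exponential(counts, skip=7)` gives a rough sense of whether that bucket holds more p values than the rest of the curve would predict.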

"Biases linked to achieving statistical significance appear to have a measurable impact on the research publication process," the researchers said.

The same general pattern was found regardless of whether Masicampo and Lalande analysed results from just one journal or all of them together, and mostly regardless of the size of the distribution buckets they looked at. Of course, there's a chance the intent behind their investigations could have biased their analyses in some way. To check this, a research assistant completely blind to the study aims analysed p values from one of the journals - the same result was found.

Masicampo and Lalande said their findings pointed to the need to educate researchers about the proper interpretation of null hypothesis significance testing and the value of alternative approaches, such as reporting effect sizes and confidence intervals. " ... [T]he field may benefit from practices aimed at counteracting the single-minded drive toward achieving statistical significance," they said.

Multiple top journals in economics and political science now demand, as a condition of publication, that all data be placed in a public repository. They do this so that readers need not deal with endless demands and excuses from authors who don't want to share their data. Is there a single prominent journal in psychology that does the same?

Increasing awareness of the way in which psychology embarrasses itself with weak replication standards is all to the good. But until leading journals and the APA change their policies, psychology will continue to lag the other social sciences.

Tarmo: the pre-chosen significance level is called alpha, and it is indeed set at 0.05, although APA does not preclude a justified deviation. The p-value is the calculated probability of obtaining a result at least as extreme as the one reported, assuming the null hypothesis is true. APA stipulates reporting the exact p-value.

Most of social science is worthless and merely reflects the biases of the researchers and reviewers. It's cocktail party science - fun to talk about at parties as long as you remember the findings probably aren't true.

The so called "hard" sciences are not immune to producing "cocktail party science" either. The way that some researchers in all the sciences, including biology, jump to conclusions that are not directly supported by their findings is inappropriate.

See section 4.35 in the APA Publication Manual where it reads: "When reporting p values, report exact p values (e.g., p = .031) to two or three decimal points. However, report p values less than .001 as p < .001. The tradition of reporting p values in the form p < .10, p < .05, p < .01, and so forth, was appropriate in a time when only limited tables of critical values were available. However, in tables the 'p <' notation may be necessary for clarity."

The problem with all of these comments ... both + and - ... is that they assume that null hypothesis testing is of real value in scientific work. If you really examine it, you will find that significance testing really does NOT answer the questions that you want to answer. For example, if you want to up the chance of getting a p value that lets you reject the null, just use larger ns. If that is a key ingredient, then how valuable is the p value? Null hypothesis testing has far outlived its usefulness and should be abandoned.
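The point about sample size can be illustrated with a small calculation. The sketch below assumes a one-sample z-test with known standard deviation of 1 and a fixed, tiny standardized effect; the function name `two_sided_p` and the effect size of 0.05 are made up for the example.

```python
import math

def two_sided_p(d, n):
    """Two-sided p value of a one-sample z-test when the observed
    standardized mean difference is exactly d and the sample size is n."""
    z = abs(d) * math.sqrt(n)
    # For a standard normal, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))

# The same tiny effect goes from "nowhere near significant" to
# "overwhelmingly significant" purely because n grows.
for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7}: p = {two_sided_p(0.05, n):.3g}")
```

Nothing about the effect changes between the rows of output; only the number of observations does.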

This seems unlikely given the specific range (.05 to .045) where the bump was found, but wouldn't a possible explanation for this be that when experiments are designed, researchers will often conduct a power analysis to determine how many subjects they will need to reach significance, therefore resulting in an excess of results just past the alpha level?

These kinds of biases would be of very little practical relevance if we were to substantially lower the criterion for significance. 1-in-20 already seems to be pretty high odds that published results are incorrect, even before you account for all the kinds of factors that inflate the true false positive rate. A much safer bet would be to drastically lower the statistical threshold. If we went to five sigma (i.e., p < .00000057, two-tailed), for an extreme example, then we could believe our results like physicists believe in new particles. Biases like those reported here would still exist, but would hardly matter given a good enough safety margin. It would be more expensive and there would be fewer papers published, but at least we could confidently believe what we read.
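For reference, the p value that corresponds to a given number of sigmas can be computed directly from the tail area of a standard normal distribution. This is a minimal sketch; the function name `sigma_to_p` is invented for the example, and whether a one-sided or two-sided tail is meant is a choice left to the caller.

```python
import math

def sigma_to_p(sigmas, two_sided=True):
    """Tail probability of a standard normal beyond `sigmas` standard deviations."""
    one_tail = 0.5 * math.erfc(sigmas / math.sqrt(2))
    return 2 * one_tail if two_sided else one_tail

print(sigma_to_p(1.96))                # close to the conventional .05
print(sigma_to_p(5))                   # about 5.7e-07 (two-sided)
print(sigma_to_p(5, two_sided=False))  # about 2.9e-07 (one-sided)
```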

Alternatively, we could just do away with significant thresholds altogether. This would allow for more graded assessment of empirical results, rather than categorising everything as true vs. not-sure.

I don't see the problem here. The alpha level is the risk you take of drawing a wrong conclusion. Predefining the maximum risk you're willing to take at 5% is a choice. Now, do you take a much higher risk if your p-value is 5.5% than when it's 4.5%? Obviously, when you end up with a probability of 0.0001% you're comfortable with your results. When it's 60% it's clear as well, so why is being at 4.5% a big problem?

There are many problems with this. The most obvious is that the nominal alpha value isn't the actual alpha value (which is generally unknown and could be lower or much higher if something like optional stopping or p-hacking is at work). So
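The gap between nominal and actual alpha under optional stopping can be shown with a small simulation. The sketch below is an assumption-laden illustration: the null hypothesis is true by construction, the test is a one-sample z-test with known standard deviation of 1, and the function names, batch size and number of "peeks" are all invented for the example.

```python
import math
import random

def p_value(sample):
    """Two-sided z-test p value for mean 0, assuming known sd = 1."""
    n = len(sample)
    z = abs(sum(sample) / n) * math.sqrt(n)
    return math.erfc(z / math.sqrt(2))

def false_positive_rate(peeks, batch=10, alpha=0.05, runs=2000, seed=1):
    """Share of null-effect studies declared significant when the researcher
    tests after every batch and stops at the first p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        sample = []
        for _ in range(peeks):
            sample.extend(rng.gauss(0, 1) for _ in range(batch))
            if p_value(sample) < alpha:
                hits += 1
                break
    return hits / runs

print("one planned test:   ", false_positive_rate(peeks=1))
print("peek after each ten:", false_positive_rate(peeks=10))
```

With a single planned test the rate stays near the nominal 5%; with repeated peeking it comes out well above that, even though every individual test used the same .05 threshold.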

No. That's not plausible because psychology experiments operate with noisy data, so that will mess things up (you can trivially simulate this using random draws from a population with a fixed effect size). Second, the effect size used in the power analysis is never the true effect size (which is unknown), so even if you had a sufficiently noise-free effect to study, the value plugged into the power calculation would not be the correct value to get your sample size right.
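The simulation mentioned in this comment might look something like the sketch below. It assumes a one-sample z-test with known standard deviation of 1, a true standardized effect of 0.5, and a sample size chosen for 80% power; these numbers and the function names are illustrative assumptions, not anything taken from the paper.

```python
import math
import random
from statistics import NormalDist

def sample_size_for_power(d, power=0.80, alpha=0.05):
    """Sample size for a two-sided one-sample z-test to detect effect d."""
    z = NormalDist()
    return math.ceil(((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / d) ** 2)

def simulate_p_values(d, n, studies=5000, seed=1):
    """Run many studies of size n drawn from a population with true effect d."""
    rng = random.Random(seed)
    ps = []
    for _ in range(studies):
        mean = sum(rng.gauss(d, 1) for _ in range(n)) / n
        ps.append(math.erfc(abs(mean) * math.sqrt(n) / math.sqrt(2)))
    return ps

n = sample_size_for_power(0.5)
ps = simulate_p_values(0.5, n)
share = lambda lo, hi: sum(lo <= p < hi for p in ps) / len(ps)
print("n per study:        ", n)
print("share with p < .05: ", share(0.0, 0.05))
print("share in .000-.005: ", share(0.0, 0.005))
print("share in .045-.050: ", share(0.045, 0.05))
```

Even with the sample size planned correctly for the true effect, sampling noise spreads the p values widely, and far more of them land near zero than in the sliver just under .05.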

Anon is exactly right, the real problem is with the null-hypothesis-testing paradigm itself. It is almost never scientifically interesting to know that there is an effect; the important question is how big is the effect. This question is answered not by p-values (statistical significance can always be achieved if n is large enough) but by figuring out a meaningful, interpretable parameter that measures the effect (this often requires real thought!), and then using a parameter-estimation paradigm, with confidence intervals.
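As a rough illustration of the parameter-estimation approach, the sketch below reports a difference between two group means with a 95% confidence interval instead of a p value. The data are invented, the function name is hypothetical, and the interval uses a normal approximation (a t-based interval would be somewhat wider for samples this small).

```python
import math
from statistics import NormalDist, mean, variance

def mean_difference_ci(a, b, level=0.95):
    """Difference in means (a minus b) with a normal-approximation CI."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se

# Hypothetical scores on some natural scale (e.g. items recalled).
treatment = [5, 6, 7, 8, 9]
control = [3, 4, 5, 6, 7]
diff, lower, upper = mean_difference_ci(treatment, control)
print(f"difference = {diff:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

The output is stated in the units of the measurement itself, which is what makes the size of the effect interpretable.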

The only time that null-hypothesis-testing is appropriate is if the mere existence of the effect, however small, would be surprising. An example would be something like ESP research, where mere existence would be newsworthy.

Effect sizes are still scaled to statistical variability. While better than p-values, they are still not measuring the magnitude of the effect on some natural scale.

Okay, Masicampo and Lalande (2012) found some interesting results with their "bump" right under p=.05, but what is the probability that this finding just happened by chance? I mean, if it wasn't significantly different (p<.05) than other categories, I don't think we can trust their results.