Recently researchers published a paper in which their data show, with statistical significance, that listening to a song about old age (“When I’m 64”) actually made people younger – not just made them feel younger, but rejuvenated them to a younger age. Of course, the claim lacks plausibility, and that was the point. Simmons, Nelson, and Simonsohn deliberately chose a hypothesis that was impossible in order to make a point: how easy it is to manipulate data in order to generate a false positive result.

In their paper Simmons et al. describe in detail what skeptical scientists have known and been saying for years, and what other research has also demonstrated: that researcher bias can have a profound influence on the outcome of a study. They look specifically at how data is collected and analyzed, and show that the choices researchers make can influence the outcome. They refer to these choices as “researcher degrees of freedom”: choices, for example, about which variables to include, when to stop collecting data, which comparisons to make, and which statistical analyses to use.

Each of these choices may be innocent and reasonable, and the researchers can easily justify the choices they make. But when added together these degrees of freedom allow researchers to extract statistical significance out of almost any data set. Simmons and his colleagues, in fact, found that combining four common decisions about data (using two dependent variables, adding 10 more observations, controlling for gender, and dropping a condition from the test) would produce a false positive at the p<0.05 level 60% of the time, and at the p<0.01 level 21% of the time.
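
To get a feel for how these choices add up, here is a minimal simulation sketch. It is not the authors' simulation: it assumes a simple z-test with known unit variance rather than a t-test, it uses only two of the four decisions (a second dependent variable correlated at 0.5 with the first, and adding 10 observations per group if the first look is not significant), and every function name is made up for illustration. There is no true effect anywhere in the simulated data, so every "significant" result is a false positive.

```python
import random
from statistics import NormalDist, mean

ALPHA = 0.05

def p_value(a, b):
    """Two-sided z-test p-value for a difference in means (unit variance assumed known)."""
    se = (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def draw(rng, n):
    """n subjects, each with two unit-variance measures correlated at 0.5; no true effect."""
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((z1, 0.5 * z1 + 0.75 ** 0.5 * z2))
    return rows

def any_significant(g1, g2):
    """Flexible analysis: report whichever of the two dependent variables 'works'."""
    return any(
        p_value([r[k] for r in g1], [r[k] for r in g2]) < ALPHA for k in (0, 1)
    )

def simulate(n_sims=4000, seed=1):
    """Return (strict, flexible) false positive rates over n_sims null experiments."""
    rng = random.Random(seed)
    strict = flexible = 0
    for _ in range(n_sims):
        g1, g2 = draw(rng, 20), draw(rng, 20)
        # Strict analysis: one pre-specified variable, one look at the data.
        if p_value([r[0] for r in g1], [r[0] for r in g2]) < ALPHA:
            strict += 1
        # Flexible analysis: try both variables, then add 10 per group and retry.
        hit = any_significant(g1, g2)
        if not hit:
            g1 += draw(rng, 10)
            g2 += draw(rng, 10)
            hit = any_significant(g1, g2)
        if hit:
            flexible += 1
    return strict / n_sims, flexible / n_sims

if __name__ == "__main__":
    strict_rate, flexible_rate = simulate()
    print(f"strict: {strict_rate:.3f}, flexible: {flexible_rate:.3f}")
```

The strict rate should land near the nominal 5%, while the flexible rate should come out clearly higher, using only two of the four degrees of freedom.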

This means that any paper published with a statistical significance of p<0.05 could be more likely to be a false positive than true positive.

Worse – this effect is not really researcher fraud. In most cases researchers could be honestly making necessary choices about data collection and analysis, and they could really believe they are making the correct choices, or at least reasonable choices. But their bias will influence those choices in ways that researchers may not be aware of. Further, researchers may simply be using the techniques that “work” – meaning they give the results the researcher wants.

Worse still – it is not necessary to disclose the information necessary to detect the effect of these choices on the outcome. All of these choices about the data can be excluded from the published study. There is therefore no way for a reviewer or reader of the article to know all the “degrees of freedom” the researchers had, what analyses they tried and rejected, how they decided when to stop collecting data, etc.

This is exactly why skeptics are not impressed when, for example, ESP researchers publish papers with statistically significant but small ESP effects, such as the recent Bem papers in which he purports to show a retroactive or precognitive effect. This is as impossible as music rejuvenating listeners, and skeptics properly treated it the same way: as the result of subtle data manipulation until proven otherwise. Researcher bias is one of the reasons that plausibility needs to be considered in interpreting research.

Simmons, Nelson, and Simonsohn do not just describe and document the problem, they also discuss possible solutions. They list six things researchers can do, and four things journal editors can do, to reduce this problem. These steps mainly involve transparency – disclosing all the data collected (including any data excluded from the final analysis), making decisions about end points prior to any analysis, and showing the robustness of the results by showing what the results would have been had other data analysis decisions been made. Reviewers essentially make sure this was all done.

They also discuss other options that they feel would not be effective or practical. Disclosing all the raw data is certainly a good idea, but readers are unlikely to analyze the raw data on their own. They also don’t like replacing p-value analysis with a Bayesian analysis because they feel this would just increase the degrees of freedom. I am not sure I agree with them there – for example, they argue that a Bayesian analysis requires judgments about the prior probability, but it doesn’t. You can simply calculate the change in prior probability from the new data (essentially what a Bayesian approach is), without deciding what the prior probability was. It seems to me that Bayesian vs p-value both have the same problems of bias, so I agree it’s not a solution but I don’t feel it would be worse.

They also discuss the problem with replications. An exact replication would partially fix the problem, because then all of the decisions about data collection have already been made. But, they point out, prestigious journals rarely publish exact replications, and so there is little incentive for researchers to do this. Richard Wiseman encountered this problem when he tried to publish exact replications of Bem’s psi research.

Conclusion

Science is not only a self-corrective process; the methods of science are themselves self-corrective. (So it’s self-corrective in its self-correctedness.) Simmons and his colleagues have done a great service in this article, highlighting the problem of subtle researcher bias in handling data, and also being very specific in quantifying the effects of specific data decisions, and offering reasonable remedies. I essentially agree with their conclusions, and their discussion about the implications of this problem.

They hit the nail on the head when they write that the goal of science is to “discover and disseminate truth.” We want to find out what is really true, not just verify our biases and desires. That is the skeptical outlook, and it is why we are so critical of papers purporting to demonstrate highly implausible claims with flimsy data. We require high levels of statistical significance, reasonable effect sizes, transparency in the data and statistical methods, and independent replication before we would conclude that a new phenomenon is likely to be true. This is the reasonable position, historically justified, in my opinion, because of the many false positives that were prematurely accepted in the past (and continue to be today).

Science works, but it’s hard. There are many ways in which errors and bias can creep into research and so researchers have to be vigilant, journal reviewers and editors have to be vigilant, and the scientific community needs to continue to self-examine and look for ways to make the process of science more reliable. Those institutions and professions that lack this rigorous self-critical and self-corrective culture and process are simply not truly scientific.

18 Responses to “Publishing False Positives”

I am no Bayesian expert, but I am pretty sure you can’t do Bayesian updating, i.e. calculate a change from prior to posterior, without making some assumption about the prior. This is the whole appeal of Bayesian methods over frequentist methods, no? If something has an extremely low or high prior, then additional data to the contrary will not update the posterior very much. So the amount a particular data set will update the posterior probability depends necessarily on the prior probability.

Most of the time this is resolved by assuming an “uninformative prior”, i.e. all outcomes are equally likely, and then deriving a posterior from that. But you are still making an assumption about the prior.

mlegower – I had that same question. I specifically asked Wagenmakers about it – he was referenced in the current article as supporting a Bayesian approach. He said you can do a Bayesian analysis without making a judgment about the prior probability. Essentially you can make statements about how much the prior probability will change, and perhaps this can be adjusted to whatever you take to be the prior probability.

Having worked in market research for a few years, this idea of fishing for significance is all too prevalent, generally without the analysts themselves realising or understanding the likelihood/existence of false positives.

The idea of sharing the data is probably the best way to address this for public research – it not only helps more technical readers when trying to understand the nightmare of expressing p values, relative risks and sample sizes in prose, but also in enabling other researchers to use the data for their own research when trying to replicate or build on the findings.

It should also be said that more accurate and descriptive method sections would often make these effects fairly clear to even an outsider.

Wow, this takes me back a long way, back to the days when I was learning about statistics, and studied some ESP experiments! The paper linked to is a really nice piece of work. A few observations – Steve, I think you meant to write p < 0.05 rather than p = 0.5 (two places). Second, with regards to degrees of freedom, we are used to using the factor sqrt(N-1) to increase computed sample standard deviation relative to the true population value, where N is the sample size. This comes from "fixing" one degree of freedom, namely the mean. But the general relationship is sqrt(N – m), where m is the number of degrees of freedom "used up". Thus, if one applies enough conditions to the data so that m = N – 1 (you can't have it larger), then the correct sample standard deviation to report would be sqrt(N) times as large as the true population value.

Since p is so nonlinear in the S.D., this alone would greatly increase the proper p values for a lot of research.

Third, the reported p values are nearly always based on Gaussian distributions. But it's almost never established that the distributions in question are actually near-Gaussian. For an arbitrary distribution, there is a bound relating p to the number of S.D., and it's given by the Chebychev inequality. For a p value of 0.05, you have to be about 4.5 S.D. out from the mean, instead of about 2 as with the Gaussian distribution.

In my opinion, if a researcher can't demonstrate that his distribution is Gaussian, he should use Chebychev's inequality instead. That would kick a lot of results out of "statistical significance".
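
For anyone who wants to check the comparison, here is a small sketch. Chebychev's inequality says that for any distribution with finite variance, P(|X - mean| >= k S.D.) <= 1/k^2, so the guaranteed cutoff for a given p is k = sqrt(1/p). The function names are made up for illustration, and the Gaussian cutoff assumes a two-sided test.

```python
from math import sqrt
from statistics import NormalDist

def chebyshev_k(p):
    """Smallest k for which Chebychev guarantees P(|X - mean| >= k*SD) <= p."""
    return sqrt(1 / p)

def gaussian_k(p):
    """Two-sided cutoff in SD units if the distribution really is Gaussian."""
    return NormalDist().inv_cdf(1 - p / 2)

for p in (0.05, 0.01):
    print(f"p = {p}: Gaussian {gaussian_k(p):.2f} SD, Chebychev {chebyshev_k(p):.2f} SD")
```

For p = 0.05 this gives roughly 1.96 S.D. under the Gaussian assumption versus roughly 4.47 S.D. with no distributional assumption at all. The Chebychev bound is a worst case, so it is very conservative for most real data.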

I will be interested to see what he says. But my understanding from my admittedly limited graduate education in Econometrics, both Bayesian and otherwise, is that every Bayesian analysis begins at least implicitly with a prior distribution of the parameter values of the data generating process and ends with posterior confidence intervals (and point estimates if you like) of those same parameters which are implied by the prior and the data.

Obviously the shift in the point estimate and the width of the interval around that point estimate indicate the degree to which the posterior changed from the prior, in the same sense that a frequentist analysis would produce a test statistic that indicates the ex ante probability that the data came from the parameters specified in the null hypothesis.

In any event, the same criticism that applies to the Bayesians (freedom to choose the prior) applies to the frequentists (freedom to choose the null hypothesis); it’s just that frequentists almost always choose the same straw man null (some parameter(s) = 0) which is identical in most cases to a Bayesian having a uniform (or uninformative) prior. Nothing stops a frequentist from testing the null hypothesis that a given parameter is different from 2, 1000, Pi, or any other number and thereby getting a “statistically significant” hypothesis test. But in every case you have to be clear on what hypothesis is being tested and on what prior distribution is assumed.

The posterior probability P(H|E) is the product of the prior probability P(H) and a “Bayes Factor” K = P(E|H)/P(E) which does not directly depend on P(H). If K can be computed, it can be reported separately from P(H) and allow anyone to plug in their own prior probability to compute their own posterior probability.

K isn’t completely independent of P(H), as P(E) = P(H)P(E|H) + P(not H)P(E|not H). K is monotonic on P(H), and goes to 1 as P(H) goes to 1, so one can use P(E|H)/P(E|not H) to give a limit on K.

If both P(E|H) and P(E|not H) were reported, the reader could calculate their Bayes Factor and posterior probability themselves, without the researcher having to assume any given prior themselves.

It’s depressing that this wasn’t found and subsequently fixed 50 years ago. Science is pathetically slow in fixing and improving meta-science–it still uses a centuries old journal system. Also, this applies to all scientific research, and not just ESP research.

For example, there is a wide body of research showing humans detect pheromones, and that this has all kinds of interesting effects. But all this research lacks basic scientific plausibility in that adult humans don’t have a functional vomeronasal organ.

This kind of researcher bias and others will be magnified in fields or areas that have a motivational component or incentive: medicine, evolutionary psychology, economics, and so on.

blaisepascal- “If both P(E|H) and P(E|not H) were reported, the reader could calculate their Bayes Factor and posterior probability themselves, without the researcher having to assume any given prior themselves.”

[Presented entirely absent of hostility and in the interest of mutual education]

But the formula for the Bayes’ factor is K = P(E|H)/[P(H)P(E|H) + P(not H)P(E|not H)], which means that you have to assume something about P(H) to calculate it, right? Which means that you can report the probabilities of observing the data given each regime (P(E|H) and P(E|~H)), but you can’t go on to infer anything about the probabilities of each regime given the data unless you establish a prior over the regimes, correct? But if you are only reporting the probabilities of observing the data given the regime, then you are back to what is essentially a frequentist approach I would imagine.

Certainly, given the data and the methods, you can establish the posterior for any prior you might have. And maybe the best route is to report simply P(E|H), P(E|~H), and Bayes Rule so that the interested observer can plug in their own prior. You can even test the sensitivity of the posterior to the choice of prior. But it seems like that is the nature of the criticism above.

Bayesian hypothesis tests are based on the odds form of Bayes theorem:

(posterior odds) = (Bayes factor) * (prior odds), where

(Bayes factor) = P(D|H1) / P(D|H0).

The Bayes factor, above, is the amount by which the data change your degree of belief in the alternative (versus the null) hypothesis. As is evident from the odds form of Bayes theorem, the Bayes factor is independent of the prior odds, your degree of belief in the hypothesis before seeing the data. Different people will have different prior odds of the hypothesis, and the Bayes factor is independent of those subjective judgments.
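
A short sketch of the odds form in use, for readers who want to plug in their own prior. The function name and the example Bayes factor of 10 are illustrative assumptions, not values taken from any study.

```python
def posterior_probability(prior_probability, bayes_factor):
    """Apply (posterior odds) = (Bayes factor) * (prior odds), then convert back."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1 + posterior_odds)

# Same evidence (Bayes factor of 10), three readers with different priors.
for prior in (0.5, 0.1, 0.001):
    print(prior, round(posterior_probability(prior, 10), 4))
```

With a Bayes factor of 10, a reader who starts at even odds ends up around 0.91, one who starts at 0.1 ends up around 0.53, and one who starts at 0.001 ends up around 0.01. The evidence is the same in each case; only the starting point differs.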

However, in order to calculate the Bayes factor, itself, the statistician must specify a prior distribution on the alternative hypothesis. This distribution is needed to calculate P(D|H1), the numerator of the Bayes factor. The choice of distribution on the alternative hypothesis will affect the Bayes factor, and so, some subjectivity in a Bayesian hypothesis test is unavoidable.

However, that does not mean that anything goes. Some distributions on the alternative are more reasonable than others, and whatever distribution is chosen, it needs to be disclosed and justified. Furthermore, a sensitivity analysis can be conducted to investigate how different choices of reasonable distributions affect the Bayes factor.
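
Here is a minimal sketch of such a sensitivity analysis, under assumptions chosen purely to keep the arithmetic simple: the data are summarized by a sample mean from n observations with known standard deviation sigma, H0 says the true mean is exactly 0, and H1 puts a normal prior with mean 0 and standard deviation tau on the true mean. Under those assumptions the sample mean is distributed N(0, sigma^2/n) under H0 and N(0, tau^2 + sigma^2/n) under H1. The function name and the numbers are hypothetical.

```python
from statistics import NormalDist

def bayes_factor_10(sample_mean, n, sigma, tau):
    """P(D|H1) / P(D|H0) for H0: mean = 0 versus H1: mean ~ Normal(0, tau)."""
    se = sigma / n ** 0.5
    like_h0 = NormalDist(0, se).pdf(sample_mean)
    # Under H1 the prior spread tau adds to the sampling spread.
    like_h1 = NormalDist(0, (tau ** 2 + se ** 2) ** 0.5).pdf(sample_mean)
    return like_h1 / like_h0

# Same data (mean 0.4 from 25 observations, sigma 1, i.e. z = 2), different priors on H1.
for tau in (0.2, 0.5, 1.0, 5.0):
    print(f"tau = {tau}: BF10 = {bayes_factor_10(0.4, 25, 1.0, tau):.2f}")
```

With these made-up numbers, a modest prior spread (tau = 0.5) gives a Bayes factor of about 2 in favor of the alternative, while a very diffuse prior (tau = 5) gives a Bayes factor below 1, favoring the null. That is why the choice of distribution has to be disclosed and justified.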

Your understanding of the Bayes factor is wrong. The Bayes factor does not appear in the form of Bayes theorem that you have presented. It only appears in the odds form of Bayes theorem, which I gave in my previous post. From that equation, it is evident that the Bayes factor does not depend on the prior odds; however, it does depend on the statistician’s choice of distribution for the alternative hypothesis, as I explained, above.

jt512 is correct about how the Bayes factor works (you can assume a prior probability and/or probability distribution, but you must make the assumption).

Further for Bayes to be valid- the information must be coming in randomly– that is the next piece of information must come randomly from all possible sources of information about the topic. (One can’t look into the bag before picking a ball)
If a researcher decides to do a study based on his/her understanding of a situation– then it is questionable that Bayes applies. (He looked into the bag and made decisions about how to pick the next ball)

And I do apologize for the strained analogy.

I’m pretty sure the misuse of statistics is not new. Back in the late 1970s and early 1980s computer programs were developed for statistical analysis. These were then used by people who had no idea about the limitations of the mathematics.

For example– a statistical analysis is valid for a population that has been randomly sampled.
What is the population that is randomly sampled in the case of a study done on college sophomores who got paid to do the study?
The answer is NONE. It is not a random sample of any population.
So to make any conclusions about any population (other than the actual participants) using this method to get subjects is not valid according to the math.

A small study of non-randomly selected people can be a means of doing a study– the results of which might be interesting enough to do a real study (costly, time consuming).
This is one reason replication is important– but who gets paid for that? Heck, it seems the magazine wouldn’t publish attempts (both successful and not) to replicate Bem’s work.
The demand for novel findings seems higher than the demand for careful analysis and testing of said results right now.

Reading [the study], I found this: “We used father’s age to control for variation in baseline age across participants.” Can someone explain what this means?

The reason you don’t understand that sentence is because it is utter nonsense. “Baseline” refers to the starting time of a longitudinal study—a study where repeated measures are taken on subjects over time. “Baseline age” would then be the ages of the subjects at the beginning of the study.

However, the study (Study 2) in the paper was not longitudinal; it was cross-sectional: only a single measurement was taken on each subject. Therefore, “baseline,” and hence “baseline age,” have no meaning. Furthermore, even if the study were longitudinal, you could not use the subjects’ fathers’ ages to adjust for differences in the subjects’ ages between experimental groups. If you wanted to make such an adjustment, you’d put the subjects’ own baseline ages in the models, not their fathers’.

What the “researchers” in this “study” (Study 2) appear to have done was to divide subjects into groups who listened to one of two songs. There was no significant difference in the mean ages of the subjects between the groups. The researchers then tried statistically adjusting the subjects’ ages by using a number of nonsensical factors until they found one that produced a statistically significant difference in the subjects’ mean ages between groups. That factor happened to be the subjects’ fathers’ ages. They then dreamt up some science-y sounding rationale (“adjusting for baseline age”) to give the procedure the appearance of legitimacy. They then made the ridiculous claim that one of the songs caused a regression in age for the subjects who listened to it.

It was a silly exercise, because a difference in ages between the groups, whether due to nonsensical statistical modeling or not, does not imply regression in age.

Under weak assumptions it’s possible to show that, if you claim to have made a discovery when you observe P = 0.047, you have at least a 30% chance of being wrong (and a lot worse if it’s an implausible hypothesis).

P values do exactly what’s claimed of them. The problem is that what they tell you isn’t what you want to know. What you want to know is the false discovery rate. i.e. the probability that a “significant” result is wrong. A lot of people think that’s what the P value tells you. It isn’t.
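
A rough sketch of the simplest version of that calculation. It treats every result with p below alpha as a "discovery" and asks what fraction of those discoveries are false. This is not the calculation behind the 30% figure for P = 0.047, which looks at results close to a particular P value. The function name and the inputs (fraction of tested hypotheses that are really true, and statistical power) are assumptions for illustration.

```python
def false_discovery_rate(prior_true, power, alpha=0.05):
    """Fraction of 'significant' results that are false positives."""
    false_positives = alpha * (1 - prior_true)  # true nulls that reach significance
    true_positives = power * prior_true         # real effects that reach significance
    return false_positives / (false_positives + true_positives)

# If only 10% of tested hypotheses are true and power is 80%:
print(round(false_discovery_rate(0.1, 0.8), 2))
```

With those inputs, 36% of "significant" results are false positives even though every test was run honestly at p<0.05. The lower the prior plausibility of the hypotheses being tested, the worse this gets.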