A new tool called the R-factor could help ensure that science is reproducible and valid, according to a preprint posted on bioRxiv: Science with no fiction. The authors, led by Peter Grabitz, are so confident in their idea that they’ve created a company called Verum Analytics to promote it. But how useful is this new metric going to be?

Not very useful, in my view. The R-factor (which stands for “reproducibility, reputation, responsibility, and robustness”) strikes me as a flawed idea.

The R-factor of any result is calculated “simply by dividing the number of published reports that have verified a scientific claim by the number of attempts to do so.” In other words, it’s the proportion of published attempts to confirm a claim that were successful. Only independent attempts count. For “an investigator, a journal, or an institution”, their R-factor is the average of the R-factors for all the claims they’ve published.
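As defined, the calculation is trivial. Here’s a minimal sketch in Python; the function name and inputs are my own illustration, not from the preprint:

```python
def r_factor(confirmations: int, attempts: int) -> float:
    """R-factor as described in the preprint: the fraction of
    independent published attempts that verified a claim."""
    if attempts == 0:
        raise ValueError("R-factor is undefined with no published attempts")
    return confirmations / attempts

# e.g. a claim confirmed by 7 of 10 independent published reports:
print(r_factor(7, 10))  # 0.7
```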

Here are my main concerns with this idea:

1) It’s subject to publication biases

The R-factor is a summary of the published literature. We know that the literature is biased: for instance, positive results are more likely to be published than negative ones. Grabitz et al. know this as well; in fact, they suggest that the R-factor could help to solve these kinds of problems. But the R-factor, which takes the published literature ‘at face value’, will itself be affected by publication bias, p-hacking, etc.

There are many examples of results which have been ‘replicated’ in many papers (i.e. with a high R-factor) yet which on closer inspection are statistically improbable. This is the motivation behind proposals such as p-curve analysis and the R-Index (not related to the R-factor). These methods test whether the literature is plausible, rather than just assuming that it is, as the R-factor does.

2) It’s simplistic

The R-factor adopts a ‘show of hands’ definition of reproducibility: count the papers that support a claim, count the ones that refute it, and work out the percentages. This approach treats all studies as equally informative, but they rarely are. What about the sample sizes, for instance? Shouldn’t a study with 1,000 datapoints count more than a study with 10? In the R-factor, they’re treated the same.

There’s a deeper problem. It’s simplistic to treat every study in a black and white way as either “confirming” or “refuting” a claim. In reality, data may strongly support a hypothesis, weakly support it, or be inconclusive, and everything in between.

Now, it might be possible to modify the R-factor to address these criticisms. We could weight studies by sample size, for example. However, if we make these modifications, we’d soon end up re-inventing the existing and widely used technique of meta-analysis. Which brings me onto the next point:

3) It doesn’t improve on what we already have (meta-analysis)

The R-factor has no advantages over a proper meta-analysis. I suppose the R-factor might be easier to calculate in some cases, but probably not by much. Finding an R-factor requires us to check many papers (the authors suggest all of the papers citing the original study in question) and check whether the results confirm or refute the hypothesis. If we’re doing that, why not also record the results needed for a meta-analysis?

4) It glosses over hard questions

A selling-point of the R-factor is that it’s easy to use: “The R-factor is relatively easy to calculate, as the process… can be done by anyone with a general expertise in biomedical research.” However, this seems naive. If we ask “how many studies confirm the existence of phenomenon X?”, this begs at least two questions: what is X? And what does it take to confirm it? Both may be substantial scientific or even philosophical questions.

Suppose for example that we’re calculating the R-factor for the claim that ‘antidepressants cause suicide’. We find a paper reporting that antidepressants increase suicide attempts but not suicide deaths. Does that confirm the hypothesis, refute it, or neither? Opinions might differ. This is not a contrived example, it’s based on a real debate. So two people could calculate two different R-factors from the same literature.

5) It’s an impoverished metric

If my claim has only been tested once, and passed that one test, it will have an R-factor of 1. If your claim has passed 99 out of 100 tests, it will have a lower R-factor than mine (0.99), yet most people would say that your claim is more replicable than mine. The R-factor doesn’t take the number of replications into account. This problem could be fixed, perhaps, by adding some kind of a confidence interval to the measure. (Edit: in fact the authors sometimes use subscripts to indicate the number; but not consistently. See comments.)
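To illustrate what a confidence interval would add here, the Wilson score interval for a binomial proportion makes the 1-out-of-1 case look as uncertain as it really is. This is my own sketch of one possible fix, not anything proposed in the preprint:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# One test passed out of one: R-factor 1.0, but a very wide interval.
print(wilson_ci(1, 1))     # roughly (0.21, 1.0)
# 99 passes out of 100: R-factor 0.99, with a much tighter interval.
print(wilson_ci(99, 100))  # roughly (0.95, 1.0)
```

The 99/100 claim has a lower point estimate but a far more informative interval, which matches the intuition that it is the better-replicated claim.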

*

To be fair to Grabitz et al., I think they have a specific kind of study in mind for the R-factor, namely molecular biology. The authors don’t explicitly state this limitation, saying instead that “The R-factor is universal in that it is applicable to any scientific claim”, but most of the examples they give are from cancer biology.

For molecular biology, the R-factor does make some sense. Molecular biology studies don’t tend to use statistics. The results are presented in a qualitative manner, illustrated with blots. You can’t meta-analyze blots: they either show the pattern you’re looking for, or they don’t. So for this kind of study, my first three objections to the R-factor don’t really apply.

So the R-factor might work in some fields, but I don’t think it’s appropriate for any science that uses statistics – which includes the great majority of psychology and neuroscience.

For #5, they do put the number of tests as a subscript for the R factor, which gives a rough indication of uncertainty.

But I agree, this is a simplistic measure that doesn’t seem very useful. I could also see it having significant downsides – some researchers will be hit with unfairly low R values (e.g. all large studies show the effect, but many low-power studies do not), and others will tout their unfairly high R values (e.g. due to publication bias).

http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

Good point, they do sometimes use a subscript, but they don’t use it consistently. For instance: “The R-factor of 1.0 for the claim by Ward et al. can be explained by the fact that the claim can be verified unambiguously by measuring activity of the IDH mutants with an established approach. The R-factor of 0.88 for the claim by Willingham et al…”

Nonetheless I have updated the post to make this clear.

Leonid Schneider

I honestly don’t see how this R-factor analytics can work in molecular biology. First of all, the literature is ridiculously biased by bandwagon papers. Every time someone claims something fishy in a Cell paper, it gets “reproduced” many times, simply because there are enough bad scientists around.
But in any case I can’t imagine this R-factor being entirely machine-calculated. So we will have a human arbitrator element here, which opens the door to outside influence or even corruption.

http://www.mazepath.com/uncleal/qz4.htm Uncle Al

Psychology is not factual, by design. Psychology provides malleable circumstantial evidence for psychiatry (re the Church of Rome, the First Council of Nicaea, and Aristotle). Psychiatry and DSM-5 are terrorism plus pecuniary corruption, both through Big Pharma, all together summing to contemporary Congregatio pro Doctrina Fidei.

Nah…I’ve been clicking here almost daily because all the news sites went berserk last November, and many days the first comment I see is “This user is blocked.” I don’t know why I even bothered mentioning it.

R-Factor ironically sounds like a solution someone who caused this problem would come up with – in other words, they don’t really get it.

Re: Antidepressants and suicide. Does anyone really honestly believe at this point that SSRIs do not increase suicide? I mean, we now know that 22 people committed suicide while taking Paxil in clinical trials and 0 did so on placebo. I don’t need to be a genius to understand the implications of that. Or to recognize that the manufacturers lied about the data for other SSRIs, just as Glaxo did for Paxil. It’s also not hard to see that suicide and other side effects are dose dependent and therefore will be worse for more potent drugs with shorter half-life.

Yuri Lazebnik

Dear Neuroskeptic,

Thank you for writing about our preprint and the R-factor. We are pleased that although the evaluation is overall skeptical, as expected, it does suggest that the R-factor may work at least for some areas of biology.

We would like to address the skeptical part of the evaluation and the comments your post has generated, but first would make several general clarifications.

First, we see the R-factor as a scoreboard, or a barometer of what is happening in various research fields, not as an ideal arbitrator or guarantor of science.

Second, we think that revealing information currently buried under the sea of spurious citations would, at the very least, help researchers to more accurately assess a published claim. In other words, we think it is easier to evaluate ten papers relevant to the veracity of the claim than several hundred.

Third, the R-factor offers an evaluation approach that is qualitatively different from the current standard of “the more citations a paper has the better it is”.

Now, to the criticisms.

“1) It’s subject to publication bias.”

Indeed, negative results often remain unpublished, which skews the perception (the mental R-factor, so to speak) towards viewing published reports as correct, especially if they come from prominent laboratories and appear in prominent journals. The R-factor would reflect this bias as a fact of life, as it would reflect the attempts to correct the bias by the approaches you are mentioning. To help mitigate the effect of publication bias, we plan to include preprints and theses in our calculations. Moreover, we think that the R-factor would encourage the publication of more “negative” results, replications, and overall tests of an idea.

“2) It’s simplistic”

We would like to suggest that the R-factor is simple, not simplistic. To cite the author, “In reality, data may strongly support a hypothesis, weakly support it, or be inconclusive, and everything in between.” We agree. However, whether something is weakly supportive or strongly supportive can be in the eye of the beholder. To simplify this complexity we do rely on a “show of hands”, keeping in mind that they represent independent studies that have passed peer review.

“3) It does not improve over what we already have (meta-analysis)”

Full-scale meta-analyses are difficult and time-consuming endeavors, which, in practice, limits them to a relatively small number of scientific claims. Furthermore, meta-analysis is also not an ideal approach, as it is known to suffer from a number of biases (see here and here for examples). As we mention in our article, we envision using machine learning to classify citations, which would allow the R-factor to be deployed at a large scale (theoretically, it could be applied to all published literature). The R-factor would be calculated with a predetermined algorithm that is applied in a uniform fashion, which would minimize systematic bias.

“4) It glosses over hard questions”

“If we ask “how many studies confirm the existence of phenomenon X?”, this begs at least two questions: what is X? And what does it take to confirm it? Both may be substantial scientific or even philosophical questions.”

We would like to emphasize that we did not gloss over philosophical questions but rather intentionally tried to develop a practical approach that would bypass them altogether, as some of these questions have been debated with no definitive resolutions for centuries or millennia.

Alfred Korzybski, the founder of general semantics, argued that an attempt to define precisely what a pencil is could be unexpectedly frustrating, not to mention defining more complex things and concepts. Yet, scientists operate with things and concepts successfully if they agree on definitions.

To use the example you provided:

“Suppose for example that we’re calculating the R-factor for the claim that ‘antidepressants cause suicide’. We find a paper reporting that antidepressants increase suicide attempts but not suicide deaths. Does that confirm the hypothesis, refute it, or neither? Opinions might differ. This is not a contrived example, it’s based on a real debate. So two people could calculate two different R-factors from the same literature.”

To answer this question one needs to define what suicide is and whether it is different from a suicide attempt. According to Silverman et al. (DOI: 10.1521/suli.2007.37.3.264) “Suicide Attempt is now defined as a self-inflicted, potentially injurious behavior with a nonfatal outcome for which there is evidence (either explicit or implicit) of intent to die. […] If the suicide attempt resulted in death, it is defined as a Suicide.” Hence, in the provided example, the paper refutes the hypothesis that antidepressants cause suicide.

“5) It’s an impoverished metric”

“If my claim has only been tested once, and passed that one test, it will have an R-factor of 1. If your claim has passed 99 out of 100 tests, it will have a lower R-factor than mine (0.99), yet most people would say that your claim is more replicable than mine. The R-factor doesn’t take the number of replications into account. This problem could be fixed, perhaps, by adding some kind of a confidence interval to the measure.”

The R-factor indicates in the subscript the number of studies used to calculate it. We leave it to the user to decide on the significance of its value.

“So the R-factor might work in some fields, but I don’t think it’s appropriate for any science that uses statistics – which includes the great majority of psychology and neuroscience.”

We would like to suggest that statistics alone might be insufficient to evaluate a claim, if only because statistics by itself cannot distinguish correlation from causation. For example, the rate of suicide correlates with the spending on science: http://tylervigen.com/view_correlation?id=1597

Comment by a reader: “But in any case I can’t imagine this R-factor being entirely machine-calculated. So we will have a human arbitrator element here, which open door to outside influence or even corruption.”

Any human activity is open to corruption. We plan to avoid this pitfall by making all the protocols, the technology, and the results open to public scrutiny.

As to what is possible and what is not, we keep in mind the statement attributed to Bohr, that predicting is difficult, especially about the future. We invite readers to help us train our automatic classifier to find whether scientific articles can be classified automatically.

Thank you, Neuroskeptic, and the authors of the comments for a very interesting discussion, which, we hope, will continue because improving the practice of science, which is our primary goal, requires a communal effort.

“However, whether something is weakly supportive or strongly supportive can be in the eye of the beholder.”
Are we talking science? Because what you say applies more to the arts.
An ICC of 0.9 is strongly supportive; there is no debate about it, and it doesn’t depend on whether I like it or not.

In Bayesian terms, a Bayes Factor of <3 is considered “anecdotal” evidence for a hypothesis, 3-10 is “substantial” and >10 is “strong” evidence.

These cut-offs are of course arbitrary, although so long as they are consistently applied, they serve well enough.

However, we don’t need cut-offs at all if we do a meta-analysis. Then we will take into account the amount of evidence provided by each study, based on the sample size, effect size and any other relevant factors. We can then calculate the estimate of the overall effect with a confidence interval.
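The inverse-variance weighting I have in mind is standard meta-analysis machinery. A minimal fixed-effect sketch (the effect sizes and standard errors below are made up for illustration):

```python
import math

def fixed_effect_meta(effects, ses):
    """Inverse-variance (fixed-effect) pooled estimate with a 95% CI.
    effects: per-study effect estimates; ses: their standard errors."""
    weights = [1 / se**2 for se in ses]           # precision weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

# One large precise study and two small noisy ones: the large study
# dominates the pooled estimate rather than getting one equal 'vote',
# which is exactly what a show-of-hands R-factor cannot do.
est, ci = fixed_effect_meta([0.40, -0.10, 0.05], [0.05, 0.30, 0.25])
print(round(est, 3), ci)
```

Under a vote count, this literature is one “confirmation” out of three attempts; under meta-analysis, the pooled estimate sits close to the large study’s result, with an interval that quantifies the remaining uncertainty.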

Yuri Lazebnik

Thanks!

Erik Bosma

Sounds like a couple of guys capitalizing on our inherent laziness.

Bob Beeman

Sounds like science by committee.

polistra24

Using logic and accepting the facts of genetics would eliminate all of these problems. Social “science” is incapable of both, so the problems will continue. The proper answer is to eliminate social “science” from all grants and journals, stop calling it science. That’s not gonna happen either, because social “science” serves the evil ends of governments and corporations too perfectly.

Giorgio Arcara

I sincerely don’t understand why there is such an interest in summarizing complex things with just one number.

Brenton Wiernik

Yes, this is just a simple repackaging of vote count methods of meta-analysis. It will have all of the problems of those methods, with the added problem that a tiny denominator (few published attempts) will make papers with low reproducibility look unreasonably robust.

Dr__P

Unless you eventually focus on effect sizes and practical significance instead of p values and statistical significance, nothing is accomplished

A negative finding is not the same as a null finding. How many studies purport to report “negative” findings when only the null is tested? Distinctive wording of the hypothesis is important when reporting findings.

C’est la même

How many published reports have there been on the Higgs boson again? 😉

John S.

Neuroskeptic, I’d be interested in having you be part of an online symposium with a handful of philosophers, neuroscientists, and psychologists regarding a new paper about the proper definition of statistical significance. Would you consider e-mailing me at the address affiliated with this ID? Sorry for this cruddy way of contacting you — I couldn’t find another one!

Sys Best

Finally, greatly needed! Please, let us know where we can find updates about it.

Mike Fainzilber

I beg to differ on your comment re’ blots – blots can (and should) be quantified, and any good molecular biology study can (and should) be backed up by quantified data and appropriate statistical tests.

As to the proposed ‘R-factor’, the esteemed authors of that preprint might want to take a look at the science of past eras – some reading about phlogiston or phrenology might be instructive. Artefacts are typically highly reproducible…

http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

Thanks for the comment. I was aware that some blots could in theory be quantified, but my understanding was that this generally isn’t done in most papers. And that there has never been a “blot meta-analysis”.


About Neuroskeptic

Neuroskeptic is a British neuroscientist who takes a skeptical look at his own field, and beyond. His blog offers a look at the latest developments in neuroscience, psychiatry and psychology through a critical lens.