One of the cornerstones of scientific research is the reproducibility of findings. Novel scientific observations need to be validated by subsequent studies in order to be considered robust. This has proven to be somewhat of a challenge for many biomedical research areas, including high impact studies in cancer research and stem cell research. The fact that an initial scientific finding of a research group cannot be confirmed by other researchers does not mean that the initial finding was wrong or that there was any foul play involved. The most likely explanation in biomedical research is that there is tremendous biological variability. Human subjects and patients examined in one research study may differ substantially from those in follow-up studies. Biological cell lines and tools used in basic science studies can vary widely, depending on so many details such as the medium in which cells are kept in a culture dish. The variability in findings is not a weakness of biomedical research, in fact it is a testimony to the complexity of biological systems. Therefore, initial findings need to always be treated with caution and presented with the inherent uncertainty. Once subsequent studies – often with larger sample sizes – confirm the initial observations, they are then viewed as being more robust and gradually become accepted by the wider scientific community.

Even though most scientists become aware of the scientific uncertainty associated with an initial observation as their career progresses, non-scientists may be puzzled by shifting scientific narratives. People often complain that “scientists cannot make up their minds” – citing examples of newspaper reports such as those which state drinking coffee may be harmful only to be subsequently contradicted by reports which laud the beneficial health effects of coffee drinking. Accurately communicating scientific findings as well as the inherent uncertainty of such initial findings is a hallmark of critical science journalism.

A group of researchers led by Dr. Estelle Dumas-Mallet at the University of Bordeaux recently studied the extent of uncertainty communicated to the public by newspapers when reporting initial medical research findings in their recently published paper “Scientific Uncertainty in the Press: How Newspapers Describe Initial Biomedical Findings“. Dumas-Mallet and her colleagues examined 426 English-language newspaper articles published between 1988 and 2009 which described 40 initial biomedical research studies. They focused on scientific studies in which a new risk factor such as smoking or old age had been newly associated with a disease such as schizophrenia, autism, Alzheimer’s disease or breast cancer (total of 12 diseases). The researchers only included scientific studies which had subsequently been re-evaluated by follow-up research studies and found that less than one third of the scientific studies had been confirmed by subsequent research. Dumas-Mallet and her colleagues were therefore interested in whether the newspaper articles, which were published shortly after the release of the initial research paper, adequately conveyed the uncertainty surrounding the initial findings and thus adequately preparing their readers for subsequent research that may confirm or invalidate the initial work.

The University of Bordeaux researchers specifically examined whether headlines of the newspaper articles were “hyped” or “factual”, whether they mentioned whether or not this was an initial study and clearly indicated they need for replication or validation by subsequent studies. Roughly 35% of the headlines were “hyped”. One example of a “hyped” headline was “Magic key to breast cancer fight” instead of using a more factual headline such as “Scientists pinpoint genes that raise your breast cancer risk“. Dumas-Mallet and her colleagues found that even though 57% of the newspaper articles mentioned that these medical research studies were initial findings, only 21% of newspaper articles included explicit “replication statements” such as “Tests on larger populations of adults must be performed” or “More work is needed to confirm the findings”.

The researchers next examined the key characteristics of the newspaper articles which were more likely to convey the uncertainty or preliminary nature of the initial scientific findings. Newspaper articles with “hyped” headlines were less likely to mention the need for replicating and validating the results in subsequent studies. On the other hand, newspaper articles which included a direct quote from one of the research study authors were three times more likely to include a replication statement. In fact, approximately half of all the replication statements mentioned in the newspaper articles were found in author quotes, suggesting that many scientists who conducted the research readily emphasize the preliminary nature of their work. Another interesting finding was the gradual shift over time in conveying scientific uncertainty. “Hyped” headlines were rare before 2000 (only 15%) and become more frequent during the 2000s (43%). On the other hand, replication statements were more common before 2000 (35%) than after 2000 (16%). This suggests that there was a trend towards conveying less uncertainty after 2000, which is surprising because debate about scientific replicability in the biomedical research community seems to have become much more widespread in the past decade.

As in all scientific studies, we need to be aware of the analysis performed by Dumas-Mallet and her colleagues. They focused on analyzing a very narrow area of biomedical research – newly identified risk factors for selected diseases. It remains to be seen whether other areas of biomedical research such as treatment of diseases or basic science discoveries of new molecular pathways are also reported with “hyped” headlines and without replication statements. In other words – this research on “replication statements” in newspaper articles also needs to be replicated. It is not clear that the worrisome trend of over-selling robustness of initial research findings after the year 2000 still persists since the work by Dumas-Mallet and colleagues stopped analyzing studies published after 2009. One would hope that the recent discussions about replicability issues in science among scientists would reverse this trend. Even though the findings of the University of Bordeaux researchers need to be replicated by others, science journalists and readers of newspapers can glean some important information from this study: One needs to be wary of “hyped” headlines and it can be very useful to interview authors of scientific studies when reporting about new research, especially asking them about the limitations of their work. “Hyped” newspaper headlines and an exaggerated sense of certainty in initial scientific findings may erode the long-term trust of the public in scientific research, especially if subsequent studies fail to replicate the initial results. Critical and comprehensive reporting of biomedical research studies – including their limitations and uncertainty – by science journalists is therefore a very important service to society which contributes to science literacy and science-based decision making.

Reproducibility of findings is a core foundation of science. If scientific results only hold true in some labs but not in others, then how can researchers feel confident about their discoveries? How can society put evidence-based policies into place if the evidence is unreliable?

Recognition of this “crisis” has prompted calls for reform. Researchers are feeling their way, experimenting with different practices meant to help distinguish solid science from irreproducible results. Some people are even starting to reevaluate how choices are made about what research actually gets tackled. Breaking innovative new ground is flashier than revisiting already published research. Does prioritizing novelty naturally lead to this point?

Incentivizing the wrong thing?

One solution to the reproducibility crisis could be simply to conduct lots of replication studies. For instance, the scientific journal eLife is participating in an initiative to validate and reproduce important recent findings in the field of cancer research. The first set of these “rerun” studies was recently released and yielded mixed results. The results of 2 out of 5 research studies were reproducible, one was not and two additional studies did not provide definitive answers.

But there’s at least one major obstacle to investing time and effort in this endeavor: the quest for novelty. The prestige of an academic journal depends at least partly on how often the research articles it publishes are cited. Thus, research journals often want to publish novel scientific findings which are more likely to be cited, not necessarily the results of newly rerun older research.

Genetics researcher Barak Cohen at Washington University in St. Louis recently published a commentary analyzing this growing push for novelty. He suggests that progress in science depends on a delicate balance between novelty and checking the work of other scientists. When rewards such as funding of grants or publication in prestigious journals emphasize novelty at the expense of testing previously published results, science risks developing cracks in its foundation.

One of his main concerns is that scientific papers now inflate their claims in order to emphasize their novelty and the relevance of biomedical research for clinical applications. By exchanging depth of research for breadth of claims, researchers may be at risk of compromising the robustness of the work. By claiming excessive novelty and impact, researchers may undermine its actual significance because they may fail to provide solid evidence for each claim.

Prestigious journals often now demand complete scientific stories, from basic molecular mechanisms to proving their relevance in various animal models. Unexplained results or unanswered questions are seen as weaknesses. Instead of publishing one exciting novel finding that is robust, and which could spawn a new direction of research conducted by other groups, researchers now spend years gathering a whole string of findings with broad claims about novelty and impact.

Balancing fresh findings and robustness

A challenge for editors and reviewers of scientific manuscripts is assessing the novelty and likely long-term impact of the work they’re assessing. The eventual importance of a new, unique scientific idea is sometimes difficult to recognize even by peers who are grounded in existing knowledge. Many basic research studies form the basis of future practical applications. One recent study found that of basic research articles that received at least one citation, 80 percent were eventually cited by a patent application. But it takes time for practical significance to come to light.

A collaborative team of economics researchers recently developed an unusual measure of scientific novelty by carefully studying the references of a paper. They ranked a scientific paper as more novel if it cited a diverse combination of journals. For example, a scientific article citing a botany journal, an economics journal and a physics journal would be considered very novel if no other article had cited this combination of varied references before.

This measure of novelty allowed them to identify papers which were more likely to be cited in the long run. But it took roughly four years for these novel papers to start showing their greater impact. One may disagree with this particular indicator of novelty, but the study makes an important point: It takes time to recognize the full impact of novel findings.

Realizing how difficult it is to assess novelty should give funding agencies, journal editors and scientists pause. Progress in science depends on new discoveries and following unexplored paths – but solid, reproducible research requires an equal emphasis on the robustness of the work. By restoring the balance between demands and rewards for novelty and robustness, science will achieve even greater progress.

Murder your darlings. The British writer Sir Arthur Quiller Crouch shared this piece of writerly wisdom when he gave his inaugural lecture series at Cambridge, asking writers to consider deleting words, phrases or even paragraphs that are especially dear to them. The minute writers fall in love with what they write, they are bound to lose their objectivity and may not be able to judge how their choice of words will be perceived by the reader. But writers aren’t the only ones who can fall prey to the Pygmalion syndrome. Scientists often find themselves in a similar situation when they develop “pet” or “darling” hypotheses.

How do scientists decide when it is time to murder their darling hypotheses? The simple answer is that scientists ought to give up scientific hypotheses once the experimental data is unable to support them, no matter how “darling” they are. However, the problem with scientific hypotheses is that they aren’t just generated based on subjective whims. A scientific hypothesis is usually put forward after analyzing substantial amounts of experimental data. The better a hypothesis is at explaining the existing data, the more “darling” it becomes. Therefore, scientists are reluctant to discard a hypothesis because of just one piece of experimental data that contradicts it.

In addition to experimental data, a number of additional factors can also play a major role in determining whether scientists will either discard or uphold their darling scientific hypotheses. Some scientific careers are built on specific scientific hypotheses which set apart certain scientists from competing rival groups. Research grants, which are essential to the survival of a scientific laboratory by providing salary funds for the senior researchers as well as the junior trainees and research staff, are written in a hypothesis-focused manner, outlining experiments that will lead to the acceptance or rejection of selected scientific hypotheses. Well written research grants always consider the possibility that the core hypothesis may be rejected based on the future experimental data. But if the hypothesis has to be rejected then the scientist has to explain the discrepancies between the preferred hypothesis that is now falling in disrepute and all the preliminary data that had led her to formulate the initial hypothesis. Such discrepancies could endanger the renewal of the grant funding and the future of the laboratory. Last but not least, it is very difficult to publish a scholarly paper describing a rejected scientific hypothesis without providing an in-depth mechanistic explanation for why the hypothesis was wrong and proposing alternate hypotheses.

For example, it is quite reasonable for a cell biologist to formulate the hypothesis that protein A improves the survival of neurons by activating pathway X based on prior scientific studies which have shown that protein A is an activator of pathway X in neurons and other studies which prove that pathway X improves cell survival in skin cells. If the data supports the hypothesis, publishing this result is fairly straightforward because it conforms to the general expectations. However, if the data does not support this hypothesis then the scientist has to explain why. Is it because protein A did not activate pathway X in her experiments? Is it because in pathway X functions differently in neurons than in skin cells? Is it because neurons and skin cells have a different threshold for survival? Experimental results that do not conform to the predictions have the potential to uncover exciting new scientific mechanisms but chasing down these alternate explanations requires a lot of time and resources which are becoming increasingly scarce. Therefore, it shouldn’t come as a surprise that some scientists may consciously or subconsciously ignore selected pieces of experimental data which contradict their darling hypotheses.

Let us move from these hypothetical situations to the real world of laboratories. There is surprisingly little data on how and when scientists reject hypotheses, but John Fugelsang and Kevin Dunbar at Dartmouth conducted a rather unique study “Theory and data interactions of the scientific mind: Evidence from the molecular and the cognitive laboratory” in 2004 in which they researched researchers. They sat in at scientific laboratory meetings of three renowned molecular biology laboratories at carefully recorded how scientists presented their laboratory data and how they would handle results which contradicted their predictions based on their hypotheses and models.

In their final analysis, Fugelsang and Dunbar included 417 scientific results that were presented at the meetings of which roughly half (223 out of 417) were not consistent with the predictions. Only 12% of these inconsistencies lead to change of the scientific model (and thus a revision of hypotheses). In the vast majority of the cases, the laboratories decided to follow up the studies by repeating and modifying the experimental protocols, thinking that the fault did not lie with the hypotheses but instead with the manner how the experiment was conducted. In the follow up experiments, 84 of the inconsistent findings could be replicated and this in turn resulted in a gradual modification of the underlying models and hypotheses in the majority of the cases. However, even when the inconsistent results were replicated, only 61% of the models were revised which means that 39% of the cases did not lead to any significant changes.

The study did not provide much information on the long-term fate of the hypotheses and models and we obviously cannot generalize the results of three molecular biology laboratory meetings at one university to the whole scientific enterprise. Also, Fugelsang and Dunbar’s study did not have a large enough sample size to clearly identify the reasons why some scientists were willing to revise their models and others weren’t. Was it because of varying complexity of experiments and models? Was it because of the approach of the individuals who conducted the experiments or the laboratory heads? I wish there were more studies like this because it would help us understand the scientific process better and maybe improve the quality of scientific research if we learned how different scientists handle inconsistent results.

In my own experience, I have also struggled with results which defied my scientific hypotheses. In 2002, we found that stem cells in human fat tissue could help grow new blood vessels. Yes, you could obtain fat from a liposuction performed by a plastic surgeon and inject these fat-derived stem cells into animal models of low blood flow in the legs. Within a week or two, the injected cells helped restore the blood flow to near normal levels! The simplest hypothesis was that the stem cells converted into endothelial cells, the cell type which forms the lining of blood vessels. However, after several months of experiments, I found no consistent evidence of fat-derived stem cells transforming into endothelial cells. We ended up publishing a paper which proposed an alternative explanation that the stem cells were releasing growth factors that helped grow blood vessels. But this explanation was not as satisfying as I had hoped. It did not account for the fact that the stem cells had aligned themselves alongside blood vessel structures and behaved like blood vessel cells.

Even though I “murdered” my darling hypothesis of fat –derived stem cells converting into blood vessel endothelial cells at the time, I did not “bury” the hypothesis. It kept ruminating in the back of my mind until roughly one decade later when we were again studying how stem cells were improving blood vessel growth. The difference was that this time, I had access to a live-imaging confocal laser microscope which allowed us to take images of cells labeled with red and green fluorescent dyes over long periods of time. Below, you can see a video of human bone marrow mesenchymal stem cells (labeled green) and human endothelial cells (labeled red) observed with the microscope overnight. The short movie compresses images obtained throughout the night and shows that the stem cells indeed do not convert into endothelial cells. Instead, they form a scaffold and guide the endothelial cells (red) by allowing them to move alongside the green scaffold and thus construct their network. This work was published in 2013 in the Journal of Molecular and Cellular Cardiology, roughly a decade after I had been forced to give up on the initial hypothesis. Back in 2002, I had assumed that the stem cells were turning into blood vessel endothelial cells because they aligned themselves in blood vessel like structures. I had never considered the possibility that they were scaffold for the endothelial cells.

This and other similar experiences have lead me to reformulate the “murder your darlings” commandment to “murder your darling hypotheses but do not bury them”. Instead of repeatedly trying to defend scientific hypotheses that cannot be supported by emerging experimental data, it is better to give up on them. But this does not mean that we should forget and bury those initial hypotheses. With newer technologies, resources or collaborations, we may find ways to explain inconsistent results years later that were not previously available to us. This is why I regularly peruse my cemetery of dead hypotheses on my hard drive to see if there are ways of perhaps resurrecting them, not in their original form but in a modification that I am now able to test.

The family of cholesterol lowering drugs known as ‘statins’ are among the most widely prescribed medications for patients with cardiovascular disease. Large-scale clinical studies have repeatedly shown that statins can significantly lower cholesterol levels and the risk of future heart attacks, especially in patients who have already been diagnosed with cardiovascular disease. A more contentious issue is the use of statins in individuals who have no history of heart attacks, strokes or blockages in their blood vessels. Instead of waiting for the first major manifestation of cardiovascular disease, should one start statin therapy early on to prevent cardiovascular disease?

If statins were free of charge and had no side effects whatsoever, the answer would be rather straightforward: Go ahead and use them as soon as possible. However, like all medications, statins come at a price. There is the financial cost to the patient or their insurance to pay for the medications, and there is a health cost to the patients who experience potential side effects. The Guideline Panel of the American College of Cardiology (ACC) and the American Heart Association (AHA) therefore recently recommended that the preventive use of statins in individuals without known cardiovascular disease should be based on personalized risk calculations. If the risk of developing disease within the next 10 years is greater than 7.5%, then the benefits of statin therapy outweigh its risks and the treatment should be initiated. The panel also indicated that if the 10-year risk of cardiovascular disease is greater than 5%, then physicians should consider prescribing statins, but should bear in mind that the scientific evidence for this recommendation was not as strong as that for higher-risk individuals.

The recommendation that individuals with comparatively low risk of developing future cardiovascular disease (10-year risk lower than 10%) would benefit from statins was met skepticism by some medical experts. In October 2013, the British Medical Journal (BMJ)published a paper by John Abramson, a lecturer at Harvard Medical School, and his colleagues which re-evaluated the data from a prior study on statin benefits in patients with less than 10% cardiovascular disease risk over 10 years. Abramson and colleagues concluded that the statin benefits were over-stated and that statin therapy should not be expanded to include this group of individuals. To further bolster their case, Abramson and colleagues also cited a 2013 study by Huabing Zhang and colleagues in the Annals of Internal Medicine which (according to Abramson et al.) had reported that 18 % of patients discontinued statins due to side effects. Abramson even highlighted the finding from the Zhang study by including it as one of four bullet points summarizing the key take-home messages of his article.

The problem with this characterization of the Zhang study is that it ignored all the caveats that Zhang and colleagues had mentioned when discussing their findings. The Zhang study was based on the retrospective review of patient charts and did not establish a true cause-and-effect relationship between the discontinuation of the statins and actual side effects of statins. Patients may stop taking medications for many reasons, but this does not necessarily mean that it is due to side effects from the medication. According to the Zhang paper, 17.4% of patients in their observational retrospective study had reported a “statin related incident” and of those only 59% had stopped the medication. The fraction of patients discontinuing statins due to suspected side effects was at most 9-10% instead of the 18% cited by Abramson. But as Zhang pointed out, their study did not include a placebo control group. Trials with placebo groups document similar rates of “side effects” in patients taking statins and those taking placebos, suggesting that only a small minority of perceived side effects are truly caused by the chemical compounds in statin drugs.

Admitting errors is only the first step

Whether 18%, 9% or a far smaller proportion of patients experience significant medication side effects is no small matter because the analysis could affect millions of patients currently being treated with statins. A gross overestimation of statin side effects could prompt physicians to prematurely discontinue medications that have been shown to significantly reduce the risk of heart attacks in a wide range of patients. On the other hand, severely underestimating statin side effects could result in the discounting of important symptoms and the suffering of patients. Abramson’s misinterpretation of statin side effect data was pointed out by readers of the BMJ soon after the article published, and it prompted an inquiry by the journal. After re-evaluating the data and discussing the issue with Abramson and colleagues, the journal issued a correction in which it clarified the misrepresentation of the Zhang paper.

Fiona Godlee, the editor-in-chief of the BMJ also wrote an editorial explaining the decision to issue a correction regarding the question of side effects and that there was not sufficient cause to retract the whole paper since the other points made by Abramson and colleagues – the lack of benefit in low risk patients – might still hold true. Instead, Godlee recognized the inherent bias of a journal’s editor when it comes to deciding on whether or not to retract a paper. Every retraction of a peer reviewed scholarly paper is somewhat of an embarrassment to the authors of the paper as well as the journal because it suggests that the peer review process failed to identify one or more major flaws. In a commendable move, the journal appointed a multidisciplinary review panel which includes leading cardiovascular epidemiologists. This panel will review the Abramson paper as well as another BMJ paper which had also cited the inaccurately high frequency of statin side effects, investigate the peer review process that failed to identify the erroneous claims and provide recommendations regarding the ultimate fate of the papers.

Reviewing peer review

Why didn’t the peer reviewers who evaluated Abramson’s article catch the error prior to its publication? We can only speculate as to why such a major error was not identified by the peer reviewers. One has to bear in mind that “peer review” for academic research journals is just that – a review. In most cases, peer reviewers do not have access to the original data and cannot check the veracity or replicability of analyses and experiments. For most journals, peer review is conducted on a voluntary (unpaid) basis by two to four expert reviewers who routinely spend multiple hours analyzing the appropriateness of the experimental design, methods, presentation of results and conclusions of a submitted manuscript. The reviewers operate under the assumption that the authors of the manuscript are professional and honest in terms of how they present the data and describe their scientific methodology.

In the case of Abramson and colleagues, the correction issued by the BMJ refers not to Abramson’s own analysis but to the misreading of another group’s research. Biomedical research papers often cite 30 or 40 studies, and it is unrealistic to expect that peer reviewers read all the cited papers and ensure that they are being properly cited and interpreted. If this were the expectation, few peer reviewers would agree to serve as volunteer reviewers since they would have hardly any time left to conduct their own research. However, in this particular case, most peer reviewers familiar with statins and the controversies surrounding their side effects should have expressed concerns regarding the extraordinarily high figure of 18% cited by Abramson and colleagues. Hopefully, the review panel will identify the reasons for the failure of BMJ’s peer review system and point out ways to improve it.

It is difficult to obtain precise numbers to quantify the actual extent of severe research misconduct and fraud since it may go undetected. Even when such cases are brought to the attention of the academic leadership, the involved committees and administrators may decide to keep their findings confidential and not disclose them to the public. However, most researchers working in academic research environments would probably agree that these are rare occurrences. A far more likely source of errors in research is the cognitive bias of the researchers. Researchers who believe in certain hypotheses and ideas are prone to interpreting data in a manner most likely to support their preconceived notions. For example, it is likely that a researcher opposed to statin usage will interpret data on side effects of statins differently than a researcher who supports statin usage. While Abramson may have been biased in the interpretation of the data generated by Zhang and colleagues, the field of cardiovascular regeneration is currently grappling in what appears to be a case of biased interpretation of one’s own data. An institutional review by Harvard Medical School and Brigham and Women’s Hospital recently determined that the work of Piero Anversa, one of the world’s most widely cited stem cell researchers, was significantly compromised and warranted a retraction. His group had reported that the adult human heart exhibited an amazing regenerative potential, suggesting that roughly every 8 to 9 years the adult human heart replaces its entire collective of beating heart cells (a 7% – 19% yearly turnover of beating heart cells). These findings were in sharp contrast to a prior study which had found only a minimal turnover of beating heart cells (1% or less per year) in adult humans. Anversa’s finding was also at odds with the observations of clinical cardiologists who rarely observe a near-miraculous recovery of heart function in patients with severe heart disease. One possible explanation for the huge discrepancy between the prior research and Anversa’s studies was that Anversa and his colleagues had not taken into account the possibility of contaminations that could have falsely elevated the cell regeneration counts.

Improving the quality of research: peer review and more

Despite the fact that researchers are prone to make errors due to inherent biases does not mean we should simply throw our hands up in the air, say “Mistakes happen!” and let matters rest. High quality science is characterized by its willingness to correct itself, and this includes improving methods to detect and correct scientific errors early on so that we can limit their detrimental impact. The realization that lack of reproducibility of peer-reviewed scientific papers is becoming a major problem for many areas of research such as psychology, stem cell research and cancer biology has prompted calls for better ways to track reproducibility and errors in science.

One important new paradigm that is being discussed to improve the quality of scholar papers is the role of post-publication peer evaluation. Instead of viewing the publication of a peer-reviewed research paper as an endpoint, post publication peer evaluation invites fellow scientists to continue commenting on the quality and accuracy of the published research even after its publication and to engage the authors in this process. Traditional peer review relies on just a handful of reviewers who decide about the fate of a manuscript, but post publication peer evaluation opens up the debate to hundreds or even thousands of readers which may be able to detect errors that could not be identified by the small number of traditional peer reviewers prior to publication. It is also becoming apparent that science journalists and science writers can play an important role in the post-publication evaluation of published research papers by investigating and communicating research flaws identified in research papers. In addition to helping dismantle the Science Mystique, critical science journalism can help ensure that corrections, retractions or other major concerns about the validity of scientific findings are communicated to a broad non-specialist audience.

In addition to these ongoing efforts to reduce errors in science by improving the evaluation of scientific papers, it may also be useful to consider new pro-active initiatives which focus on how researchers perform and design experiments. As the head of a research group at an American university, I have to take mandatory courses (in some cases on an annual basis) informing me about laboratory hazards, ethics of animal experimentation or the ethics of how to conduct human studies. However, there are no mandatory courses helping us identify our own research biases or how to minimize their impact on the interpretation of our data. There is an underlying assumption that if you are no longer a trainee, you probably know how to perform and interpret scientific experiments. I would argue that it does not hurt to remind scientists regularly – no matter how junior or senior- that they can become victims of their biases. We have to learn to continuously re-evaluate how we conduct science and to be humble enough to listen to our colleagues, especially when they disagree with us.

Since Shinya Yamanaka’s landmark discovery that adult skin cells could be reprogrammed into embryonic-like induced pluripotent stem cells (iPSCs) by introducing selected embryonic genes into adult cells, laboratories all over the world have been using modifications of the “Yamanaka method” to create their own stem cell lines. The original Yamanaka method published in 2006 used a virus which integrated into the genome of the adult cell to introduce the necessary genes. Any introduction of genetic material into a cell carries the risk of causing genetic aberrancies that could lead to complications, especially if the newly generated stem cells are intended for therapeutic usage in patients.

Researchers have therefore tried to modify the “Yamanaka method” and reduce the risk of genetic aberrations by either using genetic tools to remove the introduced genes once the cells are fully reprogrammed to a stem cell state, introducing genes without non-integrating viruses or by using complex cocktails of chemicals and growth factors in order to generate stem cells without the introduction of any genes into the adult cells.

The papers by Obokata and colleagues at the RIKEN center in Kobe, Japan use a far more simple method to reprogram adult cells. Instead of introducing foreign genes, they suggest that one can expose adult mouse cells to a severe stress such as an acidic solution. The cells which survive acid-dipping adventure (25 minutes in a solution with pH 5.7) activate their endogenous dormant embryonic genes by an unknown mechanism. The researchers then show that these activated cells take on properties of embryonic stem cells or iPSCs if they are maintained in a stem cell culture medium and treated with the necessary growth factors. Once the cells reach the stem cell state, they can then be converted into cells of any desired tissue, both in a culture dish as well as in a developing mouse embryo. Many of the experiments in the papers were performed by starting out with adult mouse lymphocytes, but the researchers also found that mouse skin fibroblasts and other cells could also be successfully converted into an embryonic-like state using the acid stress.

My first reaction was incredulity. How could such a simple and yet noxious stress such as exposing cells to acid be sufficient to initiate a complex “stemness” program? Research labs have spent years fine-tuning the introduction of the embryonic genes, trying to figure out the optimal combination of genes and timing of when the genes are essential during the reprogramming process. These two papers propose that the whole business of introducing stem cell genes into adult cells was unnecessary – All You Need Is Acid.

This sounds too good to be true. The recent history in stem cell research has taught us that we need to be skeptical. Some of the most widely cited stem cell papers cannot be replicated. This problem is not unique to stem cell research, because other biomedical research areas such as cancer biology are also struggling with issues of replicability, but the high scientific impact of burgeoning stem cell research has forced its replicability issues into the limelight. Nowadays, whenever stem cell researchers hear about a ground-breaking new stem cell discovery, they often tend to respond with some degree of skepticism until multiple independent laboratories can confirm the results.

My second reaction was that I really liked the idea. Maybe we had never tried something as straightforward as an acid stress because we were too narrow-minded, always looking for complex ways to create stem cells instead of trying simple approaches. The stress-induction of stem cell behavior may also represent a regenerative mechanism that has been conserved by evolution. When our amphibian cousins regenerate limbs following an injury, adult tissue cells are also reprogrammed to a premature state by the stress of the injury before they start building a new limb.

The idea of stress-induced reprogramming of adult cells to an embryonic-like state also has a powerful poetic appeal, which inspired me to write the following haiku:

Just because the idea of acid-induced reprogramming is so attractive does not mean that it is scientifically accurate or replicable.

A number of concerns about potential scientific misconduct in the context of the two papers have been raised and it appears that the RIKEN center is investigating these concerns. Specifically, anonymous bloggers have pointed out irregularities in the figures of the papers and that some of the images may be duplicated. We will have to wait for the results of the investigation, but even if image errors or duplications are found, this does not necessarily mean that this was intentional misconduct or fraud. Assembling manuscripts with so many images is no easy task and unintentional errors do occur. These errors are probably far more common than we think. High profile papers undergo much more scrutiny than the average peer-reviewed paper, and this is probably why we tend to uncover them more readily in such papers. For example, image duplication errors were discovered in the 2013 Cell paper on human cloning, but many researchers agreed that the errors in the 2013 Cell paper were likely due to sloppiness during the assembly of the submitted manuscript and did not constitute intentional fraud.

Irrespective of the investigation into the irregularities of figures in the two Nature papers, the key question that stem cell researchers have to now address is whether the core findings of the Obokata papers are replicable. Can adult cells – lymphocytes, skin fibroblasts or other cells – be converted into embryonic-like stem cells by an acid stress? If yes, then this will make stem cell generation far easier and it will open up a whole new field of inquiry, leading to many new exciting questions. Do human cells also respond to acid stress in the same manner as the mouse cells? How does acid stress reprogram the adult cells? Is there an acid-stress signal that directly acts on stem cell transcription factors or does the stress merely activate global epigenetic switches? Are other stressors equally effective? Does this kind of reprogramming occur in our bodies in response to an injury such as low oxygen or inflammation because these kinds of injuries can transiently create an acidic environment in our tissues?

Researchers all around the world are currently attempting to test the effect of acid exposure on the activation of stem cell genes. Paul Knoepfler’s stem cell blog is currently soliciting input from researchers trying to replicate the work. Paul makes it very clear that this is an informal exchange of ideas so that researchers can learn from each other on a “real-time” basis. It is an opportunity to find out about how colleagues are progressing without having to wait for 6-12 months for the next big stem cell meeting or the publication of a paper confirming or denying the replication of acid-induced reprogramming. Posting one’s summary of results on a blog is not as rigorous as publishing a peer-reviewed paper with all the necessary methodological details, but it can at least provide some clues as to whether some or all of the results in the controversial Obokata papers can be replicated.

If the preliminary findings of multiple labs posted on the blog indicate that lymphocytes or skin cells begin to activate their stem cell gene signature after acid stress, then we at least know that this is a project which merits further investigation and researchers will be more willing to invest valuable time and resources to conduct additional replication experiments. On the other hand, if nearly all the researchers post negative results on the blog, then it is probably not a good investment of resources to spend the next year or so trying to replicate the results.

It does not hurt to have one’s paradigms or ideas challenged by new scientific papers as long as we realize that paradigm-challenging papers need to be replicated. The Nature papers must have undergone rigorous peer review before their publication, but scientific peer review does not involve checking replicability of the results. Peer reviewers focus on assessing the internal logic, experimental design, novelty, significance and validity of the conclusions based on the presented data. The crucial step of replicability testing occurs in the post-publication phase. The post-publication exchange of results on scientific blogs by independent research labs is an opportunity to crowd-source replicability testing and thus accelerate the scientific authentication process. Irrespective of whether or not the attempts to replicate acid-induced reprogramming succeed, the willingness of the stem cell community to engage in a dialogue using scientific blogs and evaluate replicability is an important step forward.

The cancer researchers Glenn Begley and Lee Ellis made a rather remarkable claim last year. In a commentary that analyzed the dearth of efficacious novel cancer therapies, they revealed that scientists at the biotechnology company Amgen were unable to replicate the vast majority of published pre-clinical research studies. Only 6 out of 53 landmark cancer studies could be replicated, a dismal success rate of 11%! The Amgen researchers had deliberately chosen highly innovative cancer research papers, hoping that these would form the scientific basis for future cancer therapies that they could develop. It should not come as a surprise that progress in developing new cancer treatments is so sluggish. New clinical treatments are often based on innovative scientific concepts derived from pre-clinical laboratory research. However, if the pre-clinical scientific experiments cannot be replicated, it would be folly to expect that clinical treatments based on these questionable scientific concepts would succeed.

Reproducibility of research findings is the cornerstone of science. Peer-reviewed scientific journals generally require that scientists conduct multiple repeat experiments and report the variability of their findings before publishing them. However, it is not uncommon for researchers to successfully repeat experiments and publish a paper, only to learn that colleagues at other institutions can’t replicate the findings. This does not necessarily indicate foul play. The reasons for the lack of reproducibility include intentional fraud and misconduct, yes, but more often it’s negligence, inadvertent errors, imperfectly designed experiments and the subliminal biases of the researchers or other uncontrollable variables.

Clinical studies, of new drugs, for example, are often plagued by the biological variability found in study participants. A group of patients in a trial may exhibit different responses to a new medication compared to patients enrolled in similar trials at different locations. In addition to genetic differences between patient populations, factors like differences in socioeconomic status, diet, access to healthcare, criteria used by referring physicians, standards of data analysis by researchers or the subjective nature of certain clinical outcomes – as well as many other uncharted variables – might all contribute to different results.

The claims of low reproducibility made by Begley and Ellis, however, did not refer to clinical cancer research but to pre-clinical science. Pre-clinical scientists attempt to reduce the degree of experimental variability by using well-defined animal models and standardized outcomes such as cell division, cell death, cell signaling or tumor growth. Without the variability inherent in patient populations, pre-clinical research variables should in theory be easier to control. The lack of reproducibility in pre-clinical cancer research has a significance that reaches far beyond just cancer research. Similar or comparable molecular and cellular experimental methods are also used in other areas of biological research, such as stem cell biology, neurobiology or cardiovascular biology. If only 11% of published landmark papers in cancer research are reproducible, it raises questions about how published papers in other areas of biological research fare.

Following the publication of Begley and Ellis’ commentary, cancer researchers wanted to know more details. Could they reveal the list of the irreproducible papers? How were the experiments at Amgen conducted to assess reproducibility? What constituted a successful replication? Were certain areas of cancer research or specific journals more prone to publishing irreproducible results? What was the cause of the poor reproducibility? Unfortunately, the Amgen scientists were bound by confidentiality agreements that they had entered into with the scientists whose work they attempted to replicate. They could not reveal which papers were irreproducible or specific details regarding the experiments, thus leaving the cancer research world in a state of uncertainty. If so much published cancer research cannot be replicated, how can the field progress?

Lee Ellis has now co-authored another paper to delve further into the question. In the study, published in the journal PLOS One, Ellis teamed up with colleagues at the renowned University of Texas MD Anderson Cancer Center to survey faculty members and trainees (PhD students and postdoctoral fellows) at the center. Only 15-17% of their colleagues responded to the anonymous survey, but the responses confirmed that reproducibility of papers in peer-reviewed scientific journals is a major problem. Two-thirds of the senior faculty respondents revealed they had been unable to replicate published findings, and the same was true for roughly half of the junior faculty members as well as trainees. Seventy-eight percent of the scientists had attempted to contact the authors of the original scientific paper to identify the problem, but only 38.5% received a helpful response. Nearly 44% of the researchers encountered difficulties when trying to publish findings that contradicted the results of previously published papers.

The list of scientific journals in which some of the irreproducible papers were published includes the the “elite” of scientific publications: The prestigious Nature tops the list with ten mentions, but one can also find Cancer Research (nine mentions), Cell (six mentions), PNAS (six mentions) and Science (three mentions).

Does this mean that these high-profile journals are the ones most likely to publish irreproducible results? Not necessarily. Researchers typically choose to replicate the work published in high-profile journals and use that as a foundation for new projects. Researchers at MD Anderson Cancer Center may not have been able to reproduce the results of ten cancer research papers published in Nature, but the survey did not provide any information regarding how many cancer research papers in Nature were successfully replicated.

The lack of data on successful replications is a major limitation of this survey. We know that more than half of all scientists responded “Yes” to the rather opaque question “Have you ever tried to reproduce a finding from a published paper and not been able to do so?”, but we do not know how often this occurred. Researchers who successfully replicated nine out of ten papers and researchers who failed to replicate four out of four published papers would have both responded “Yes.” Other limitations of this survey include that it does not list the specific irreproducible papers or clearly define what constitutes reproducibility. Published scientific papers represent years of work and can encompass five, ten or more distinct experiments. Does successful reproducibility require that every single experiment in a paper be replicated or just the major findings? What if similar trends are seen but the magnitude of effects is smaller than what was published in the original paper?

Does this mean that cancer research is facing a crisis? If only 11% of pre-clinical cancer research is reproducible, as originally proposed by Begley and Ellis, then it might be time to sound the alarm bells. But since we don’t know how exactly reproducibility was assessed, it is impossible to ascertain the extent of the problem. The word “crisis” also has a less sensationalist meaning: the time for a crucial decision. In that sense, cancer research and perhaps much of contemporary biological and medical research needs to face up to the current quality control “crisis.” Scientists need to wholeheartedly acknowledge that reproducibility is a major problem and crucial steps must be taken to track and improve the reproducibility of published scientific work.

First, scientists involved in biological and medical research need to foster a culture that encourages the evaluation of reproducibility and develop the necessary infrastructure. When scientists are unable to replicate results of published papers and contact the authors, the latter need to treat their colleagues with respect and work together to resolve the issue. Many academic psychologists have already recognized the importance of tracking reproducibility and initiated a large-scale collaborative effort to tackle the issue; the Harvard psychologists Joshua Hartshorne and Adena Schachner also recently proposed using a formal approach to track the reproducibility of research. Biological and medical scientists should consider adopting similar infrastructures for their research, because reproducibility is clearly not just a problem for psychology research.

Second, grant-funding agencies should provide adequate research funding for scientists to conduct replication studies. Currently, research grants are awarded to those who propose the most innovative experiments, but few — if any — funds are available for researchers who want to confirm or refute a published scientific paper. While innovation is obviously important, attempts to replicate published findings deserve recognition and funding because new work can only succeed if it is built on solid, reproducible scientific data.

In the U.S., it can take 1-2 years from when researchers submit a grant proposal to when they receive funding to conduct research. Funding agencies could consider an alternate approach, one that allows for rapid approval of small-budget grant proposals so that researchers can immediately start evaluating the reproducibility of recent breakthrough discoveries. Such funding for reproducibility testing could be provided to individual laboratories or teams of scientists such as the Reproducibility Initiative or the recent efforts of chemistry bloggers to document reproducibility.

Lastly, it is also important that scientific journals address the issue of reproducibility. One of the most common and also most heavily criticized metrics for the success of a scientific journal is its “impact factor,” an indicator of how often an average article published in the journal is cited. Even irreproducible scientific papers can be cited thousands of times and boost a journal’s “impact.”

If a system tracked the reproducibility of scientific papers, one could conceivably calculate a reproducibility score for any scientific journal. That way, a journal’s reputation would not only rest on the average number of citations but also on the reliability of the papers it publishes. Scientific journals should also consider supporting reproducibility initiatives by encouraging the publication of papers that attempted to replicate previous papers — as long as the reproducibility was tested in a rigorous fashion and independent of whether or not the replication attempts were successful.

There is no need to publish the 20th replication study that merely confirms what 19 previous studies have previously found, but publication of replication attempts is sorely needed before a consensus is reached regarding a scientific discovery. The journal PLOS One has partnered up with the Reproducibility Initiative to provide a forum for the publication of replication studies, but there is no reason why other journals should not follow.

While PLOS One publishes many excellent papers, current requirements for tenure and promotion at academic centers often require that researchers publish in certain pre-specified scientific journals, including those affiliated with certain professional societies and which carry prestige in a designated field of research. If these journals also encouraged the publication of replication attempts, more researchers would conduct them and contribute to the post-publication quality control of scientific literature.

The recent questions raised about the reproducibility of biological and medical research findings is forcing scientists to embark on a soul-searching mission. It is likely that this journey will shake up many long-held beliefs. But this reappraisal will ultimately lead to a more rigorous and reliable science.

Human fallibility not only affects how scientists interpret and present their data, but can also have a far-reaching impact on which scientific projects receive research funding or the publication of scientific results. When manuscripts are submitted to scientific journals or when grant proposal are submitted to funding agencies, they usually undergo a review by a panel of scientists who work in the same field and can ultimately decide whether or not a paper should be published or a grant funded. One would hope that these decisions are primarily based on the scientific merit of the manuscripts or the grant proposals, but anyone who has been involved in these forms of peer review knows that, unfortunately, personal connections or personal grudges can often be decisive factors.

Lack of scientific replicability, knowing about the uncertainties that come with new scientific knowledge, fraud and fudging, biases during peer review – these are all just some of the reasons why scientists rarely believe in the mystique of science. When I discuss this with acquaintances who are non-scientists, they sometimes ask me how I can love science if I have encountered these “ugly” aspects of science. My response is that I love science despite this “ugliness”, and perhaps even because of its “ugliness”. The fact that scientific knowledge is dynamic and ephemeral, the fact that we do not need to feel embarrassed about our ignorance and uncertainties, the fact that science is conducted by humans and is infused with human failings, these are all reasons to love science. When I think of science, I am reminded of the painting “Basket of Fruit” by Caravaggio, which is a still-life of a fruit bowl, but unlike other still-life paintings of fruit, Caravaggio showed discolored and decaying leaves and fruit. The beauty and ingenuity of Caravaggio’s painting lies in its ability to show fruit how it really is, not the idealized fruit baskets that other painters would so often depict.