This breakthrough paper has a problem

The paper showed that 7 cases of Multiple Sclerosis in two families were all carriers of the same genetic variant, a mutation in NR1H3. The paper argued that this variant is the dominant causal variant for the disease in these cases.

Firstly, it is not at all unusual to find the same genetic variant in many related individuals in a family tree. What would be unusual is for the variant to be present only in the individuals with the disease (MS).

However, the paper also noted that the same two families contained a total of four healthy carriers. This suggests that if the variant is indeed causal, it has incomplete penetrance, which weakens the case for causality unless there is further supporting evidence.

An even harder blow to the evidence is that the same genetic variant was found in no fewer than 21 healthy carriers in the ExAC database! The evidence for a dominant variant now looks very weak…
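
To make the counts above concrete, here is a rough back-of-the-envelope sketch. It is not the paper's analysis: the helper name naive_penetrance is made up for illustration, and pooling family data with database carriers ignores how the families were selected, so treat the second number as a crude illustration only.

```python
from fractions import Fraction


def naive_penetrance(affected_carriers: int, unaffected_carriers: int) -> Fraction:
    """Share of observed carriers who have the disease (hypothetical helper)."""
    total = affected_carriers + unaffected_carriers
    if total == 0:
        raise ValueError("no carriers observed")
    return Fraction(affected_carriers, total)


# Counts taken from the text: 7 affected carriers and 4 healthy carriers
# in the two families, plus 21 healthy carriers reported in ExAC.
family = naive_penetrance(7, 4)
pooled = naive_penetrance(7, 4 + 21)  # crude pooling, an assumption of this sketch

print(f"families only:   {float(family):.2f}")
print(f"families + ExAC: {float(pooled):.2f}")
```

Even within the families alone, roughly a third of carriers are healthy, and every additional healthy carrier pulls the estimate down further.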

What went wrong?

To give the authors the benefit of the doubt: I cannot imagine that they would have continued to support their hypothesis and started planning expensive follow-up experiments if they had been aware of this data when they first began to analyse their results. My guess is that by the time they thought of looking up their variant in the public ExAC database, they were already so in love with their hypothesis and potential breakthrough discovery that they simply ignored the facts on the table.

Unfortunately, that is not good science.

How you should plan for your next breakthrough

Here are a few small tips that I would advise you to consider as you prepare your next big breakthrough.

ALWAYS look for publicly available reference data from healthy individuals to test (and hopefully validate) your hypothesis. For human genetics, you have two great resources in the ExAC database and the Reference Variant Store. Both are open access, online and easy to use. There are no excuses.
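
As a minimal sketch of what such a check could look like, the function below estimates how many carriers you would expect to see in a reference cohort if a variant really were causal. Everything here is an assumption for illustration: the function name expected_carriers, the simple relation it uses (carrier frequency = prevalence × fraction of cases carrying the variant ÷ penetrance), and the placeholder numbers in the example call, which are not taken from the paper or from any database.

```python
def expected_carriers(cohort_size: int,
                      prevalence: float,
                      fraction_of_cases: float,
                      penetrance: float) -> float:
    """Rough expected number of carriers in a reference cohort.

    Assumes P(carrier) * penetrance = prevalence * fraction_of_cases,
    i.e. every affected carrier is affected because of the variant.
    """
    if not 0 < penetrance <= 1:
        raise ValueError("penetrance must be in (0, 1]")
    carrier_frequency = prevalence * fraction_of_cases / penetrance
    return cohort_size * carrier_frequency


# Placeholder inputs, purely hypothetical: a cohort of 60,000 people,
# a disease affecting 1 in 1,000, a variant explaining 1% of cases,
# and full penetrance.
estimate = expected_carriers(60_000, 0.001, 0.01, 1.0)
print(f"expected carriers under these assumptions: {estimate:.1f}")
```

If the number of carriers you actually find in the reference data is far above such an estimate, either the penetrance is much lower than you hoped or the variant is not causal, and both possibilities deserve attention before you plan follow-up experiments.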

Look for data sources and collaborators with data on the same disease you are investigating, and check that your results also replicate in their datasets. The more specific your finding, the more specific the cross-check. If you are looking for specific genetic variants linked to a disease, you can use e.g. ClinVar, Cafe Variome and SNPedia. If you are looking for more complex signals in the data, or characteristics that are not yet well annotated, such as genome rearrangements or haplotypes, go find the raw data files in various repositories and data sources by searching on Repositive.

Look at your available data with an unbiased mind. It is good practice to also ask a colleague for an unbiased review of your conclusion. What does the data tell you?