What is your (p) value?

The March for Science united researchers all over the planet against populism and the misuse of scientific findings. Interestingly, one core aspect of everyday research life – the commitment to reproducibility – stands in contrast to what usually happens in politics. To a certain extent, populism is a way of convincing people to believe non-reproducible "facts" through rhetoric or unrealistic proposals. Although people never seem to learn from history, the conversation between the research community and society should continue to be based on trust and reproducibility rather than confrontation and confusion. The ability to reproduce objects is valued differently in different areas of human life: music is a champion of reproducibility in art, while reproducing a painting may be considered a criminal act, and in industry reproducibility is an absolute law.

What is the real value of reproducibility in science? There is no doubt that it is very high. However, there are many reasons why research experiments fail to reproduce. We assess reproducibility with the help of statistical hypothesis testing, and the p-values from these tests are a driving force behind all our major conclusions. Do we really understand what kind of 'devil' hides in this small detail? If you were to discuss it with a good statistician, you might be surprised to learn that our life-long trust in the 0.05 threshold for p-values is entirely artificial and may help to generate non-reproducible results. The basis for this confusion is relatively easy to understand when you examine experimental design more carefully, and I strongly recommend reading one of the articles that explains it (Colquhoun D. 2014).
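The problem Colquhoun describes can be made concrete with a small simulation. The sketch below is a hypothetical illustration, not taken from the cited article: it assumes that only 10% of tested hypotheses reflect a real effect, and that the experiments have roughly 80% power. Under those assumptions, a surprisingly large share of the "significant" findings at p < 0.05 are false positives, while a stricter threshold removes most of them.

```python
import math
import random

random.seed(1)

def z_test_p(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for a z-test of the sample mean against mu0,
    assuming a known standard deviation sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # Normal CDF via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def false_discovery_proportion(n_experiments=20000, n=16, effect=0.7,
                               prior_real=0.1, alpha=0.05):
    """Fraction of 'significant' findings that are false positives, when
    only `prior_real` of the tested hypotheses describe a real effect.
    All parameter values here are illustrative assumptions."""
    false_hits = true_hits = 0
    for _ in range(n_experiments):
        real = random.random() < prior_real
        mean = effect if real else 0.0
        sample = [random.gauss(mean, 1.0) for _ in range(n)]
        if z_test_p(sample) < alpha:
            if real:
                true_hits += 1
            else:
                false_hits += 1
    return false_hits / (false_hits + true_hits)

print(f"At p<0.05:  {false_discovery_proportion():.0%} of hits are false")
print(f"At p<0.001: {false_discovery_proportion(alpha=0.001):.0%} of hits are false")
```

With these assumed parameters, roughly a third of the p < 0.05 "discoveries" are false, whereas at p < 0.001 the proportion drops to a few percent – which is exactly why the 0.05 threshold alone is a shaky foundation for a conclusion.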

Are more stringent thresholds for p-values needed? Yes and no: if you base an entire conclusion on the 0.05 p-value threshold, there is a high risk of non-reproducible results; but when you have multiple independent sources of evidence, it may be safe. Conversely, if you want reproducible results, move to lower p-values in your statistical tests! Is this obvious? Yes, but it remains an absolute truth.

From the beginning, genome-wide association studies introduced very high standards for handling statistics that, at the time (2006–2007), were far stricter than those in other areas of biomedical research. Over time it became clear that this strategy was very successful in removing large chunks of noisy data. However, it may also have excluded important findings. In 2007 we published, almost simultaneously with the Wellcome Trust Case Control Consortium, our genome-wide association study of rheumatoid arthritis, describing an interesting hit on chromosome 9 that was not at all significant in the British study (Plenge RM et al. 2007). Now, 10 years later, it is clear that it was not the level of the statistical threshold but the design of our study that made this possible. Nevertheless, we used exactly the same statistical criteria to ensure the reproducibility of our results in later studies. To be honest, it was difficult for me to draw the line of genome-wide significance without testing the hits below it, in what we call the 'gray zone', for a possible relation to rheumatoid arthritis. It was so tempting to speculate about possible associations of many very plausible candidate genes! Today I must admit that doing so would inevitably have led to irreproducible results and a waste of time and resources.
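For readers outside the field, the line of genome-wide significance mentioned above is conventionally drawn at p < 5×10⁻⁸ – a Bonferroni correction of the familiar 0.05 cut-off for roughly one million effectively independent common variants. A minimal arithmetic sketch (the one-million figure is the standard rough assumption, not an exact count):

```python
# Bonferroni-corrected genome-wide significance threshold.
alpha = 0.05                       # the conventional per-test threshold
n_independent_tests = 1_000_000    # assumed effective number of independent variants
genome_wide_threshold = alpha / n_independent_tests
print(genome_wide_threshold)       # 5e-08
```

Anything below this line is declared a genome-wide significant hit; everything between 5×10⁻⁸ and 0.05 falls into the 'gray zone' discussed above.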

Where is the link? Why did I start with communication between the research community and society and end up discussing low p-values? The answer is that I want to emphasize that accountable and reproducible results create the basis for trust and efficient communication, which is not only important for your publication but will benefit us all. In short, more robust hypothesis formulation and data will generate increased trust in, and hopefully more support for, science!