This semester, I am taking my last required course for my PhD program, and after 20+ years of school, possibly my last class toward a degree ever (what?!). It’s a statistics course that gives an overview of using stats in a research context. In scientific research, it seems that we are forever in search of the infamous value p<0.05. We use this benchmark across fields, experiments, and diverse datasets. My question is, why do we care if something happens 1 out of every 20 times? And why does this value determine whether an outcome is relevant or not?

The history of this standard dates back to Ronald Fisher, an English statistician, evolutionary biologist, mathematician, geneticist, and eugenicist. He developed important techniques and concepts, some of which you may use daily: the ANOVA, F-distributions, Fisher’s method for meta-analyses, maximum likelihood estimation, permutation testing, and, yes, the p<0.05 standard for statistical significance. In addition to his contributions to statistics, Fisher also introduced the diverse concepts of allele dominance in genetics, heterozygote advantage, and the Sexy Son Hypothesis (look that one up!). Though many other brilliant mathematicians and statisticians also contributed to these concepts, Fisher’s work likely has the greatest influence on our modern research practices.

Fisher established the use of p<0.05 in his 1925 book, Statistical Methods for Research Workers (SMRW). And if you are interested in a little light reading, you can check it out here. If you don’t have the weekend to devour the entire text, this is the line where he introduces the infamous 0.05:

“The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant.
Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available. Small effects will still escape notice if the data are insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty.” – Ronald Fisher, 1925

I think Fisher actually meant for us to take p=0.05 as a convenient indicator: Look more here! There might be something interesting! [I like to picture little cells with faces, waving hi!…] Instead, many researchers use this benchmark to assess a study’s value to the field.
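Fisher’s arithmetic is easy to verify yourself. Here is a minimal sketch using only Python’s standard library (the code and numbers are my illustration, not from SMRW): the two-sided 5% cutoff on the normal curve really is 1.96 standard deviations, and rounding it up to "twice the standard deviation," as Fisher does, gives his "once in 22 trials."

```python
# Checking Fisher's numbers with the Python standard library.
from statistics import NormalDist

std_normal = NormalDist()  # the standard normal curve: mean 0, SD 1

# The two-sided critical value for P = 0.05: the point leaving 2.5% in each tail.
critical = std_normal.inv_cdf(0.975)
print(f"P = 0.05 cutoff: {critical:.3f} standard deviations")  # ~1.960, "or nearly 2"

# Fisher rounds up to exactly 2 SD. The chance of a deviation beyond
# 2 SD in either direction:
p_beyond_2sd = 2 * (1 - std_normal.cdf(2))
print(f"P(|deviation| > 2 SD) = {p_beyond_2sd:.4f}, about 1 in {1 / p_beyond_2sd:.0f}")
```

So the famous "1 in 20" becomes "once in 22 trials" simply because 2 is a rounder number than 1.96.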

For the last 90 years, science has seemed content to keep this tradition. Granted, the benchmark of alpha = 0.05 is not an inherent problem, except that in an effort to reach this standard, scientists forget to look at the data as a whole and often go “fishing” for statistical significance. To learn more about these issues (and the best way to use statistics), I asked professor and statistician Dr. Maria Norton a few questions. (And, if you want to learn more about Maria, check out our interview with her!)

Dr. Maria Norton, our statistician extraordinaire!

“As researchers, we are often so excited to run the final statistical model to test our hypotheses that it is tempting to forgo the "pick and shovel work" that must be undertaken before the exciting inferential tests are run. Less interesting, but still crucial, steps must be completed to familiarize yourself with your data before statistical analysis.”

I completely know what she means. Is the data normally distributed? Who cares, I just want to see if it “worked”!!

Importantly, Maria noted that, “There is a distinction that should be made between "statistical significance" and "practical significance" in any statistical analysis result, as well as the "effect size," a unit-less metric that reflects the strength of association between two variables. A good scientist will always consider all of these, because you can achieve statistical significance with an effect size that is too small to be of practical significance, simply as a result of having an enormous sample.”

On the flip side, she said that it is possible to have a clinically important effect size, but the study may have too few subjects to reach statistical significance. With these pitfalls in mind, Maria provided me with helpful guidelines for any scientist performing experiments and analyzing results. So take a breath, press pause on the t-test function (I know, it’s hard!), and take the following steps:
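You can see the "enormous sample" problem with a little arithmetic. This sketch uses toy numbers of my own choosing (not Maria’s) and the normal approximation to a two-sample test: the same tiny effect is "significant" with 50,000 subjects per group and nowhere close with 100.

```python
# With a big enough sample, a negligible effect still clears p < 0.05.
from math import sqrt
from statistics import NormalDist

def two_sample_p(effect_size, n_per_group):
    """Approximate two-sided p-value for a two-sample comparison of means,
    via the normal approximation (fine at large n)."""
    z = effect_size * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# An effect of 0.03 standard deviations -- far below even a "small" effect
# (Cohen's d = 0.2) -- with 50,000 subjects per group:
p_big = two_sample_p(effect_size=0.03, n_per_group=50_000)
print(f"d = 0.03, n = 50,000 per group: p = {p_big:.6f}")  # well under 0.05

# The identical effect with 100 subjects per group:
p_small = two_sample_p(effect_size=0.03, n_per_group=100)
print(f"d = 0.03, n = 100 per group:    p = {p_small:.3f}")  # nowhere near 0.05
```

The p-value changed by orders of magnitude; the effect, and its practical importance, did not change at all.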

1. Prior to beginning the study,

it is important to do a statistical power analysis, “…in order to determine the sample size you will need, in order to have a desired level of statistical power (e.g. usually you want this to be 80% or higher) to detect a given effect size (i.e. one that you deem of practical significance) to also be detected by your statistical procedure as being statistically significant.”
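A back-of-the-envelope version of that power analysis can be done in a few lines. This is a sketch under the normal approximation (dedicated tools such as G*Power or statsmodels refine it slightly with the t distribution); the effect size, alpha, and power values are the conventional defaults, not numbers from the post.

```python
# Approximate sample size per group for a two-sample comparison of means.
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Sample size per group needed to detect `effect_size` (Cohen's d)
    at a two-sided alpha with the given power (normal approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = nd.inv_cdf(power)           # 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# To detect a "medium" effect (d = 0.5) at 80% power:
print(n_per_group(0.5))   # about 63 per group
# A "small" effect (d = 0.2) needs far more subjects:
print(n_per_group(0.2))   # about 393 per group
```

Note how quickly the required sample grows as the effect you care about shrinks; this is exactly why the power analysis must come before the study, not after.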

2. Prior to the study and during the study,

“The old adage "GIGO" always holds--"garbage in....garbage out" meaning that unless you are very careful with your study design, measurement precision, and these other statistical issues we've been discussing, you could end up with a useless set of results.”

3. When you first obtain the data,

take time to familiarize yourself with it. This includes looking for outlier data points, examining the frequency distribution, and asking if there is any missing data.

For that odd data point that just confuses you: “The lone outlier data point might be your only representative data point of a more rare, but still possible and perhaps important finding.”

For the data that is clustered around two values: “Does it follow a normal distribution or does it maybe need a log transformation in order to correct for skewness?”

For the empty rows in your Excel spreadsheet: “A non-random pattern of missingness poses problems in the interpretation of the statistical parameter estimates you obtain, and the scientist must carefully decide what to do about it.” This is particularly true for surveys and clinical procedures.
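The three checks above can be a short first-look script. This is my own illustration on a made-up dataset (the 1.5 × IQR outlier rule is a common convention, not a law, and the threshold is my choice): count the missing values, flag the lone odd point, and measure skewness to see whether a log transformation might help.

```python
# A first-look pass over a small, invented dataset before any hypothesis test.
import math
import statistics

raw = [2.1, 2.4, 1.9, None, 2.8, 2.2, 14.0, 2.5, None, 2.3, 2.6, 2.0]

# 1. Missingness: how much is missing, and (offline) is the pattern random?
missing = sum(1 for v in raw if v is None)
values = [v for v in raw if v is not None]
print(f"{missing} of {len(raw)} values missing")

# 2. Outliers via the 1.5 * IQR rule:
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
print(f"possible outliers: {outliers}")  # 14.0 stands out: rare case, or data-entry error?

# 3. Skewness: a strong right tail may call for a log transformation.
def skewness(xs):
    m, s = statistics.mean(xs), statistics.stdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

print(f"skewness: {skewness(values):.2f}")
print(f"skewness after log: {skewness([math.log(v) for v in values]):.2f}")
```

None of this is inferential statistics; it is the "pick and shovel work" that tells you whether the inferential test you want to run is even appropriate.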


4. When you have your final results,

researchers commonly “…fail to consider alternative explanations for why our results turned out the way they did. This is a very useful discussion, whether the results turned out as expected or not.”

5. When sharing the study’s findings with others,

“…the scientist should always provide an effect size and the actual p-value…and let the reader decide for herself how excited to get about the result, since a result of p=.0469 and a result of p=.0101 are both "p<.05" but the former is more ‘borderline’.”
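Computing the effect size to report alongside that exact p-value is straightforward. A sketch with invented summary statistics (the means, SDs, and group sizes below are purely illustrative), using Cohen’s d and a normal approximation for the p-value:

```python
# Report both: the effect size AND the exact p-value.
from math import sqrt
from statistics import NormalDist

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

n1 = n2 = 40
d = cohens_d(mean1=5.2, mean2=4.8, sd1=1.0, sd2=1.1, n1=n1, n2=n2)

z = abs(d) * sqrt(n1 * n2 / (n1 + n2))  # normal approximation to the t-test
p = 2 * (1 - NormalDist().cdf(z))

print(f"d = {d:.2f}, p = {p:.3f}")  # let the reader judge, not just "n.s." or "p < .05"
```

Here the reader sees both how big the effect is and exactly how strong the evidence is, instead of a bare pass/fail verdict.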

Maria’s final words of advice may be the most essential: “A good scientist will spend some time thinking about her personal biases that affect what she believes about her research, her scientific theory, how the results should turn out, and what conclusions she should draw from her findings.”

My final take-away: less focus on reaching statistical significance and more focus on the process is essential for high-quality science. Your results may just surprise you.

I also love this "Mining for Gold" metaphor that Maria uses in her teaching: