Thoughts on null hypothesis significance testing in an SBI course

I teach simulation-based statistical inference methods (using R) in my 100-level Introduction to Data Science course. This course is the required first course for all Data Science minors, and a service course for numerous departments. I love teaching statistical inference this way because it reconnects me (and my students) with Fisher’s original ideas and methods, and expresses Tukey’s idea that we learn about populations by being in dialogue with data. In the context of this welcome return to the empirical framework through which we understand and teach statistical inference, I wonder why we still teach students null hypothesis significance testing (NHST) in the same old way. I expect we’re all aware of the vast literature accumulated over the past 40 years that is critical of NHST and its role in the reproducibility crises in many disciplines. I feel that an introductory statistics or data science course that embraces simulation-based inference (SBI) should also move away from teaching students conventional NHST methods for learning about populations.

Although I am still working out this larger project in my teaching, I offer these thoughts in the context of teaching inference in the two-independent-samples design with the mean difference statistic (M1 – M2) in an Introduction to Data Science class:

First, I help my students think about all the possible values of the mean difference, one of which is the parameter (μ1 – μ2), and how ridiculously implausible it would be for that parameter to be exactly 0. If the true mean difference isn’t 0, then the remaining possible values vary only in the sign and magnitude of the mean difference. This points our interest toward estimation and away from significance, and sets up the task of estimating the parameter rather than establishing that it (probably) isn’t 0. I’ll paraphrase one of my idols, Jacob Cohen (from his famous paper “The Earth Is Round (p < .05)”): null hypotheses are rarely true, so rejecting them is hardly surprising.
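Cohen’s point can be illustrated with a small simulation. The sketch below is only an illustration with made-up population values (a true difference of 0.5 on a scale whose standard deviation is 10, both chosen arbitrarily). It uses base R’s t.test() to compare the p-value from small samples with the p-value from very large samples drawn from the same two populations.

```r
# Illustration only: the population values below are invented for this sketch.
set.seed(1)

true_diff <- 0.5   # assumed tiny, practically unimportant true mean difference
pop_sd    <- 10    # assumed common population standard deviation

# Draw two independent samples of size n and return the two-sample t-test p-value.
p_value_for_n <- function(n) {
  group1 <- rnorm(n, mean = true_diff, sd = pop_sd)
  group2 <- rnorm(n, mean = 0,         sd = pop_sd)
  t.test(group1, group2)$p.value
}

p_small <- p_value_for_n(20)      # small samples: rejection depends mostly on luck
p_large <- p_value_for_n(50000)   # very large samples: the nil hypothesis is rejected

cat("p-value with n = 20 per group:   ", format.pval(p_small), "\n")
cat("p-value with n = 50000 per group:", format.pval(p_large), "\n")
```

With enough data the test rejects a difference of 0 even though the true difference is trivially small, which is why the size of the difference is the more informative question.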

Second, I help the students think about what kind of probability distribution we need to estimate the parameter. We talk about the importance of having (or assuming we have) data from a random sample for this task, what M1 – M2 might be if we had a different random sample from this population, and how those random differences in M1 – M2 are important to estimation. Using R, we generate the probability distribution under HA, not under the chance or null model used in NHST. Through resampling, students create a picture of the population of interest and its parameters. Once created, the distribution allows students to do inference through confidence interval estimates of the parameter using various methods (e.g., normal-theory, percentile).
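A minimal sketch of this resampling step in base R follows. It is one possible way to do it, not necessarily the code used in the course: the two samples are simulated stand-ins for real data, the number of resamples (5000) is an arbitrary choice, and all object names are hypothetical.

```r
# Illustration only: simulated stand-in data, not results from a real study.
set.seed(2024)
group1 <- rnorm(40, mean = 52, sd = 8)   # hypothetical sample 1
group2 <- rnorm(40, mean = 48, sd = 8)   # hypothetical sample 2

observed_diff <- mean(group1) - mean(group2)   # the statistic M1 - M2

# Bootstrap: resample each group with replacement (keeping group sizes fixed)
# and recompute M1 - M2 for every resample.
n_reps <- 5000
boot_diffs <- replicate(n_reps, {
  mean(sample(group1, replace = TRUE)) - mean(sample(group2, replace = TRUE))
})

# Percentile interval: the middle 95% of the resampled differences.
ci_percentile <- quantile(boot_diffs, probs = c(0.025, 0.975))

# Normal-theory interval: observed difference +/- z * bootstrap standard error.
se_boot   <- sd(boot_diffs)
ci_normal <- observed_diff + c(-1, 1) * qnorm(0.975) * se_boot

print(round(ci_percentile, 2))
print(round(ci_normal, 2))
```

Calling hist(boot_diffs) afterwards gives students the picture of the resampled mean differences, centered near the observed M1 – M2 rather than at 0.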

Third, students should see that the distribution described above is a probability distribution for testing all sorts of hypotheses including, if we’re interested, the null (or, as Cohen would say, nil-null) hypothesis in which μ1 – μ2 = 0. Students can find the probability of any hypothesized μ1 – μ2 occurring in this population. We can then point our students toward hypothesis testing that sets evidence thresholds with practically or clinically significant standards, rather than a standard of differing from 0. For example, if the data in our two-sample study evaluated the effect of exercise on blood pressure, we could get students to consider questions like: a) whether this effect was large enough to merit changing one’s behavior, b) what other behavioral interventions might be as effective or more effective, and c) how a replication of (or a failure to replicate) an outcome of this size in another sample would change our parameter estimate and/or our confidence in the estimate.
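One way to sketch this kind of threshold-based question in base R is shown below, using the blood pressure example. Everything specific here is an assumption made for illustration: the data are simulated, and the 5 mmHg threshold is a placeholder, not a clinical standard. The proportions are simply shares of resampled differences, so they are best read as a rough guide to how well the data support each hypothesized value.

```r
# Illustration only: simulated blood pressure data, not from a real study.
set.seed(42)
control  <- rnorm(50, mean = 130, sd = 12)   # hypothetical systolic BP, no exercise
exercise <- rnorm(50, mean = 124, sd = 12)   # hypothetical systolic BP, exercise

# Resampled mean differences (control minus exercise), so positive values
# represent a reduction in blood pressure associated with exercise.
boot_diffs <- replicate(5000, {
  mean(sample(control, replace = TRUE)) - mean(sample(exercise, replace = TRUE))
})

# Placeholder threshold for a "practically meaningful" reduction, in mmHg.
min_meaningful <- 5

# Share of resampled differences at or below 0 (the nil hypothesis region).
prop_at_or_below_zero <- mean(boot_diffs <= 0)

# Share of resampled differences at least as large as the practical threshold.
prop_practical <- mean(boot_diffs >= min_meaningful)

cat("Proportion of resampled differences <= 0:     ", prop_at_or_below_zero, "\n")
cat("Proportion of resampled differences >= 5 mmHg:", prop_practical, "\n")
```

Changing min_meaningful lets students see how the conclusion depends on the standard they set, which is a different conversation from whether the difference is distinguishable from 0.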

I’m not dumping on hypothesis testing. Inference by hypothesis testing is a valuable way to learn about a population of interest and how it differs from other populations. I’m just encouraging us to think about whether the formal, reflexive method of classical NHST fits within an SBI pedagogical framework. Cohen and many others have urged us to replace NHST with inferential tools such as parameter estimation, effect size estimation, replication, and meta-analysis: tools that help us learn much more about our population of interest.

About Bruce Blaine

Bruce Evan Blaine, PhD, PStat®
Professor, Statistics and Data Sciences
St. John Fisher College, Rochester NY
Bruce Blaine is an applied and consulting statistician, with interests in meta-analysis, nonparametric statistics, and the quantitative methods used in psychology and the behavioral sciences.