Why do we make statistics so hard for our students?

If you’re like me, you’re continually frustrated by the fact that undergraduate students struggle to understand statistics. Actually, that’s putting it mildly: a large fraction of undergraduates simply refuse to understand statistics; mention a requirement for statistical data analysis in your course and you’ll get eye-rolling, groans, or (if it’s early enough in the semester) a rash of course-dropping.

This bothers me, because we can’t do inference in science without statistics*. Why are students so unreceptive to something so important? In unguarded moments, I’ve blamed it on the students themselves for having decided, a priori and in a self-fulfilling prophecy, that statistics is math, and they can’t do math. I’ve blamed it on high-school math teachers for making math dull. I’ve blamed it on high-school guidance counselors for telling students that if they don’t like math, they should become biology majors. I’ve blamed it on parents for allowing their kids to dislike math. I’ve even blamed it on the boogie**.

All these parties (except the boogie) are guilty. But I’ve come to understand that my list left out the most guilty party of all: us. By “us” I mean university faculty members who teach statistics – whether they’re in Departments of Mathematics, Departments of Statistics, or (gasp) Departments of Biology. We make statistics needlessly difficult for our students, and I don’t understand why.

The problem is captured in the image above – the formulas needed to calculate Welch’s t-test. They’re arithmetically a bit complicated, and they’re used in one particular situation: comparing two means when sample sizes and variances are unequal. If you want to compare three means, you need a different set of formulas; if you want to test for a non-zero slope, you need another set again; if you want to compare success rates in two binary trials, another set still; and so on. And each set of formulas works only given the correctness of its own particular set of assumptions about the data.

Given this, can we blame students for thinking statistics is complicated? No, we can’t; but we can blame ourselves for letting them think that it is. They think so because we consistently underemphasize the single most important thing about statistics: that this complication is an illusion. In fact, every significance test works exactly the same way.

Every significance test works exactly the same way. We should teach this first, teach it often, and teach it loudly; but we don’t. Instead, we make a huge mistake: we whiz by it and begin teaching test after test, bombarding students with derivations of test statistics and distributions and paying more attention to differences among tests than to their crucial, underlying identity. No wonder students resent statistics.

What do I mean by “every significance test works exactly the same way”? All (NHST) statistical tests respond to one problem with two simple steps.

The problem:

We see an apparent pattern, but we aren’t sure if we should believe it’s real, because our data are noisy.

The two steps:

Step 1. Measure the strength of pattern in our data.

Step 2. Ask ourselves, is this pattern strong enough to be believed?

Teaching the problem motivates the use of statistics in the first place (many math-taught courses, and nearly all biology-taught ones, do a good job of this). Teaching the two steps gives students the tools to test any hypothesis – understanding that it’s just a matter of choosing the right arithmetic for their particular data. This is where we seem to fall down.

Step 1, of course, is the test statistic. Our job is to find (or invent) a number that measures the strength of any given pattern. It’s not surprising that the details of computing such a number depend on the pattern we want to measure (difference in two means, slope of a line, whatever). But those details always involve the three things that we intuitively understand to be part of a pattern’s “strength” (illustrated below): the raw size of the apparent effect (in Welch’s t, the difference in the two sample means); the amount of noise in the data (in Welch’s t, the two sample standard deviations); and the amount of data in hand (in Welch’s t, the two sample sizes). You can see by inspection that these behave in Welch’s formulas just the way they should: t gets bigger if the means are farther apart, the samples are less noisy, and/or the sample sizes are larger. All the rest is uninteresting arithmetical detail.
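To make those three ingredients concrete, here is a minimal Python sketch of Welch's t and its approximate degrees of freedom (the Welch-Satterthwaite formula). The function name `welch_t` and the toy samples are illustrative assumptions, not anything taken from the original figure.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on samples a and b."""
    na, nb = len(a), len(b)
    # Noise discounted by the amount of data: sample variance over sample size.
    va, vb = variance(a) / na, variance(b) / nb
    # Raw effect size (difference in means) scaled by its standard error.
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical toy data, invented for illustration only.
a = [5.1, 4.9, 5.6, 5.8, 5.2]
b = [4.2, 4.8, 4.4, 4.1, 4.6, 4.3]
t, df = welch_t(a, b)

# Means farther apart: shifting one sample leaves its noise unchanged.
t_far, _ = welch_t([x + 1.0 for x in a], b)
# More data with roughly the same spread: repeat each sample three times.
t_more, _ = welch_t(a * 3, b * 3)
# Noisier data: double every deviation from the mean of sample a.
m = mean(a)
t_noisy, _ = welch_t([m + 2 * (x - m) for x in a], b)

print(t, df, t_far, t_more, t_noisy)
```

Comparing `t_far`, `t_more`, and `t_noisy` against `t` shows the statistic moving in the direction the text describes for each ingredient.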

Step 2 is the P-value. We have to obtain a P-value corresponding to our test statistic, which means knowing whether assumptions are met (so we can use a lookup table) or not (so we should use randomization or switch to a different test***). Every test uses a different table – but all the tables work the same way, so the differences are again just arithmetic. Interpreting the P-value once we have it is a snap, because it doesn’t matter what arithmetic we did along the way: the P-value for any test is the probability of a pattern as strong as ours (or stronger), in the absence of any true underlying effect. If this is low, we’d rather believe that our pattern arose from real biology than believe it arose from a staggering coincidence (Deborah Mayo explains the philosophy behind this here, or see her excellent blog).
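The randomization route to a P-value mentioned above can be sketched in a few lines of Python. This is one possible implementation under stated assumptions: a two-sided test on the difference in means, with the function name `permutation_p_value`, the shuffle count, the seed, and the toy data all chosen purely for illustration.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def permutation_p_value(a, b, n_shuffles=10_000, seed=0):
    """Two-sided randomization P-value for a difference in two means."""
    rng = random.Random(seed)
    # Step 1: measure the strength of the pattern in the real data.
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        # Shuffling group labels mimics a world with no true effect.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-9:
            hits += 1
    # Step 2: how often does chance alone match or beat our pattern?
    # The +1 counts the observed arrangement itself, so p is never exactly 0.
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical toy data, invented for illustration only.
a = [5.1, 4.9, 5.6, 5.8, 5.2]
b = [4.2, 4.8, 4.4, 4.1, 4.6, 4.3]
p_different = permutation_p_value(a, b)
p_identical = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])
print(p_different, p_identical)
```

Nothing in the loop depends on which statistic is computed in Step 1; swapping in a slope or a difference in proportions leaves Step 2 untouched, which is the point of the paragraph above.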

Of course, there are lots of details in the differences among tests. These matter, but they matter in a second-order way: until we understand the underlying identity of how every test works, there’s no point worrying about the differences. And even then, the differences are not things we need to remember; they’re things we need to know to look up when needed. That’s why if I know how to do one statistical test – any one statistical test – I know how to do all of them.

Does this mean I’m advocating teaching “cookbook” statistics? Yes, but only if we use the metaphor carefully and not pejoratively. A cookbook is of little use to someone who knows nothing at all about cooking; but if you know a handful of basic principles, a cookbook guides you through thousands of cooking situations, for different ingredients and different goals. All cooks own cookbooks; few memorize them.

So if we’re teaching statistics all wrong, here’s how to do it right: organize everything around the underlying identity. Start with it, spend lots of time on it, and illustrate it with one test (any test) worked through with detailed attention not to the computations, but to how that test takes us through the two steps. Don’t try to cover the “8 tests every undergraduate should know”; there’s no such list. Offer a statistical problem: some real data and a pattern, and ask the students how they might design a test to address that problem. There won’t be one right way, and even if there were, it would be less important than the exercise of thinking through the steps of the underlying identity.

Finally: why do instructors make statistics about the differences, not the underlying identity? I said I don’t know, but I can speculate.

When statistics is taught by mathematicians, I can see the temptation. In mathematical terms, the differences between tests are the interesting part. This is where mathematicians show their chops, and it’s where they do the difficult and important job of inventing new recipes to cook reliable results from new ingredients in new situations. Users of statistics, though, would be happy to stipulate that mathematicians have been clever, and that we’re all grateful to them, so we can get onto the job of doing the statistics we need to do.

When statistics is taught by biologists, the mystery is deeper. I think (I hope!) those of us who teach statistics all understand the underlying identity of all tests, but that doesn’t seem to stop us from the parade-of-tests approach. One hypothesis: we may be responding to pressure (perceived or real) from Mathematics departments, who can disapprove of statistics being taught outside their units and are quick to claim insufficient mathematical rigour when it is. Focus on lots of mathematical detail gives a veneer of apparent rigour. I’m not sure that my hypothesis is correct, but I’ve certainly been part of discussions with Math departments that were consistent with it.

Whatever the reasons, we’re doing real damage to our students when we make statistics complicated. It isn’t. Remember, every statistical test works exactly the same way. Teach a student that today.

Note: for a rather different take on the cookbook-stats metaphor, see Joan Strassmann’s interesting post here. I think I agree with her only in part, so you should read her piece too.

52 thoughts on “Why do we make statistics so hard for our students?”

Interesting post. I’ll have to re-read and digest and ponder (after lecture) but I know why I didn’t like statistics in college – the examples were incredibly boring. My stats teacher used coin flips almost every single time he was explaining a concept. Yawn. Let’s get mathematicians and biologists (or psychologists or sociologists or…) together to work out better examples and we may get more student interest.

This is exactly how I teach stats to undergrads – and they don’t believe me when I tell them the underlying principles are simple. I think this is because they have already been convinced by the usual approaches that stats are terrifying and incomprehensible. Usually, I manage to convince them eventually – chi-square tests on HWE with multiple alleles ad nauseam do the trick. Interestingly, the stats courses I have taken from statisticians were easier to understand than those taught by biologists. I also chaired an MSc stats thesis recently and it was quite revealing. My experience with statisticians and mathematicians has been that they are far less respectful of the “rules” than biologists are. Indeed, one of the statisticians at the thesis defence asked me if biologists are still “obsessed” (his words) with the normal distribution, and all four of them laughed derisively when I said ‘yes’. As many statisticians have pointed out to me over the years, with large sample sizes (even n>30 according to my favourite myrmecostatistician), the Central Limit Theorem basically demonstrates that parametric stats are fine even when the data are not “properly” distributed. I have been laughed at many times by statisticians who don’t understand why we get so obsessed with using the right stats. But try telling that to most ecologists, especially the ones reviewing your papers. They talk about stats exactly the way they teach them – by quoting the rules.

Miriam – yes, my experience about the “rules” matches yours. I’ve also done a LOT of randomization-based stats, and I have never ever, not even once, seen a case where just ignoring distributional assumptions and deploying the parametric test would have led me astray! That’s another way, I guess, in which we make stats too complicated…

While obsessing over normality can be problematic, being led astray by ignoring the underlying nature of the data is always a problem. A primary problem can be in estimation. While many inferential tests, such as ANOVA and its kin, are robust against non-normality, the estimated effects can be severely compromised. So it depends somewhat on your objective. If you are only interested in “significant” differences, then you may get away with this, but be wary if your goal is to estimate or predict an effect. A simple example is a binomial proportion, yes or no, positive or negative, etc. Normal approximations can work well, for both testing and estimation, in circumstances where the proportion is in the vicinity of 0.5, but will often fail at more extreme values close to 0.0 or 1.0. Confidence intervals in these regions can even violate the nominal real-world biological constraints of the topic by exceeding 1.0 or falling below 0.0. Researchers now have excellent tools available to them through generalized linear and nonlinear mixed models, Bayesian analyses, and similar maximum-likelihood-based techniques that inherently take advantage of distributional assumptions outside of normality. What’s more, they can also inherently deal with other traditional constraining assumptions such as a lack of independence and heterogeneity. A good resource here for generalized linear mixed models with examples in SAS and R is the Agronomy Journal article by Dr. W. Stroup: (https://dl.sciencesocieties.org/publications/aj/abstracts/107/2/811). Students need to be led in this direction, not to ad hoc workarounds developed nearly a century ago.
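The binomial example is easy to check numerically. The sketch below uses the standard normal-approximation (Wald) interval for a proportion. The function name `wald_interval` and the counts (out of 20 trials) are assumptions chosen only to illustrate the point.

```python
import math

def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) ~95% confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Hypothetical counts out of 20 trials, invented for illustration only.
mid_lo, mid_hi = wald_interval(10, 20)    # proportion near 0.5
low_lo, low_hi = wald_interval(1, 20)     # proportion near 0.0
high_lo, high_hi = wald_interval(19, 20)  # proportion near 1.0

print((mid_lo, mid_hi), (low_lo, low_hi), (high_lo, high_hi))
```

With the proportion at 0.5 the interval sits comfortably inside the unit range; with 1 or 19 successes out of 20 one end of the interval falls outside it, which is the impossible-value problem described above.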

Another, albeit somewhat denser and longer, good read on inference and statistics is Hurlbert and Lombardi: Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian (http://www.sekj.org/PDF/anzf46/anzf46-311.pdf). Although I disagree with some of their points regarding Bayesian analyses, it is a good, thought-provoking article.

For what it is worth, I do agree with your basic premise on teaching, although in 35+ years of research and statistics, I would say from that experience that the people not doing this probably don’t really understand the topic they are teaching and, hence, dogmatically concentrate on rote formulas and equations.

Thanks, Bill – I was a bit cavalier about assumptions, and I take your point about significance vs. estimation. Also: many thanks for posting the Hurlbert and Lombardi link. I had not seen that paper, and I have a Fisher vs. Neyman-Pearson post coming up. I hope I won’t have to completely rewrite it after seeing what Hurlbert and Lombardi have to say!

Interesting approach. I use something related in my graduate statistics course. The first two weeks are a review of basic tests. I tell the students that each test statistic combines two components, a measure of effect size and a measure of confidence in the data. For each test I put up the equation with the two parts highlighted.

I do not expect them to memorize the equation, but I have been known to give them an equation on an exam and ask them to tell me what part of the equation represents the effect size…

What you describe is how I’ve come to introduce statistical inference of late. But I have rarely if ever taught the battery of statistical tests (admittedly I’ve never taught Intro Bio Stats to a hundred green undergraduates either). I could count the times I’ve needed a t-test (in real life as opposed to teaching) on the fingers of half my left hand. The *only* time I’ve needed one of those Chi-square tests (of a two-way table) or one of those weirdly-named non-parametric alternatives is in the Geographical Methods courses I had to take when I was a first year undergraduate in the UK. It baffles me that we waste time teaching these tests in their own right at all.

I recently taught an online stats class for a public health masters. It was a graduate class but may as well have been 2nd year bio stats for all that the students knew of stats when we started. The only test I mentioned was the t-test and only then as a means of teaching them about frequentist inference (not that I called it that). I used the t-test because it is conceptually simple – I didn’t even bother with Welch’s variant as the particular test wasn’t the point. I could have done with your nice figure though (the concept for which I’m now going to steal! [with appropriate attribution of course] 🙂

After that we moved straight on to linear models and from there GLMs, basic survival analysis, and very basic mixed modelling. Throughout the emphasis was on thinking about how the data arose (what was a plausible generating process), what were properties of the data we considered, and how that translated into the fitting of a model in software.

Perhaps I’m biased by the kinds of problems I have come across in my own research and the problems colleagues/students have brought to me over the past decade and a half, but investing in understanding more modern statistical approaches (linear models [rather than t-tests and the bestiary of ANOVA tables and sums of squares, the diversity of which is just plain confusing] and GLMs, likelihood ratio testing, etc.) would be far more useful than learning tests.

A couple of points: i) whilst students don’t need to know the details of the tests, they certainly need to know the assumptions of those tests, and which are OK to violate and which aren’t. This goes for any statistical method. Below grad school level, I rarely bother with equations and just teach the idea behind a particular method and what the assumptions are. The computer does the math for you, so why should we bother with exactly how the numbers were generated (invariably they weren’t arrived at via the standard equations anyway)? Better that students understand what a method is used for, what the assumptions are, and practically how to do it.

ii) Whilst the CLT might help with our predilection for normality (why do people assume their data has to be normal?), it’s not going to help us in many situations where assuming (conditional) Gaussianity leads to nonsense predictions or the requirement for transformations that then result in the fitting of a model to a response variable that is less intuitive and not the thing we’re really interested in.

iii) I’m wary of your suggesting that statistics is easy. It isn’t! If we say statistics is easy and people find it difficult, what does that do to their morale? However, statistics isn’t any harder (at an applied level anyway) than anything else we might get students to do (taxonomy, lab chemistry, cell biochemistry, …) The issue isn’t that statistics is easier/harder than these things; part of the problem is that students/colleagues don’t see the importance of learning some basics and putting in the effort required to do it well. Couple this with your observation that we don’t teach statistics well and it is hardly any wonder that students graduate without being able to analyse data or think critically about a data analysis presented to them.

Great comment, and lots to digest here. I guess in terms of your (iii), I’d say that statistics in its richest detail gets hard, because there’s a lot of it and it gets complex. But it has a core, and that core is easy. Does that work for you?

Not quite; I teach the concept you describe so I certainly think it’s easier than teaching n different tests but I wouldn’t suggest that statistics or frequentist inference is easy. I don’t think it is any easier or harder (at this level) than the other things we teach. It’s a minor point but I worry that in telling people that something we know they find difficult — even if you throw out the “teach n different tests” approach — is actually easy, what message are we sending?

Yes, we can teach the concepts in different ways that will have different outcomes in terms of student learning, but (some) students will still find stats “difficult” because it’s mathy or computery or just different from all the other things that got prioritised ahead of the math and the stats.

En passant, it seems to me that that formula is intimidating also because it is cluttered by over-explicit terms (all $s_i^2$ terms appear discounted by sample size, that is, as $\frac{s_i^2}{n_i}$, so why not just give it a name?), inconsistent typographic choices (the fraction appears in both the vertical $\frac{s_i^2}{n_i}$ and the linear $s_i^2 / n_i$ form), and unnecessary complications (I mean, $\frac{\left(\frac{s_i^2}{n_i}\right)^2}{n_i - 1}$ is $\frac{s_i^4}{n_i^2 (n_i - 1)}$, do we really need to express it with a nested fraction?).

That seems to be the case quite often in lecture notes and slides. Maybe a little less often in textbooks, yet many introductory courses do not follow a textbook. In my limited experience only a minority of students actually use textbooks, the large part just rely on lecture notes…

Well, I’m not sure “lazy” is fair. I think there’s a good reason to keep $s^2$ and $n$ both explicit – they are intuitively different quantities, both directly related to the biology of the system, while a compounded variable might not be so clear. I will plead guilty on the inconsistent representation of the fractions, having borrowed the two equations from two different places! (OK, so maybe ‘lazy’ is fair enough there, although I’ll claim that I was being “efficient” instead).

My taste for formulas can be utterly wrong. I tend to think that offering a definition of $\frac{s_i^2}{n_i}$ gives the opportunity to explain its meaning, and afterwards one can express the subsequent couple of equations in a more readable way (as in, I can easily read them aloud). If needed, I would present the two equations in the more term-parsimonious form and ask students to expand them… That said, I might be wrong about this example. Yet I think the point stands.

I found it very convincing but would disagree on two points
> mathematicians are those who succeed in figuring out how to think concretely about things that are abstract
Most mathematicians, as my advanced analysis prof once put it, have a very hard time being concrete; they get stuck in abstractness. The (small) subset of mathematicians who are good at applying math are good at thinking concretely about abstract insight.

> Kill Math
Not any more than killing other mediums of representation (which Bret Victor elsewhere argues is bad). So if your vision or hearing is really poor, kill visual or aural representations if it’s not worth the effort, and math if symbolic abstract representations are similarly so.

On the main topic: No one really understands how to teach statistics to those who are not really good at math and so likely no one really understands statistics (the logic of discovery as Ramsay put it rather than the logic of consistency or math).

Thanks for linking to me, I’ll post some excerpts with comments on errorstatistics.com soon. Quick comment: people need to cast a skeptical eye on work in statistical foundations, and be prepared to think for themselves. For example, the Hurlbert and Lombardi paper just repeats a myth about P-values and error probabilities. See, for example, http://errorstatistics.com/2014/08/17/are-p-values-error-probabilities-installment-1/

I think I have to disagree with the problem as you formulate it. What statistical test you are going to apply should be considered *before* you look at the data, rather than to try and confirm any apparent pattern you see in the data having obtained it. In any reasonably complex data set, there are an almost infinite number of features that could look interesting: are x and y correlated? are the data points clustered? Are they arranged in circles? Do the red points correlate more than the blue ones? Do the data points avoid part of the parameter space? Do they spell out rude words in Morse Code? etc, etc. Even if there is *no* signal at all, if your data contains 20 things that might catch your eye, then you might expect a statistical test on the one that catches your eye to be significant at 95% confidence. The only legitimate way to carry out a statistical test is to consider what hypothesis it is you want to test, figure out what data to collect to do the test and what statistic would validate your hypothesis, *then* collect and analyse the data. And if you do spot something interesting in your data that you weren’t expecting, the only way to test for its reality is by collecting another independent data set to see if it’s still there, since then you are carrying out the proper process of creating your hypothesis (“is the feature that seems to be in my first data set real?”) before you looked at the second data set.
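The arithmetic behind the “20 things that might catch your eye” worry can be worked out directly. The sketch below assumes the 20 candidate patterns are tested independently at the 0.05 level with no true signal in any of them. Both the independence assumption and the function names are illustrative, not part of the comment above.

```python
def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests is 'significant'."""
    return 1 - (1 - alpha) ** n_tests

def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of 'significant' results when every null is true."""
    return n_tests * alpha

any_hit = chance_of_any_false_positive(20)
expected = expected_false_positives(20)
print(round(any_hit, 3), expected)
```

Under these assumptions, a single spurious “discovery” is the expected outcome of scanning 20 patterns in pure noise, and the chance of at least one is well above one half.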

I wouldn’t disagree at all with you here. I did skip over this step in my post, as it wasn’t really the topic; but yes, we should avoid fishing! My point was only that whichever pattern you want to look for (having decided beforehand), the underlying logic and procedure are the same!

SSS: the logic is the same even when there is a problem of data-dependent selections, fishing and the like: the reason the P-value (or other error probability) is altered is intimately related to the properties of testing reasoning. If the hypothesis is arrived at through fishing, say, then finding an impressive looking difference (from null) is fairly probable, even if due to ordinary expected variability. (How probable depends on the type of fishing, how long you fished etc.) Thus the actual P-value is not low, but high. On the other hand, the error probability guarantees will hold approximately when assumptions are satisfied and there is no fishing. So fishing and other biasing selection effects do not break the logic of tests; they change the error probabilities that the same logic requires us to report. You can prove, for example, that with enough properties to distribute, every random sample from a population can have a characteristic shared by all members, even though it is possessed by only 50% of the population (an example from Ronald Giere).
By contrast, if you hold an account of inference where the impact of the evidence enters by way of the likelihood ratio, as with likelihoodist and Bayesian accounts, the difference in the sample space that enables the P-value to register the selection effects above disappears. Inference is conditional on the actual data, so considerations of data other than the one observed are irrelevant. Yet it is precisely such considerations that are vital for computing error probabilities like P-values. This is the gist of the likelihood principle. It is one of the central distinctions between “sampling theory” (frequentist error statistics) and other methods.

Mayo – I generally agree with the things you say about frequentist inference and appreciate that Bayesians often make howlers when criticising frequentist methods but I wish you’d stop making howlers in the other direction. As pointed out by many Bayesians it’s perfectly possible to do things like condition on stopping rules in Bayesian approaches. We would do better to focus more on constructive points of agreement imo.

Nice article and I agree with much of it. RE: the topic of how a mathematician would prefer to teach things, I don’t think many would actually be too different from you. I wrote the following post and then added a postscript connecting to your ‘recipe’ approach, since I had already quoted a recipe book based on category theory (or is it the other way around)!

Another pressure to teach equations is that knowledge of them is easier to assess than understanding of concepts. Knowledge is easier to acquire, too, so many students prefer formula-driven assessment to conceptual assessment, and they feel more confident because the assessment is clearly objective, whereas assessment of concepts tends to be subjective.

I find statistics to be so far over the top in its counter-intuitiveness. Most of the concepts behind the formulas don’t make any sense at all. I am very good at math but I just cannot wrap my mind around concepts in statistics and probability. I’m going for a PhD in computer science but I can’t even understand high school statistics.

Basil – don’t give up. I’d say it’s true that many people find the concepts counterintuitive at first blush. Stick with it, don’t let yourself be intimidated, and you’ll get it. My favourite book to learn from, by the way? Whitlock and Schluter’s Analysis of Biological Data (Amazon.com http://amzn.to/1O39ujb)

And these crib notes from wiki extending on Goodman’s comment “Obviously, Peirce’s clearness’s third grade (the pragmatic, practice-oriented grade) Clearness in virtue of clearness of conceivable practical implications of the object’s conceived effects, such as fosters fruitful reasoning [copied from wiki entry]”

Now, being good at math ain’t enough as math is almost entirely about definitions and working strictly with them (first two grades of clarity). For statistics you really need to grasp conceivable practical implications for fruitful reasoning about variability and uncertainty. Unfortunately, that is seldom covered in intro or even advanced statistics courses as they tend to focus on the maths.

Over at INNGE (http://innge.net) we did a survey on a related topic a few years ago (title: “Lack of quantitative training among early-career ecologists: a survey of the problem and potential solutions”) and published the results in PeerJ (https://peerj.com/articles/285/). Perhaps of interest to some readers of this post; many of the same themes come up.

I took undergrad statistics and it was terrible. The professor used the longest methods of solving a problem. When I went to tutoring they showed me much shorter methods of getting the final answer.
I found it too time consuming and too complicated. Not well explained by the professor. It was the worst math class I ever had.

I love it! I’ve only been teaching stats for a couple of years (to 200 undergrads at a time) but the approach strikes a real chord with me (and not one of those ominous minor ones). I bang on about the need to report “the three Ss” (test Statistic, Sample size & Significance) for every test they report, but stressing the simple 1 problem + 2 steps at the start will (I hope) really drive home why they’re doing that at a much earlier point in the proceedings. My course is then all about getting a data set and following a flow chart to choose the “right” test for the question they want to ask. I’m fully intending to implement this in my introductory lecture, in which I get the students to race maggots in the lecture theatre (don’t tell the cleaners).

Incidentally, I would add the primary teachers to the list of those responsible for numerophobia.

I agree and I don’t. I agree we suck at teaching stats. It’s something to do with explaining stats, and I’ve never had a teacher or read a book that does it well. I think you are arguing to use more words or different words to explain stats. The math is extremely simple; I don’t think anyone should be put off by basic high school math. Maybe you should teach all the math first. I think statisticians want to explain how to use the metrics before they’ve explained the math. I also find they start discussing how the statistical method or model is going to be used or interpreted before they finish explaining its most basic use. There are also way too many words in a stats text, written by someone who sucks at writing. So maybe that’s where the confusion is.
