[NOTE: s is used to represent the corrected sample standard deviation (dividing by n-1) – in contrast to the population or uncorrected sample standard deviation (dividing by n)]

The formula to compute the standard deviation from a sample of data looks complex – ugly even. But understanding the notion of variation (e.g., standard deviation) is fundamental to statistical thinking.

At its core, the computation for sample standard deviation is very sensible (like so many other complex-appearing mathematical formulas, e.g., the distance formula). Essentially, the formula computes a typical deviation – on “average”, how far are the data from the mean? This is the reason for computing the difference between each data point and the mean. Squaring the deviations (and taking the square root at the end) is important: without it, the positive and negative deviations cancel, and their sum is always zero – which would be a useless calculation. [Absolute values can be used instead of squares – i.e., mean absolute deviation – though squaring is often preferred because, in contrast to absolute value, squaring is a smooth function with nice derivatives.]
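To make the steps concrete, here is a minimal sketch in Python. The dataset and the function name `sample_std` are made up for illustration; the point is only to show that the raw deviations sum to zero, and that squaring, averaging over n-1, and square rooting gives a usable "typical" deviation.

```python
import math

def sample_std(data):
    """Corrected sample standard deviation (divides by n - 1)."""
    n = len(data)
    mean = sum(data) / n
    deviations = [x - mean for x in data]   # distance of each point from the mean
    squared = [d ** 2 for d in deviations]  # squaring removes the signs
    return math.sqrt(sum(squared) / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5
mean = sum(data) / len(data)
deviations = [x - mean for x in data]

print(sum(deviations))   # the raw deviations cancel out: 0.0
print(sample_std(data))  # sqrt(32 / 7), about 2.14
```

Swapping `d ** 2` for `abs(d)` (and dropping the square root) would give a mean absolute deviation instead.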

Yet, if the goal is an “average” deviation, why would you divide by n-1 (and not n)? When you compute the mean, or average, you sum all the values and divide by n, so why does the sample standard deviation divide by n-1? This is an important question, and one that mathematics and statistics educators need to be ready to answer.

Some might say something to the effect of, well it has to do with the “degrees of freedom” – there are n-1 degrees of freedom, so we divide by n-1. This is, to some degree, explanatory. But for the most part, this response feels like a smoke screen – it masks the explanation with sophisticated jargon, to cover up the fact that, well, it’s complicated. It’s a torero holding out a muleta as though the real deal were just behind it. Others might say that dividing by n-1 results in a slightly larger value than dividing by n, which provides a little bit of “wiggle room” for describing a typical deviation (e.g., data are unpredictable, so we need some buffer built into our statistics). Again, partly true. But then why not n-2? Might that be even better? Or why not n-1.5 (which is, in fact, at times better)? The real truth about variance and standard deviation comes from a fundamental distinction between descriptive and inferential statistics.

Descriptive statistics attempt to describe a dataset. Mean, median, and mode can be regarded as a typical value – a measure of central tendency. In this sense, the notion of standard deviation acts as a description of how spread out (varied) the data are from the mean. It’s a measure of spread – and one in which the units are the same as the data. With data that have a normal distribution, “fatter” distributions have more variance and thus a larger standard deviation, and “skinnier” distributions have less variance and thus a smaller standard deviation. It is a measure of a typical deviation from the mean, and as such plays a descriptive role – describing a feature (spread) of the dataset’s distribution. But if this is the case, why don’t we divide by n? Well, in fact, we might – if the data comprised an entire population, and not just a sample.
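As a quick illustration of the two divisors, Python's standard `statistics` module offers both versions: `pstdev` divides by n (treating the data as a whole population) and `stdev` divides by n-1 (treating the data as a sample). The dataset below is made up for the example.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

population_sd = statistics.pstdev(data)  # divides by n
sample_sd = statistics.stdev(data)       # divides by n - 1

n = len(data)
ratio = sample_sd / population_sd        # always sqrt(n / (n - 1))

print(population_sd, sample_sd)
print(ratio, math.sqrt(n / (n - 1)))
```

Since the two values differ only by the factor sqrt(n/(n-1)), the gap shrinks as the dataset grows.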

Inferential statistics attempt to infer information from a smaller sample to an entire population. They are a “best guess”. And this, in fact, is the most common explanation for why the standard deviation formula divides by n-1. It has to do with the fact that this value has less to do with describing the current dataset and more to do with inferring something about the population’s dataset. If you really only want to describe the current dataset, then a true “average” deviation (dividing by n) would likely be better. However, dividing by n-1 instead of n gives us very similar numbers (many times within hundredths of each other) – so much so that some argue for just dividing by n in most cases – and so the corrected sample standard deviation (dividing by n-1) also gives a near-descriptive statistic for a dataset. But in fact, its main purpose is inferential. Based on the arithmetic of expected values, the square of the standard deviation, s^2 (and not the square of the true “average” deviation), is an unbiased estimator for the population parameter, variance. (Proof Unbiased Estimator.) In general, standard deviation is referred to more often than variance – because it is simpler to grasp conceptually (i.e., same units as data) – but, in fact, it gets its calculation primarily from the fact that computing variance in this way (dividing by n-1) gives an unbiased estimate of the population’s variance. (The same is not quite true for standard deviation: dividing by n-1 gives a less biased estimate of the population standard deviation than dividing by n, but s still tends to slightly underestimate it.) Briefly, we note that there are times when it may be preferable to use other estimators; the maximum likelihood estimator for the variance of normally distributed data, for example, divides by n, and has a lower mean squared error. The primary point, however, is that the statistics we use are often selected for their inferential ability to estimate, not just their descriptive power.
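The unbiasedness claim can be checked with a small simulation. The sketch below is only an illustration under assumed conditions: the "population" is a fair six-sided die (variance 35/12), each sample has n = 5 rolls, and the seed and number of trials are arbitrary choices. Averaged over many samples, dividing by n-1 should land near the true variance, while dividing by n should land near (n-1)/n times the true variance.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

faces = [1, 2, 3, 4, 5, 6]  # assumed population: a fair die
mu = sum(faces) / len(faces)
sigma2 = sum((x - mu) ** 2 for x in faces) / len(faces)  # true variance, 35/12

n = 5           # size of each sample
trials = 20000  # number of samples drawn
total_div_n = 0.0
total_div_n_minus_1 = 0.0

for _ in range(trials):
    sample = [rng.choice(faces) for _ in range(n)]
    mean = sum(sample) / n
    sum_sq = sum((x - mean) ** 2 for x in sample)
    total_div_n += sum_sq / n
    total_div_n_minus_1 += sum_sq / (n - 1)

avg_div_n = total_div_n / trials
avg_div_n_minus_1 = total_div_n_minus_1 / trials

print("true variance:        ", sigma2)
print("average, divide by n-1:", avg_div_n_minus_1)
print("average, divide by n:  ", avg_div_n)
```

Any single sample can still miss badly in either direction; "unbiased" only describes the long-run average of the estimates.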

Sample standard deviation is prevalent in statistics for a variety of reasons. Firstly, it is rare that one ever has an entire population’s data. It is much more common to have a sample. And secondly, standard deviation is linked to one of the fundamental theorems in probability and statistics: the Central Limit Theorem (CLT). The CLT indicates that, regardless of the underlying distribution of a population’s data (provided it has a finite mean, mu, and variance, sigma^2), the mean of a random sample of n independent and identically distributed observations will, for sufficiently large n (a common rule of thumb is n greater than about 30), have an approximately normal distribution, N(mu, sigma^2/n). Given that the square of the sample standard deviation, s^2, is an unbiased estimator of sigma^2 (variance), and as a result of the CLT, the computation s/sqrt(n) (the standard error of the mean) is frequently used to provide confidence intervals for the true mean of a population. It is fairly incredible that from a single sample of, say, 100 people, we can provide a range that, with relatively high confidence (most frequently, 95% is used), contains the true population mean.
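Here is a rough sketch of how s/sqrt(n) turns into a confidence interval, and of what "95%" means in practice. Everything specific is an assumption made for the example: the helper name `mean_confidence_interval`, the normal-based multiplier z = 1.96, and a deliberately non-normal (exponential) population with true mean 10. The simulation draws many samples of 100 and counts how often the interval captures the true mean.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% interval for the mean: sample mean +/- z * s / sqrt(n)."""
    n = len(sample)
    center = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)  # s / sqrt(n)
    return center - z * standard_error, center + z * standard_error

rng = random.Random(1)  # fixed seed so the run is repeatable
true_mean = 10.0        # assumed population mean (exponential distribution)
trials = 2000
hits = 0

for _ in range(trials):
    sample = [rng.expovariate(1 / true_mean) for _ in range(100)]
    low, high = mean_confidence_interval(sample)
    if low <= true_mean <= high:
        hits += 1

coverage = hits / trials
print(coverage)  # fraction of intervals that contained the true mean
```

The coverage should come out close to 0.95, even though the underlying data are far from normal. The 95% describes the procedure across many samples, not the chance that any one computed interval is right.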

In closing, it is important to recognize that statistics are not always meant to be descriptive; often they serve a primarily inferential role. But this difference is not always that intuitive, for students or for teachers (e.g., Casey & Wasserman, 2015). This distinction must be made clearer. Although the issues between standard deviation, variance, bias, and estimators are more nuanced, the broader idea that statistics are computed to have inferential meanings, and not just descriptive ones, is critical. And such key understandings must serve to guide our instruction. Otherwise we, as educators, risk providing students with smoke screens as a substitute for real reasoning and understanding.
