The Omnipotence of Random Sampling Distributions

Every answer to statistical problems lies within RSD

As I was teaching class the other day, I told the students I was going to reveal to them the one secret they needed to learn to understand every statistical test they would ever use. The secret was the one thing that would make statistics more of a reasonable science than a bunch of equations to memorize, the one thing they needed to pass my class. (OK, there is a lot more needed to pass the class, but without this one thing doing so is a lot harder.)

When I was an undergrad taking my probability and statistics course, I was definitely in “memorize equations” mode. I didn’t really understand why sometimes we divided by the square root of n and sometimes we didn’t for the z- or t-score. I got an A in that class, but I really didn’t “get” it.

It wasn’t until years later, as I took a series of classes in experimental design, that I finally understood that the answer to life, the universe, and everything statistical was the random sampling distribution. (What did you think it was going to be?)

For instance, if I have a population about which I want to know something, I usually can’t test it in its entirety—the population might extend into the future, so how am I going to test that? Instead, I work with what I have, a subset of the population called the “research population,” from which I can sample. I sure hope the research population does a good job representing the population, otherwise I lack external validity for any conclusions I make.

Well, that research population is still pretty big, and I really don’t want to sample all production for this month, so I take a sample from the research population. This sample can be grabbing the top 10 in a box, in which case I have no guarantee that the top 10 weren’t the lightest or whatever; or I can take a random sample and hope that the sample represents the population. There are a number of clever ways to get a random sample but that might be a topic for another article.

So in CSI terms, I have kind of a chain of evidence, population to research population to sample, as illustrated in figure 1:

Figure 1: A sampling chain of evidence

At this point, if I have taken a good sample, my random sample tells me something about the research population, which tells me something about the population as a whole. It’s kind of roundabout, I know, but the alternative is measuring the whole population, which isn’t really an economically viable alternative.

Now I get to calculate some numbers that are related to three of the four aspects of any data set that we need to know: shape, spread, and location. (The last one, stability over time, is shown via a control chart.) Because these characteristics are calculated from the sample, we call them “statistics,” and they could include skewness and kurtosis (i.e., shape); range, standard deviation, and interquartile range (spread); and mean, median, or mode (location).

But I don’t really care about the sample; that bus has come and gone. What I care about is the population. So I need to know how the sample statistics are related to the population parameters, which are what I really care about. Once I do know, I can make some inferences as to what those parameters are.

If I did everything correctly and I am a little lucky, my inferences about the population parameters are close to the real population parameters. But once I move away from my sample, things get fuzzier due to differences between the sample statistics and the population from which they were drawn. This is called “sampling error.” I don’t expect to get exactly the average of the population with my sample, but it turns out that sampling error is quantifiable with a certain probability, so I can bound my estimate with a certain level of known sampling error. (This is where confidence intervals come from.)

The whole chain of evidence pointing back to the perpetrator would look something like figure 2:

Figure 2: Inferences about population based on random samples

If I messed up at some point in the chain, then everything from that point forward is pretty suspect.

OK, so the distribution of my samples should look something like the distribution of my population, right? Let’s simulate a normal population with an average of 100 and a standard deviation of 10, and take some samples from it. The result would look like figure 3:

Figure 3: Samples from a simulated normal population (mean 100, standard deviation 10)
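A minimal sketch of that simulation, using only Python's standard library. The seed, the choice of five samples, and the sample size of 10 are assumptions made for illustration; this is not necessarily how the figures in this article were produced.

```python
import random
import statistics

rng = random.Random(42)             # fixed seed (an assumption) so the run is repeatable
POP_MEAN, POP_SD, N = 100.0, 10.0, 10

def draw_sample(n=N):
    """Draw one random sample of n individuals from the simulated normal population."""
    return [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]

# Five small samples from the same population: each one looks a little different.
samples = [draw_sample() for _ in range(5)]
for i, s in enumerate(samples, start=1):
    print(f"sample {i}: mean = {statistics.mean(s):6.1f}, sd = {statistics.stdev(s):5.1f}")
```

Each sample's mean and standard deviation wander around 100 and 10 without hitting them exactly, which is the sampling error described earlier.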

However, I don’t want to spend the rest of my life taking a bunch of samples; I only want to take one sample. If we can find some relationship between the samples and the population, then maybe I can get away with using fewer, maybe even using only one.

So imagine we take all possible samples of size 10. All those individuals will look exactly like the big honkin’ distribution (BHD) in figure 3, right? But what if I make a distribution of the averages of all possible samples? Well, it makes sense that such a distribution of averages is still going to have the same average as the individuals, but since I am taking the average of 10, I’ll bet the averages are closer to the real average, on average. (If that sentence makes you cross your eyes, my work here is done.)

Figure 4 illustrates this distribution of the sample averages, cleverly called the “random sampling distribution (RSD) of the means.”

Figure 4: Random sampling distribution of the means

I simulated only 15,000 means, but it should be pretty close to the random sampling distribution, which is all possible means of the given sample size of 10.
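Here is a rough sketch of how such an approximation to the RSD of the means could be built. The seed and the text-histogram helper are assumptions added for illustration; only the population (mean 100, standard deviation 10), the sample size of 10, and the 15,000 samples come from the text.

```python
import random
import statistics

rng = random.Random(2024)           # arbitrary seed, assumed for repeatability
POP_MEAN, POP_SD = 100.0, 10.0      # the simulated BHD from the text
N, N_SAMPLES = 10, 15_000           # sample size and number of simulated samples

# 15,000 individuals, for comparison with 15,000 sample means.
individuals = [rng.gauss(POP_MEAN, POP_SD) for _ in range(N_SAMPLES)]
means = [
    statistics.fmean([rng.gauss(POP_MEAN, POP_SD) for _ in range(N)])
    for _ in range(N_SAMPLES)
]

def text_histogram(values, lo=60.0, hi=140.0, bins=16, width=50):
    """Print a crude horizontal histogram so no plotting library is needed."""
    counts = [0] * bins
    step = (hi - lo) / bins
    for v in values:
        if lo <= v < hi:
            counts[min(bins - 1, int((v - lo) / step))] += 1
    scale = max(counts) / width
    for i, c in enumerate(counts):
        print(f"{lo + i * step:6.1f} | {'#' * int(c / scale)}")

print("Individuals:")
text_histogram(individuals)
print("Means of samples of size 10:")
text_histogram(means)
```

The second histogram is centered in the same place as the first but is much narrower.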

Now I see a much narrower distribution than I saw on the BHD or any individual sample (figure 3). It makes sense because I’m making a histogram of the means of each sample, right?

So why do I care? Because any time I take a sample of size 10 from my population, its average is going to fall somewhere on that little RSD. The RSD of the means tends to be normally distributed regardless of the shape of the population, provided the sample size is large enough (this is the central limit theorem). And because a normal curve is fully defined by its mean (which we know) and its standard deviation, that leaves only one more step before we can fully describe the RSD of the means.
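To see the "regardless of the shape of the population" part in action, here is a small sketch that swaps in a strongly skewed population. The exponential population, the sample sizes, and the seed are all assumptions chosen for illustration; they do not come from the article's simulation.

```python
import random
import statistics

rng = random.Random(7)              # arbitrary seed, assumed for repeatability

def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    m = statistics.fmean(values)
    m2 = statistics.fmean([(v - m) ** 2 for v in values])
    m3 = statistics.fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

def rsd_of_means(n, n_samples=5_000):
    """Means of n_samples random samples of size n from an exponential population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]

# n = 1 is just the skewed population itself; larger n gives a more symmetric RSD.
skew_by_n = {n: skewness(rsd_of_means(n)) for n in (1, 5, 30)}
for n, g in skew_by_n.items():
    print(f"n = {n:2d}: skewness of the sample means = {g:5.2f}")
```

The skewness of the means shrinks toward zero as the sample size grows, which is the sense in which the RSD of the means "tends to" normal.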

If we look at the distribution of the individuals, we know that the mean is 100 and the standard deviation is 10. Figure 5 shows the statistics from 15,000 samples from that population:

Figure 5: Statistics from 15,000 samples

The mean is pretty darn close to 100, as we expected, but the standard deviation of the means is a lot smaller than the 10 we know it to be for the population. That’s because, as we take larger and larger samples, the mean of the sample is going to get closer and closer to the real mean. Theory tells us the standard deviation of the RSD of the means (the standard error of the mean) is σ/√n, which here is 10/√10, or about 3.16.

We see that our standard deviation is again pretty close to what we would have expected.
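This is also where the "divide by the square root of n" from my undergrad days comes from: it applies when the statistic is a mean, and it does not when you are looking at a single individual (n = 1). A quick sketch comparing simulated and theoretical values of σ/√n follows; the sample sizes other than 10, the number of samples, and the seed are assumptions.

```python
import math
import random
import statistics

rng = random.Random(11)             # arbitrary seed, assumed for repeatability
POP_MEAN, POP_SD = 100.0, 10.0
N_SAMPLES = 5_000                   # fewer than the article's 15,000, to keep it quick

def sd_of_means(n):
    """Standard deviation of the means of N_SAMPLES random samples of size n."""
    means = [
        statistics.fmean([rng.gauss(POP_MEAN, POP_SD) for _ in range(n)])
        for _ in range(N_SAMPLES)
    ]
    return statistics.stdev(means)

# Map each sample size to (simulated SD of the means, theoretical sigma / sqrt(n)).
results = {n: (sd_of_means(n), POP_SD / math.sqrt(n)) for n in (1, 4, 10, 25)}
for n, (observed, theory) in results.items():
    print(f"n = {n:2d}: simulated = {observed:5.2f}, sigma/sqrt(n) = {theory:5.2f}")
```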

Now this is only valid for the RSD of the means. The RSDs of other statistics look different, as shown in figure 6:

Figure 6: Random sampling distribution of the range

The RSD of the range is positively skewed and leptokurtic. There is no population parameter for range for a normal distribution, but you can use the average range to infer something about the standard deviation, like we do in statistical process control (SPC). Remember: σ is estimated by R̄/d2, where R̄ is the average range and d2 is a constant that depends on the sample size.

For a sample size of 10, d2 is 3.078. If we take our average range and divide by that, we get 10.04, again pretty close to what we know σ to be.
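A sketch of the same calculation in code: simulate the RSD of the range, then divide the average range by d2. The value 3.078 is the d2 quoted above; the seed is an assumption, so the result will not match 10.04 exactly.

```python
import random
import statistics

rng = random.Random(3)              # arbitrary seed, assumed for repeatability
POP_MEAN, POP_SD = 100.0, 10.0
N, N_SAMPLES = 10, 15_000
D2_N10 = 3.078                      # d2 for samples of size 10, as quoted in the text

ranges = []
for _ in range(N_SAMPLES):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    ranges.append(max(sample) - min(sample))   # range of this one sample

avg_range = statistics.fmean(ranges)
sigma_hat = avg_range / D2_N10      # SPC-style estimate of sigma from the average range
print(f"average range = {avg_range:.2f}, estimated sigma = {sigma_hat:.2f}")
```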

Now let's look at the RSD for standard deviations.

Figure 7: Random sampling distribution of the standard deviation

Notice how the mean of the RSD of the standard deviations falls short of the real value? That is why the average standard deviation is a “biased” measure of σ. The sample variance, though, is unbiased. Weird, huh? It is positively skewed. (Get it?) Remember this from SPC, though? σ is estimated by s̄/c4, where s̄ is the average standard deviation and c4 is a constant that depends on the sample size.

For a sample size of 10, c4 = 0.9727, so using our average we would estimate σ to be 10.02.
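The sketch below checks both claims: the average sample standard deviation comes up short of σ until it is divided by c4, while the average sample variance lands near σ² with no correction. The closed-form expression for c4 is the standard one for a normal population; the seed is an assumption, so the numbers will differ slightly from those quoted above.

```python
import math
import random
import statistics

rng = random.Random(8)              # arbitrary seed, assumed for repeatability
POP_MEAN, POP_SD = 100.0, 10.0
N, N_SAMPLES = 10, 15_000

def c4(n):
    """Bias-correction constant for the sample standard deviation of normal data."""
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

sds, variances = [], []
for _ in range(N_SAMPLES):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    sds.append(statistics.stdev(sample))            # n - 1 in the denominator
    variances.append(statistics.variance(sample))   # n - 1 in the denominator

mean_s = statistics.fmean(sds)
mean_var = statistics.fmean(variances)
print(f"c4({N}) = {c4(N):.4f}")
print(f"average s = {mean_s:.2f} (biased low); average s / c4 = {mean_s / c4(N):.2f}")
print(f"average variance = {mean_var:.1f} (sigma squared is {POP_SD ** 2:.0f})")
```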

Moving on to the RSD for skewness:

Figure 8: Random sampling distribution of the skewness

The RSD of the skewness looks to the eye to be normal-shaped, which is why we never rely on our eyes to do statistical tests. It is symmetrical, but it is leptokurtic, not normal. This is an unbiased estimator, so the average skewness is pretty close to the zero that we know it to be.

How about kurtosis?

Figure 9: RSD of the kurtosis

The RSD of the kurtosis is positively skewed and is very leptokurtic. It, too, is unbiased, so we are pretty close to zero.
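For anyone who wants to reproduce the last two RSDs, here is a sketch. It assumes the adjusted (small-sample corrected) skewness and excess-kurtosis formulas found in many spreadsheet and statistics packages; the article does not say which formulas were used, so treat that choice, and the seed, as assumptions.

```python
import random
import statistics

rng = random.Random(13)             # arbitrary seed, assumed for repeatability
POP_MEAN, POP_SD = 100.0, 10.0
N, N_SAMPLES = 10, 15_000

def skew_and_kurtosis(x):
    """Adjusted sample skewness and excess kurtosis (assumed estimator choice)."""
    n = len(x)
    m = statistics.fmean(x)
    s = statistics.stdev(x)
    z3 = sum(((v - m) / s) ** 3 for v in x)
    z4 = sum(((v - m) / s) ** 4 for v in x)
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * z4 \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return skew, kurt

skews, kurts = [], []
for _ in range(N_SAMPLES):
    g1, g2 = skew_and_kurtosis([rng.gauss(POP_MEAN, POP_SD) for _ in range(N)])
    skews.append(g1)
    kurts.append(g2)

mean_skew = statistics.fmean(skews)
mean_kurt = statistics.fmean(kurts)
print(f"average skewness = {mean_skew:+.3f}, average excess kurtosis = {mean_kurt:+.3f}")
print(f"spread of the skewness RSD (sd) = {statistics.stdev(skews):.2f}")
print(f"spread of the kurtosis RSD (sd) = {statistics.stdev(kurts):.2f}")
```

Both averages sit near zero, but the spread shows how far a single sample of 10 can land from it.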

Again, why do I care? Remember, RSDs are theoretical distributions of a statistic across every possible sample of size n for a given BHD, and I can build one without even taking a sample. Then when I take a sample, its statistic falls somewhere in that distribution, provided the BHD really is what I assumed it to be. This is the basis for hypothesis tests. Looking at all the goofy shapes of the RSDs above, you can understand that there are going to be different tests for the different parameters. Also, once we get a sample, we can bound the error of the parameter estimate if we understand the RSD, which is where confidence intervals come from.
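As a last sketch, here is the confidence-interval idea made concrete for the mean. It assumes σ is known to be 10, which is a simplification that lets us use the normal RSD directly; with σ estimated from the sample you would use the t distribution instead. The 95-percent level, the number of samples, and the seed are also assumptions.

```python
import math
import random
import statistics

rng = random.Random(5)              # arbitrary seed, assumed for repeatability
POP_MEAN, POP_SD = 100.0, 10.0      # sigma treated as known (a simplifying assumption)
N, N_SAMPLES = 10, 10_000

z = statistics.NormalDist().inv_cdf(0.975)    # about 1.96 for a 95% interval
half_width = z * POP_SD / math.sqrt(N)        # z times the SD of the RSD of the means

hits = 0
for _ in range(N_SAMPLES):
    xbar = statistics.fmean([rng.gauss(POP_MEAN, POP_SD) for _ in range(N)])
    # Does the interval built around this one sample mean capture the true mean?
    if xbar - half_width <= POP_MEAN <= xbar + half_width:
        hits += 1

coverage = hits / N_SAMPLES
print(f"interval half-width = {half_width:.2f}")
print(f"share of intervals containing the true mean = {coverage:.3f}")
```

About 95 percent of the intervals capture the true mean, which is exactly what the RSD of the means says should happen.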

So the idea of the RSD underlies just about everything in inferential statistics. And yet, in my experience, people who use statistics daily often have no clue what an RSD is. That means they are probably making very costly mistakes without ever knowing it.