The gganimate package in action: Probability theory

Peter Kamerman

17 May 2016

Background

(You can grab the full R script from GitHub, or view the RMarkdown output - with code - at RStudio Connect).

I have wanted to try David Robinson’s (@drob) gganimate package since I came across it a few months ago. The package extends Hadley Wickham’s (@hadleywickham) ggplot2 package by adding a ‘frame’ aesthetic that provides a time factor to geoms, allowing them to be animated.

Recently, the opportunity to put the package through its paces arose while I was preparing materials for an introductory biostatistics tutorial for undergrad students. I wanted to demonstrate the central limit theorem and law of large numbers, and thought that animations would help deliver the message.

The central limit theorem provides a shortcut to knowing the sampling distribution, which is the probability distribution of a statistic (e.g., mean, proportion) for all possible samples from a population. The theorem is one of the cornerstones of probability theory because it allows you to make statements about the sampling distribution for a particular statistic without having to sample the entire population. As such, it forms the basis of statistical inference.

The central limit theorem states that the sampling distribution of the mean of an independent, random variable is normal or near-normal, provided the samples are sufficiently large. If the sampling distribution for a statistic follows a normal or near-normal distribution we can make probability statements about the range of values in which the statistic lies. For example, there is a 68% probability that the sample statistic is within 1 standard deviation of the population value, and a 95% probability that it lies within about 2 standard deviations (see figure below). In the case of the sampling distribution, the standard deviation of the distribution is also called the standard error.
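The 68% and 95% figures come straight from the standard normal distribution, and can be checked in base R with pnorm (this check is mine, not part of the tutorial script):

```r
# Probability that a normally distributed statistic falls within
# 1 and 2 standard deviations of its mean
within_1_sd <- pnorm(1) - pnorm(-1)
within_2_sd <- pnorm(2) - pnorm(-2)
round(c(within_1_sd, within_2_sd), 3)  # approximately 0.683 and 0.954
```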

The size of the standard error also allows us to gauge the precision of the sample statistic, and this ‘width’ of the sampling distribution depends on the size of the samples. From a technical point of view, the standard error of the sample statistic is equal to the standard deviation of the population divided by the square root of the sample size (\(se = \frac{\sigma}{\sqrt{n}}\)). Basically, as the size of each sample increases the samples are more likely to be representative of the population, and therefore variability around the point estimate should decrease. The figure below shows the effect of increasing sample size on the standard error, and hence on the precision of the estimate.
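A quick illustration of the \(se = \frac{\sigma}{\sqrt{n}}\) relationship (a base R sketch of the formula, not code from the original script):

```r
# Standard error of the mean: population sd divided by sqrt(sample size)
se <- function(sigma, n) sigma / sqrt(n)

# With a population sd of 1, quadrupling the sample size halves the SE
se(1, 25)   # 0.2
se(1, 100)  # 0.1
se(1, 400)  # 0.05
```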

This leads us to the law of large numbers. At a simplistic level, the central limit theorem tells us about the shape of the sampling distribution, while the law of large numbers tells us where the centre of the distribution lies. From the law of large numbers we can show that the cumulative result from a large number of trials/samples tends towards the true value in the population. That is, the probability of an unusual outcome becomes smaller as the number of trials/samples increases. This convergence in probability is also called the weak law of large numbers.

gganimate: the theorems in action

I used the gg_animate function from the gganimate package to put the two theorems into action.

Our population

The central limit theorem holds across different distributions, and to illustrate this point I started with a population generated by taking a random sample (N = 200,000) from the exponential distribution (right-skewed) with rate 1.
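Generating such a population is a one-liner in R (a minimal sketch; the seed here is arbitrary, since the original script's seed isn't shown in this post):

```r
set.seed(1234)  # arbitrary seed for reproducibility; not the original script's
population <- rexp(n = 200000, rate = 1)

# For a rate-1 exponential the theoretical mean and sd are both 1,
# so the mean of this large population should sit very close to 1
mean(population)
```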

The density distribution of this dataset is shown in the figure below (the mean is marked by the orange line), and it served as the population from which samples were taken to demonstrate the central limit theorem and the law of large numbers.

The central limit theorem in action

To demonstrate the central limit theorem, I took 5000 samples (without replacement) of n = 200 each from the ‘population’ of 200,000, and calculated the mean for each sample. I then tabulated the frequency at which sample means occurred across the 5000 samples, and used these data to plot a frequency histogram. However, in addition to the usual ggplot2 code, I added the gganimate aesthetics (‘frame’ and ‘cumulative’) to the geom_bar aesthetics. Adding these new aesthetics allowed the gg_animate function to take the ggplot2 object and sequentially add the frequency bins of sample means as frames in an animation (‘frame’), with each frame building cumulatively on the previous one (‘cumulative = TRUE’).
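The sampling step can be sketched in base R as follows (a sketch, not the original script; the seed is arbitrary, and the plotting/gg_animate calls are indicated only as comments):

```r
set.seed(1234)  # arbitrary seed, for reproducibility of this sketch
population <- rexp(n = 200000, rate = 1)

# Draw 5000 samples of n = 200 (each sample drawn without replacement)
# and record each sample's mean
sample_means <- replicate(5000, mean(sample(population, size = 200)))

# Despite the right-skewed population, the sampling distribution is
# roughly normal: centred near 1, with se about 1 / sqrt(200) (~0.071)
mean(sample_means)
sd(sample_means)

# In the full script these means are binned and plotted with geom_bar,
# with the gganimate 'frame' and 'cumulative = TRUE' aesthetics mapped
# so that gg_animate adds the frequency bins frame by frame
```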

The output is shown in the next figure, and as you can see, despite the samples being obtained from a right-skewed distribution, the distribution of sample means is roughly normal, and centred around the mean of the population (orange line).

The law of large numbers in action

To illustrate the law of large numbers, I took the 5000 sample means I had generated and calculated a cumulative mean across the samples. The cumulative mean was then plotted against the sample number in ggplot2, with the gganimate ‘frame’ and ‘cumulative’ aesthetics added to geom_line, which allowed gg_animate to depict the changing cumulative mean across increasing sample number as frames in the animation.
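The cumulative-mean calculation itself is simple (again a base R sketch rather than the original script, with the animation step indicated only in comments):

```r
set.seed(1234)  # arbitrary seed for this sketch
population <- rexp(n = 200000, rate = 1)
sample_means <- replicate(5000, mean(sample(population, size = 200)))

# Running (cumulative) mean of the sample means across sample number
cumulative_mean <- cumsum(sample_means) / seq_along(sample_means)

# The running mean converges on the population mean (1) as samples accrue
tail(cumulative_mean, 1)

# In the full script, cumulative_mean is plotted against sample number
# with geom_line plus the gganimate 'frame'/'cumulative' aesthetics
```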

The resulting figure is shown below, and shows the cumulative mean of sample means getting closer to the population mean as the number of samples increases.

So there you have it, the central limit theorem and the law of large numbers graphically illustrated using the awesome gganimate package, which, in combination with ggplot2, makes creating animated plots simple.