Goals

Review the probability distribution functions in R

Explore the Central Limit Theorem

library(tidyverse)

Distributions in R

The Normal distribution

You’ve now seen (in lecture and homework) four functions that relate to the Normal distribution: rnorm(), pnorm(), qnorm(), dnorm(). They respectively generate a sample, find the area under the curve, find quantiles, and find the density.

You can see the help for all four on one page:

?rnorm

You’ll notice all four take the arguments mean and sd which allow you to specify the desired mean (\(\mu\)) and standard deviation (\(\sigma\)).

Review your understanding by finding:

The area to the left of 1 under a Normal\((\mu = 3, \sigma^2 = 4)\).

The value, \(z\), such that the area under a Normal\((\mu = 100, \sigma^2 = 100)\) to the left of \(z\) is 0.05.

The value, \(z\), such that the area under a Normal\((\mu = 100, \sigma^2 = 100)\) to the right of \(z\) is 0.05.

A sample of size 10 from a Normal\((\mu = 1, \sigma = 0.5)\)

Other distributions

R has a whole host of other distribution functions. You can see a list of those built in at:

?Distributions

And they tend to all take a similar form, a function that starts with: r to generate samples, p for probabilities, q for quantiles and d for densities.

What function generates a sample from a Binomial distribution?

Take a look at:

?dexp

What distribution is this?

What parameter can you specify? What is the default value of this parameter?

Adding a density to a histogram

When we are investigating properties of distributions by simulation, it’s often useful to be able to add a density from a known distribution to a histogram of values.

To illustrate let’s generate a sample of 1000 from a Normal\((\mu = 1, \sigma^2 = 2)\):

norm_sample <- rnorm(1000, mean = 1, sd = sqrt(2))

We’ve seen you can create a histogram

ggplot() +
geom_histogram(aes(x = norm_sample))

But, say we want to overlay the density these samples were drawn from. In practice, you’ll be overlaying a density that may approximate the density the samples are drawn from - here it is exactly the density the samples are drawn from.

One complication, is that the area under a density is 1, but the area under a histogram depends on the sample size and the binwidth. To place both a histogram and a density function on the same scale, its useful to plot the histogram using density on the y-axis, rather than count.

You can do that by specifying y = ..density.. is the mapping:

ggplot() +
geom_histogram(aes(x = norm_sample, y = ..density..))

..density.. is a variable that gets calculated by geom_histogram(). The double dots (..) are just there to avoid a calculated variable conflicting with a name of an existing variable. You can see the computed variables available on the help page for geoms, i.e. ?geom_histogram.

If you look at the histogram, you’ll notice the scale of the y-axis has changed. It can now be used like a probability density function, in that areas under the curve can be used to estimate probabilities.

There a two ways to overlay a density: use a function that adds a line corresponding to a function to a plot, or generate a grid of points and connect them with a line. The first is generally easier, but more restrictive, so occasionally you’ll need the second.

stat_function()

stat_function() is a ggplot2 function that plots a line corresponding to its fun argument to an existing plot.

What function do we want to add: the Normal density, dnorm. We simply add it (using +) to our existing histogram making sure the x aesthetic is the same as for our histogram:

We can plug in different populations and sample sizes to explore how fast the approximation holds. Try substituting rexp(n = 10) for <<SAMPLE FROM POPULATION>> to explore the sampling distribution for the sample mean with samples of size 10 from an Exponential(1) population: