Summer 2010 — R: ggplot2 Intro

Intro

When it comes to producing graphics in R, there are basically three options for your average user.

base graphics

lattice

ggplot2

I've written up a pretty comprehensive description for use of base graphics here, and don't intend to extend
beyond that. Base graphics are attractive and flexible, but when it comes to creating more complex plots, like
this one, the code to create them becomes more cumbersome.

Both lattice and ggplot2 make creating plots of multivariate data easier. However, I find it easier
to create customized and novel plots with ggplot2 than lattice, and its syntax is more sensible to me.
This may qualify as a matter of taste, but for this reason we'll focus on creating graphics with ggplot2.

The website for ggplot2 is here: http://had.co.nz/ggplot2/.
I would highly suggest getting a copy of the manual:
Amazon (as of July 2010, it looks like you can buy it new for cheaper than used!).

ggplot2 Basics

ggplot2 is meant to be an implementation of
the Grammar of Graphics, hence gg-plot. The basic notion is that there is a grammar to the composition of graphical components
in statistical graphics, and by directly controlling that grammar, you can generate a large set of carefully constructed graphics tailored
to your particular needs. Each component is added to the plot as a layer.

Plots convey information through various aspects of their aesthetics. Some aesthetics that plots use are:

x position

y position

size of elements

shape of elements

color of elements

The elements in a plot are geometric shapes, like

points

lines

line segments

bars

text

Some of these geometries have their own particular aesthetics. For instance:

points: point shape, point size

lines: line type, line weight

bars: y minimum, y maximum, fill color, outline color

text: label value

There are other basics of these graphics that you can adjust, like the scaling of the aesthetics, and the
positions of the geometries.

The values represented in the plot are the product of various statistics. If you just plot the raw data, you can think of each
point representing the identity statistic. Many bar charts represent the mean or median statistic. Histograms are bar charts where
the bars represent the binned count or density statistic.

Layer by Layer

There's a quick plotting function in ggplot2 called qplot() which is meant to be similar to the
plot() function from base graphics. You can do a lot with qplot(), but I think it's better to
approach the package from the layering syntax.

All ggplot2 plots begin with the function ggplot(). ggplot() takes two primary arguments:

data

The data frame containing the data to be plotted

aes()

The aesthetic mappings to pass on to the plot elements

As you can see, the second argument, aes(), isn't a normal argument, but another function. Since we'll never use
aes() as a separate function, it might be best to think of it as a special way to pass a list of arguments to the
plot.

The next step in creating a plot is to add one or more layers. Let's start with an example from the
ggplot2 book, with the mpg data set.

library(ggplot2)
?mpg
summary(mpg)

p <- ggplot(mpg, aes(displ, hwy))
p

If you just type p or print(p), you'll get back a warning saying that the plot lacks any layers (more recent versions of
ggplot2 draw an empty panel instead). With the ggplot() function, we've set up a plot which is going to draw from the mpg
data frame: the displ variable will be mapped to the x-axis, and the hwy variable is going to be mapped to the y-axis. However, we have not
determined which kind of geometric object will represent the data. Let's add points, for a scatterplot.

p + geom_point()

You add geometries to a plot with one of the geom_*() functions, using the + operator. To see a full
list of available geometries, look at the ggplot2 webpage under "Geoms". It's not necessary to assign
the ggplot() call to an object before adding geoms. This code will produce an equivalent result.

ggplot(mpg, aes(displ, hwy)) +
  geom_point()

Notice how we didn't pass any arguments to geom_point(). In order to map points to values on the x and y axes,
geom_point() needs to know what variables we're mapping to the x and y axes. It inherited this information from
ggplot(). The aesthetic settings from ggplot() can be overridden for any geom, and new aesthetics can
be defined within any geom, but these won't be passed on to any other geom.

The best way to demonstrate this is to make a few nonsensical plots. First, we'll create the same plot as above, but also connect all
the points with a line.

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_line()

Now, we're representing the x and y variables with points and a line, connecting all the points. This isn't a very meaningful plot for
this data.

Next, we'll color the points according to the number of cylinders in the engine, treating number of cylinders as a nominal
variable. We'll pass this color mapping to geom_point().
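A minimal sketch of that per-geom mapping, assuming the color is set with aes(color = factor(cyl)) inside geom_point(). Wrapping cyl in factor() is what makes ggplot2 treat the cylinder count as nominal; the object name p_points is only for illustration. Because the mapping lives in the point layer alone, geom_line() does not inherit it.

```r
library(ggplot2)

# Color is mapped inside geom_point() only, so the line stays a single color
p_points <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = factor(cyl))) +
  geom_line()
p_points
```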

If we pass the color mapping to geom_line() instead, the line is colored and the points are not. It's kind of hard to tell
from the plot, but lines of different colors are not connected. The legend also reflects the fact that the lines are colored.
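The line-only version might look like the sketch below, again assuming factor(cyl) as the color variable (p_lines is an illustrative name). Each cylinder count becomes its own group in the line layer, which is why the differently colored lines are not joined to one another.

```r
library(ggplot2)

# Color is mapped inside geom_line() only; each cylinder count forms its own
# group, so the line is split into one separately colored path per group
p_lines <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_line(aes(color = factor(cyl)))
p_lines
```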

Finally, we can pass the color mapping to ggplot(), meaning that geom_point(), geom_line(),
and any other added geom which has a color aesthetic will inherit that mapping.
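As a sketch, moving the same assumed mapping, color = factor(cyl), into ggplot() colors both layers (p_both is an illustrative name):

```r
library(ggplot2)

# Color is mapped in ggplot(), so the points and the line both inherit it
p_both <- ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point() +
  geom_line()
p_both
```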

Displaying Statistics

You'll frequently want to add statistical analyses to your plots, or your plots may just be of statistical summaries anyway.
ggplot2 has a few built in statistics to make plotting easier.

The most frequent statistic I use is a smoothing line with stat_smooth(). There are a number of different smoothing lines
you can add, from local regression lines (loess) to linear or logistic regressions. Let's start with the mpg data again.

p <- ggplot(mpg, aes(displ, hwy))
p + geom_point() + stat_smooth()

By default, stat_smooth() has added a loess line with the standard error represented by a semi-transparent ribbon.
You could also specify the method to use to add a different smoothing line.

p + geom_point() + stat_smooth(method = "lm")

library(MASS)
p + geom_point() + stat_smooth(method = "rlm")

Now, statistics are represented with default geometries. For stat_smooth(), its default geoms are
geom_ribbon() + geom_smooth(). You could also (inadvisedly) represent the output of the smoothing function
with points and errorbars.

p + stat_smooth(geom = "point") + stat_smooth(geom = "errorbar")

Geoms also have default statistics associated with them. For geom_point(), the default statistic is
stat_identity(), but we could also change that.
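As a sketch of swapping a geom's statistic, the example below draws one point per vehicle class at the class mean, instead of one point per row. The helper mean_only() is a hypothetical name introduced here, and the example assumes a ggplot2 version in which geom_point() forwards fun.data to the summary statistic.

```r
library(ggplot2)

# Hypothetical helper: returns the mean in the form stat_summary() expects
mean_only <- function(y) data.frame(y = mean(y))

# geom_point() normally uses stat = "identity"; here it uses the summary stat
p_means <- ggplot(mpg, aes(class, hwy)) +
  geom_point(stat = "summary", fun.data = mean_only)
p_means
```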

There exist some stats and geoms which have the same name. Adding either one to a plot will produce the same plot. Take
*_boxplot(). A boxplot produces a shape, and is therefore a particular geom. However, the parameters of each part
of a boxplot are determined by various statistics. The middle bar is the 50th percentile (the median), the bottom and top of the box are the
25th and 75th percentiles, etc. stat_boxplot() calculates these statistics, then passes them to geom_boxplot().
The one would be pretty useless without the other, so adding geom_boxplot() to a plot automatically calculates the boxplot
statistics, and adding stat_boxplot() to a plot automatically plots the calculated statistics as a boxplot.

ggplot(mpg, aes(class, hwy)) +
  stat_boxplot()

# equivalent to

ggplot(mpg, aes(class, hwy)) +
  geom_boxplot()

The same actually goes for stat_smooth() and geom_smooth().

p + stat_smooth()

#equivalent to

p + geom_smooth()

stat_summary()

One of the statistics, stat_summary(), is somewhat special, and merits its own discussion. stat_summary()
takes a few different arguments.

fun.y

A function to produce y aesthetics

fun.ymax

A function to produce ymax aesthetics

fun.ymin

A function to produce ymin aesthetics

fun.data

A function to produce a named vector of aesthetics.

You pass a function to each of these arguments, and ggplot2 will use the value returned by that function for the corresponding
aesthetic. If you pass a function to fun.data, you can compute many summary statistics and return them as a vector, where each
element in the vector is named for the aesthetic it should be used for.
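As a sketch, here is a custom fun.data function that returns the median together with the interquartile range. The name median_iqr() is hypothetical. It returns a one-row data frame rather than a bare named vector, since the current ggplot2 documentation describes fun.data as returning a data frame; recent releases also rename fun.y, fun.ymax and fun.ymin to fun, fun.max and fun.min.

```r
library(ggplot2)

# Hypothetical summary function: the median plus the 25th and 75th
# percentiles, each named for the aesthetic it should drive
median_iqr <- function(y) {
  q <- quantile(y, c(0.25, 0.5, 0.75), names = FALSE)
  data.frame(ymin = q[1], y = q[2], ymax = q[3])
}

# One point (median) with a range (IQR) per vehicle class
ggplot(mpg, aes(class, hwy)) +
  stat_summary(fun.data = median_iqr)
```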

It's not necessary to write our own functions to plot quantile ranges or confidence intervals, however. There are a few summary functions
from the Hmisc package which are reformatted for use in stat_summary(). They all return aesthetics for
y, ymax, and ymin.

mean_cl_normal()

Returns sample mean and 95% confidence intervals assuming normality.

mean_sdl()

Returns the sample mean and an interval around it given by the standard deviation times some constant.

mean_cl_boot()

Uses a bootstrap method to determine a confidence interval for the sample mean without assuming normality.
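A brief sketch of passing one of these helpers to stat_summary(). It assumes the Hmisc package is installed, since the helper calls into it when the plot is drawn. The errorbar geom, its width, and the name p_ci are illustrative choices.

```r
library(ggplot2)

# Mean with a normal-theory 95% confidence interval for each vehicle class.
# Building the object is cheap; drawing it is what requires Hmisc.
p_ci <- ggplot(mpg, aes(class, hwy)) +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.3) +
  stat_summary(fun.data = mean_cl_normal, geom = "point")
```

Typing p_ci or print(p_ci) then draws the plot.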

You can also use stat_summary() with a continuous x variable. The summary functions will be calculated over the y values at each
unique value of x. This immediately seems useful for eye-tracking data, but I would actually suggest calculating these summaries with
ddply(). For even a moderate amount of data, stat_summary() will take a while to calculate the summaries. This
could get to be aggravating when fine tuning aesthetics and options of the plot.
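A sketch of computing the summaries once and plotting the result. Base R's aggregate() is used here as a dependency-free stand-in for ddply(); the idea is the same, and the name hwy_means is hypothetical.

```r
library(ggplot2)

# Compute the mean of hwy for every unique value of displ, once, up front
hwy_means <- aggregate(hwy ~ displ, data = mpg, FUN = mean)

# Plot the pre-computed summaries; restyling the plot no longer recomputes them
ggplot(hwy_means, aes(displ, hwy)) +
  geom_line() +
  geom_point()
```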

Grouping

ggplot2 represents data as grouped, and draws geoms and calculates statistics according to these groupings. We've already
seen an example of this, where lines of different colors aren't connected. Groups of data can be defined in two ways: as combinations
of aesthetic settings, or explicitly with the argument group.

We mapped the color aesthetic to the number of cylinders in ggplot(). When we added points to the plot,
their color was set according to their color group. When we added the linear regression lines, a model was fit for each color group, and
the fit for each group was added, colored with the same group color, and clipped according to the range of data for that group.
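The plot described here can be sketched as follows, assuming the color variable is factor(cyl) and the regression lines come from stat_smooth(method = "lm"); p_grouped is an illustrative name.

```r
library(ggplot2)

# Color is mapped in ggplot(), so it defines the groups for both layers:
# points are colored by cylinder count, and a linear fit is computed
# separately within each color group
p_grouped <- ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm")
p_grouped
```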

If we had decided to map the cylinder grouping to the point shape, rather than the point color, the statistic still would be computed
over every subset.

We could also define a grouping which is only meaningful for geom_smooth() and not geom_point(). This will
cause each smoothing line to be calculated and appear separately, but the points will be undifferentiated.
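A sketch of a grouping that applies to the smoother only, assuming cyl as the grouping variable (p_smooth_groups is an illustrative name). The group aesthetic is set inside geom_smooth(), so the point layer never sees it.

```r
library(ggplot2)

# group is set only in the smoothing layer: fits are computed per cylinder
# count, while the points carry no grouping and are all drawn alike
p_smooth_groups <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(aes(group = cyl), method = "lm")
p_smooth_groups
```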

Sometimes it will be necessary to properly define the groups in your data in order to plot it. Here's another example from the
ggplot2 book.

library(nlme)
?Oxboys

ggplot(Oxboys, aes(age, height)) +
  geom_point()

What if we wanted to draw a line for every subject? Simply adding geom_line() will make a mess.

ggplot(Oxboys, aes(age, height)) +
  geom_line()

We need to define Subject as a grouping variable for drawing the lines. We could do this, like we did above, by defining
color = Subject. However, it's a little overkill to represent every subject with a unique color. First, knowing the
particular subject ID isn't necessarily relevant or informative for this plot. Given that, having separate lines for each subject should
sufficiently indicate the grouping. Having separate lines and unique colors for each line counts as needless redundancy.
Setting group = Subject instead draws one line per subject without assigning each a color.
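A minimal sketch of the explicit grouping, with group = Subject set in aes() (p_boys is an illustrative name):

```r
library(ggplot2)
library(nlme)  # provides the Oxboys data

# One line per boy, without spending the color aesthetic on Subject
p_boys <- ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_line()
p_boys
```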

Grouping also matters when each line is defined by more than one variable. In a plot of F1 and F2 measurements drawn with
linetype = variable and no grouping by word, the result is clearly wrong: we want to see lines for each word that was measured. Now, if we set
group = Word, we'll get an error. That's because we're actually plotting two lines for each word: F1 and F2. With just
group = Word, ggplot2 is going to draw one line for each word, which would involve mixing line types. Without
specifying linetype = variable, here's what the plot would look like with group = Word.

F1 and F2 are pretty well separated, so it's probably not necessary to distinguish them with different linetypes.

If you ever want to draw connected lines over a nominal variable, you must define group. Even if you uniquely
specify groups with aesthetic settings, you still need to define group. This is actually a good thing, because it will
force you to think about whether a connected line across the nominal factor will be meaningful.
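A small sketch of that rule, using mean highway mileage per vehicle class from mpg (class is a nominal variable). The data frame name class_means is hypothetical, and group = 1 is a common way of saying that every row belongs to the same line.

```r
library(ggplot2)

# Mean hwy per vehicle class; class is a nominal variable
class_means <- aggregate(hwy ~ class, data = mpg, FUN = mean)

# Without group, each class is its own group and there is nothing to connect.
# group = 1 puts every row in a single group, so one line joins the points.
p_connected <- ggplot(class_means, aes(class, hwy, group = 1)) +
  geom_point() +
  geom_line()
p_connected
```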

The Philadelphia counties data we looked at last week is a good example.

The data's all displayed, but not very readable. Educational attainment has a clear and meaningful order, if not magnitude. We could
add lines to this plot in a principled way. However, the following code should produce the same plot as above.

Faceting

A very useful kind of visualization technique is the small multiple. In a small multiple visualization, you create many of the same
plot for multiple subsets of the data. ggplot2 has two ways to create small multiples: facet_wrap()
and facet_grid().

facet_wrap() creates and labels a plot for every level of a factor which is passed to it. Its primary argument takes
the form of a one-sided formula: ~Factor. It will then try to efficiently "wrap" these plots into a 2D grid.
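A sketch of facet_wrap() on the mpg data, assuming cyl as the faceting variable and a linear smoother as the statistic (p_wrap is an illustrative name):

```r
library(ggplot2)

# One panel per cylinder count; the smoother is fit within each panel
p_wrap <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  stat_smooth(method = "lm") +
  facet_wrap(~cyl)
p_wrap
```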

Two things should be clear right off the bat. First, facets create further subsets for computing statistics over. Second, the x and y scales
of each plot are the same in each facet. This is something that can be toggled, but doing so will usually eliminate the usefulness of
creating a small multiple in the first place.

You can also facet by two variables using facet_grid(). Let's demonstrate with the tips dataset.
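A sketch of such a grid, assuming the tips data frame as shipped with the reshape2 package (columns total_bill, tip, sex and size, among others). The choice of faceting variables and the tip-fraction y-axis are illustrative guesses, not necessarily the ones behind the plot discussed here.

```r
library(ggplot2)
data(tips, package = "reshape2")  # assumption: tips comes from reshape2

# Rows are the bill payer's sex, columns are the size of the party
p_tips <- ggplot(tips, aes(total_bill, tip / total_bill)) +
  geom_point() +
  facet_grid(sex ~ size)
p_tips
```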

It looks like it was a male bill payer in a dinner party of two who tipped 70%. I'll leave all possible sociological analyses up to the reader.

Facet Scales

Usually you will want all of your facets to have the same x and y scales. If you're plotting the same data in each facet, having free
scales on each of the facets will ruin comparability across facets. However, sometimes it will be appropriate to have free scales. For instance,
when we plotted international data for men and women on a few different measures,
it was necessary to have free scales. I did this by passing scales = "free" to facet_wrap().

Income, LifeExpectancy, Literacy and Education are all measured on different scales with widely varying magnitude. If I hadn't
passed scales = "free" to facet_wrap(), the plot would have looked like this.

The income scale completely overwhelms the others.
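The effect is easy to reproduce on a small scale. The sketch below stacks three mpg variables that are measured on very different scales into one long data frame and facets them with and without free scales. The name long and the choice of variables are illustrative; this is not the international data discussed above.

```r
library(ggplot2)

# Stack three differently scaled measures into long form (columns: values, ind)
long <- stack(as.data.frame(mpg)[, c("displ", "cyl", "year")])

base <- ggplot(long, aes(values)) +
  geom_histogram(bins = 20)

# Fixed scales: the largest measure (year) dictates the x-axis of every panel
base + facet_wrap(~ind)

# Free scales: each panel gets axes suited to its own measure
base + facet_wrap(~ind, scales = "free")
```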

Sometimes, you'll only want one or the other scales to be free. To do this, pass "free_y" or "free_x" to
scales. I found a nice example for this recently on
this blog, looking at the amount of recovered oil (measured in barrels)
and gas (measured in millions of cubic feet) from the Deepwater Horizon.

For those interested in these numbers, I'd suggest listening to this On The
Media story about how the commonly reported volume of
spilled oil in the Exxon Valdez disaster was possibly drastically underestimated.

Scales

Every aesthetic which is mapped to the data expresses the magnitude of its value along some scale. You can adjust these scales
using the scale_*() functions.

Almost all scales have a common set of arguments.

name

The text label for the scale

limits

The maximum and minimum values to be included in the scale

breaks

The labeled breaks for the data

labels

Labels for the breaks

trans

Transformation to use on the data.

The function calls for various scales are formatted like this: scale_[NAME.OF.AESTHETIC]_[NAME.OF.SCALE]()

x and y scales

The most common scale adjustments I do are for the x and y scales. The most basic way to adjust the
x and y scales for continuous data is with scale_x_continuous() or
scale_y_continuous(). However, for most adjustments you would want to make to the x and y scales, there are also aliased
functions. That is, there are functions which are called something simpler and more descriptive, but really are just particular calls to
scale_*_continuous().
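As a sketch, here is the same x-axis adjustment written once with the full scale call and once with the aliases xlab() and xlim(). The axis label text and the object names are illustrative.

```r
library(ggplot2)

p_base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Full form: name, limits and breaks set in a single scale call
p_full <- p_base +
  scale_x_continuous(name = "Engine displacement (l)",
                     limits = c(1, 8), breaks = 1:8)

# Aliases: xlab() sets the name and xlim() the limits of the same scale
p_alias <- p_base + xlab("Engine displacement (l)") + xlim(1, 8)
```

The full form is the one to reach for when breaks or labels also need changing, since the aliases each set only one argument.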

An important thing to take into account is that adjustments to scales also transform or throw away data before statistics are computed. So for instance,
if you don't like that stat_summary() will plot data across the full range of the original data values, don't use
xlim() to zoom in. You're actually throwing away all the data outside of the range defined in xlim() and
calculating the summary over the resulting subset. Likewise, if you log transform a scale, smoothing lines will be fit to the log
transformed data, not the data on the original scale.
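The difference can be checked directly. The sketch below uses ylim() rather than xlim(), because x is a discrete variable here, but the same reasoning applies to either axis. For comparison it uses coord_cartesian(), a ggplot2 function not otherwise covered in this section, which zooms the view without discarding data. The helper mean_only() and the object names are hypothetical.

```r
library(ggplot2)

# Hypothetical helper: the group mean, in the form stat_summary() expects
mean_only <- function(y) data.frame(y = mean(y))

base <- ggplot(mpg, aes(class, hwy)) +
  stat_summary(fun.data = mean_only, geom = "point")

# Scale limits: rows with hwy outside 20-30 are discarded before the mean
limited <- base + ylim(20, 30)

# Coordinate limits: the mean uses every row; only the visible window changes
zoomed <- base + coord_cartesian(ylim = c(20, 30))

# Extract the plotted means; the two sets differ wherever rows were discarded
means_limited <- suppressWarnings(layer_data(limited)$y)
means_zoomed <- layer_data(zoomed)$y
```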

color and fill Scales

The second most common scale adjustment I use is to the color and fill aesthetics.

Usually, you map either continuous or discrete data to colors in a plot. The default scale for discrete data is
scale_color_hue(). One useful adjustment you can make to scale_color_hue() is to
name, which adjusts the legend title for color.

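# assumed definition of p: points colored by number of cylinders
p <- ggplot(mpg, aes(displ, hwy, color = factor(cyl))) + geom_point()
p + scale_color_hue(name = "Cylinders")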

Some people don't like the default discrete colors. With scale_color_brewer() you can set the color palette to
one of the RColorBrewer palettes. To see the possible options:

library(RColorBrewer)
display.brewer.all()

I personally like Set1 for qualitative differences.

p + scale_color_brewer(palette = "Set1")

However, for this data we should probably consider one of the sequential palettes. The number of cylinders in an engine is an
ordered variable after all.

p + scale_color_brewer(palette = "Blues")

p + scale_color_brewer(palette = "OrRd")

The range of possibilities with continuous color variables is huge. The default continuous color scale is
scale_color_gradient(). This scale produces a smooth interpolation from the color indicated at the bottom of the
scale to the color indicated at the top of the scale. By default this is a gradient from blue to red.
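A sketch of a continuous color scale, assuming city mileage (cty) as the continuous variable; p_grad is an illustrative name. Setting low and high explicitly avoids relying on the default endpoints, which have changed between ggplot2 versions.

```r
library(ggplot2)

# City mileage (continuous) mapped to color, with explicit gradient endpoints
p_grad <- ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() +
  scale_color_gradient(low = "blue", high = "red")
p_grad
```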