13 Graphing with ggplot2

We’ve made some simple graphs earlier. In this lesson we will use the package ggplot2 to make simple, elegant-looking graphs.

The ‘gg’ part of ggplot2 stands for ‘grammar of graphics’, the idea that most graphs can be made using the same few ‘pieces’. We’ll get into those pieces during this lesson. For a useful cheatsheet for this package see here.

install.packages("ggplot2")

library(ggplot2)

When working with new data, it’s often useful to quickly graph the data to understand what you’re working with. Graphing also helps you judge how much to trust the data.

The data we will work with covers alcohol consumption in U.S. states from 1977-2016 and comes from the National Institutes of Health. It contains the per capita alcohol consumption for each state for every year. More details on the data are available here.

Their method to determine per capita consumption is amount of alcohol sold / number of people aged 14+ living in the state. We’ll return to this method at the end to discuss how much we trust the data.

Now we need to load the data.

load("data/apparent_per_capita_alcohol_consumption.rda")

The name of the object we just loaded is quite long, so for convenience let’s copy it to a new object with a better name.

alcohol <- apparent_per_capita_alcohol_consumption

The original data has every state, region, and the US as a whole. For this lesson we’re using a subset of the data that includes only states. For now let’s just look at Pennsylvania.

penn_alcohol <- alcohol[alcohol$state == "pennsylvania", ]

13.1 What does the data look like?

Before graphing, it’s helpful to see what the data includes. An important thing to check is what variables are available and the units of these variables.
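A quick way to do that check is sketched below. So that it runs on its own, the code builds a tiny stand-in for penn_alcohol with made-up values and only three columns; the real data has more columns and more years, so with the real data you would skip the first step.

```r
# Hypothetical stand-in for penn_alcohol (made-up values, fewer columns
# than the real data), used only so this example runs by itself.
penn_alcohol <- data.frame(
  state            = "pennsylvania",
  year             = c("2014", "2015", "2016"),
  number_of_beers  = c(291, 288, 285),
  stringsAsFactors = FALSE
)

# Which variables are available?
names(penn_alcohol)

# What do the first few rows look like?
head(penn_alcohol)

# What type is each column (character, numeric, etc.)?
str(penn_alcohol)
```

names() and head() tell you what is in the data, while str() also shows the type of each column, which matters once you start graphing.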

So each row of the data is a single year of data for Pennsylvania. It includes alcohol consumption for wine, liquor, beer, and total drinks - both as gallons of ethanol (a hard unit to interpret) and more traditional measures such as glasses of wine. The original data only included the gallons of ethanol data which I converted to the more understandable units. If you encounter data with odd units, it is a good idea to convert it to something easier to understand - especially if you intend to show someone else the data or results!

13.2 Graphing data

To make a simple plot using ggplot(), all you need to do is specify the data set and the variables you want to plot. From there you add on pieces of the graph using the + symbol and then specify what you want added.

For ggplot() we need to specify 4 things:

The data set - this is the very first thing you’ll write

The x-axis variable

The y-axis variable

The type of graph - e.g. line, point, etc.

Some useful types of graphs are:

geom_point() - A point graph, can be used for scatter plots

geom_line() - A line graph

geom_smooth() - Adds a smoothed trend line to the graph (a straight regression line if you set method = "lm")

geom_bar() - A barplot

13.3 Time-Series Plots

Let’s start with a time-series of beer consumption in Pennsylvania. In time-series plots the x-axis is always the time variable while the y-axis is the variable whose trend over time is what we’re interested in. When you see a graph showing crime rates over time, this is the type of graph you’re looking at.

The code below starts by writing our data set name. Then it says what our x- and y-axis variables are called. The x- and y-axis variables go within the parentheses of the function called aes(). aes() stands for aesthetic, and what’s included inside it describes how the graph will look. It’s not intuitive to remember, but you need to include it.

ggplot(penn_alcohol, aes(x = year,
y = number_of_beers))

Note that on the x-axis it prints out every single year and makes it completely unreadable. That is because the “year” column is a character type, so R thinks each year is its own category. It prints every single year because it thinks we want every category shown. To fix this we can make the column numeric and ggplot() will be smarter about printing fewer years.

penn_alcohol$year <- as.numeric(penn_alcohol$year)

ggplot(penn_alcohol, aes(x = year,
y = number_of_beers))

When we run it we get our graph. It includes the variable names for each axis and shows the range of data through the tick marks. What is missing is the actual data. For that we need to specify what type of graph it is. We literally add it with the + followed by the type of graph we want. Make sure that the + is at the end of a line, not the start of one. Starting a line with the + will not work.
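As a minimal sketch of that step, the code below adds geom_line() to the plot. So that it runs on its own, it first builds a small stand-in for penn_alcohol with made-up values; with the real data you would skip that part.

```r
library(ggplot2)

# Hypothetical stand-in for penn_alcohol (made-up values),
# used only so this example runs by itself.
penn_alcohol <- data.frame(
  year            = 2012:2016,
  number_of_beers = c(300, 296, 291, 288, 285)
)

# The + sits at the END of the first line, then the graph type follows.
p <- ggplot(penn_alcohol, aes(x = year,
                              y = number_of_beers)) +
  geom_line()
print(p)
```

Swapping geom_line() for geom_point() draws a point for each year instead, and you can add both to get a line with points on top.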

It looks like there’s a huge change in beer consumption over time. But look at where the y-axis starts. It starts around 280, so really that change is only about 60 beers. When graphs don’t start at 0, small changes appear big. We can fix this by forcing the y-axis to begin at 0. We can add expand_limits(y = 0) to the graph to say that the value 0 must always appear on the y-axis, even if no data is close to that value.
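Here is a short sketch of adding expand_limits(y = 0) to a line graph. The data frame is a made-up stand-in for penn_alcohol so the example runs by itself.

```r
library(ggplot2)

# Hypothetical stand-in for penn_alcohol (made-up values).
penn_alcohol <- data.frame(
  year            = 2012:2016,
  number_of_beers = c(300, 296, 291, 288, 285)
)

# expand_limits(y = 0) forces the value 0 to appear on the y-axis.
p <- ggplot(penn_alcohol, aes(x = year,
                              y = number_of_beers)) +
  geom_line() +
  expand_limits(y = 0)
print(p)
```

With the axis starting at 0, the same change over time looks much flatter than before.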

Some other useful features are changing the axis labels and the graph title. Unlike in plot(), we do not include these inside the () of ggplot(). Instead, each one has its own function that we add to the graph with +.
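One way to do this, sketched below, is with the ggplot2 functions xlab(), ylab(), and ggtitle(). The label text and the stand-in data are made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for penn_alcohol (made-up values).
penn_alcohol <- data.frame(
  year            = 2012:2016,
  number_of_beers = c(300, 296, 291, 288, 285)
)

# Each label gets its own function, added with +.
p <- ggplot(penn_alcohol, aes(x = year,
                              y = number_of_beers)) +
  geom_line() +
  expand_limits(y = 0) +
  xlab("Year") +
  ylab("Number of Beers") +
  ggtitle("Beer Consumption in Pennsylvania")
print(p)
```

The labs() function can set all three at once if you prefer a single call, e.g. labs(x = "Year", y = "Number of Beers", title = "Beer Consumption in Pennsylvania").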

A scatterplot of liquor consumption against beer consumption shows us that when liquor consumption increases, beer consumption also usually increases.

While scatterplots can help show the relationship between variables, we lose the information of how consumption changes over time.
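A scatterplot of this kind can be sketched with geom_point(), putting one variable on each axis. The liquor column name used here, number_of_shots_liquor, is an assumption, as are the made-up values in the stand-in data; check names(penn_alcohol) for the real column name.

```r
library(ggplot2)

# Hypothetical stand-in for penn_alcohol (made-up values).
# The column name number_of_shots_liquor is an assumption.
penn_alcohol <- data.frame(
  year                   = 2012:2016,
  number_of_beers        = c(285, 288, 291, 296, 300),
  number_of_shots_liquor = c(120, 122, 125, 127, 130)
)

# Each point is one year: liquor on the x-axis, beer on the y-axis.
p <- ggplot(penn_alcohol, aes(x = number_of_shots_liquor,
                              y = number_of_beers)) +
  geom_point()
print(p)
```

Notice that year does not appear anywhere in the plot, which is why the information about change over time is lost.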

Many time-series plots show multiple variables over the same time period (e.g. murder and robbery over time). There are ways to change the data itself to make creating graphs like this easier, but let’s stick with the data we currently have and just change ggplot().
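One way to do this without reshaping the data, sketched here, is to add a second geom_line() that has its own aes(). The liquor column name number_of_shots_liquor and the values in the stand-in data are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for penn_alcohol (made-up values).
# The column name number_of_shots_liquor is an assumption.
penn_alcohol <- data.frame(
  year                   = 2012:2016,
  number_of_beers        = c(300, 296, 291, 288, 285),
  number_of_shots_liquor = c(120, 122, 125, 127, 130)
)

# The first geom_line() uses the variables given in ggplot().
# The second geom_line() gets its own aes() with a different y variable.
p <- ggplot(penn_alcohol, aes(x = year,
                              y = number_of_beers)) +
  geom_line() +
  geom_line(aes(x = year,
                y = number_of_shots_liquor))
print(p)
```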

A problem with this is that both lines are the same color. We need to set a color for each line, and do so within aes(). Instead of providing a color name, we need to provide the name the color will have in the legend. Do so for both lines.
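A sketch of that change is below. The legend names "Beer" and "Liquor" are our own choice, and the liquor column name and the data values are assumptions made so the example runs by itself.

```r
library(ggplot2)

# Hypothetical stand-in for penn_alcohol (made-up values).
# The column name number_of_shots_liquor is an assumption.
penn_alcohol <- data.frame(
  year                   = 2012:2016,
  number_of_beers        = c(300, 296, 291, 288, 285),
  number_of_shots_liquor = c(120, 122, 125, 127, 130)
)

# color inside aes() is the NAME shown in the legend, not a color name.
# ggplot() then picks a different color for each name.
p <- ggplot(penn_alcohol, aes(x = year,
                              y = number_of_beers,
                              color = "Beer")) +
  geom_line() +
  geom_line(aes(x = year,
                y = number_of_shots_liquor,
                color = "Liquor"))
print(p)
```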

13.4 Color blindness

Please keep in mind that some people are color blind, so graphs (or maps, which we will learn about soon) will be hard for them to read if we choose colors poorly. A helpful site for choosing colors for graphs is colorbrewer2.org.

This site lets you select which type of colors you want (sequential and diverging, such as shades in a hotspot map, and qualitative, such as for data like what we used in this lesson). In the “Only show:” section you can set it to “colorblind safe” to restrict it to colors that allow people with color blindness to read your graph. To the right of this section it shows the HEX codes for each color (a HEX code is just a code that a computer can read and know exactly which color it is).

Let’s use an example of a color blind friendly color from the “qualitative” section of ColorBrewer. We have three options on this page (we can change how many colors we want but it defaults to showing 3): green (HEX = #1b9e77), orange (HEX = #d95f02), and purple (HEX = #7570b3). We’ll use the orange and purple colors. To manually set colors in ggplot() we use scale_color_manual(values = c()) and include a vector of color names or HEX codes inside the c(). Doing that using the orange and purple HEX codes will change our graph colors to these two colors.
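Putting that together, a minimal sketch looks like this. The two HEX codes are the orange and purple from above; the liquor column name, the legend names, and the data values are assumptions made so the example runs by itself.

```r
library(ggplot2)

# Hypothetical stand-in for penn_alcohol (made-up values).
# The column name number_of_shots_liquor is an assumption.
penn_alcohol <- data.frame(
  year                   = 2012:2016,
  number_of_beers        = c(300, 296, 291, 288, 285),
  number_of_shots_liquor = c(120, 122, 125, 127, 130)
)

# scale_color_manual() assigns the colors in order of the legend names:
# "Beer" gets orange (#d95f02) and "Liquor" gets purple (#7570b3).
p <- ggplot(penn_alcohol, aes(x = year,
                              y = number_of_beers,
                              color = "Beer")) +
  geom_line() +
  geom_line(aes(x = year,
                y = number_of_shots_liquor,
                color = "Liquor")) +
  scale_color_manual(values = c("#d95f02", "#7570b3"))
print(p)
```

If you want to be explicit about which line gets which color, you can name the values instead, e.g. values = c("Beer" = "#d95f02", "Liquor" = "#7570b3").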