
Assumptions

This lesson makes a few assumptions about your understanding of R. If
you have not completed the R Basics with Tabular
Data
lesson, I suggest you complete that first. Having a background in
another programming language will also be beneficial. If you need a
place to start, I recommend working through the Programming
Historian’s excellent Python tutorials.

Lesson Goals

By the end of this lesson, you will:

Know how to organize data to be “tidy” and why this is important.

Understand the dplyr package and use it to manipulate and wrangle
data.

Become acquainted with the pipe operator in R and observe how it can
assist you in creating more readable code.

Learn to work through some basic examples of data manipulation to
gain a foundation in exploratory data analysis.

Introduction

Data you find “in the wild” will rarely be in a format necessary for
analysis, and you will need to manipulate it before exploring the
questions you are interested in. This may take more time than doing the
analysis itself! In this tutorial, we will learn some basic techniques
for manipulating, managing, and wrangling our data in R.
Specifically, we will rely on the philosophy of “tidy
data” as articulated by
Hadley Wickham.

According to Wickham, data is “tidy” when it meets
three key criteria:

Each observation is in a row.

Each variable is in a column.

Each value has its own cell.

Keeping these criteria in mind allows us to recognize when data is
organized or unorganized. It also provides us with a standardized schema and
set of tools for cleaning up some of the most common ways that datasets
are “messy:”

Column headers are values, not variable names.

Multiple variables are stored in one column.

Variables are stored in both rows and columns.

Multiple types of observational units are stored in the same table.

A single observational unit is stored in multiple tables.

Perhaps most importantly, keeping our data in this format allows us to
use a collection of packages in the
“tidyverse,” which are designed to specifically
work with tidy data. By making sure that our input and output are tidy,
we only have to use a small set of tools to solve a large number of
questions. In addition, we can combine, manipulate, and split tidy
datasets as we see fit.

In this tutorial, we will be focusing on the
dplyr
package of the tidyverse, but it is worth briefly mentioning some others
we will be running into:

magrittr–This package gives us
access to the forward pipe operator (%>%) and makes our code easier to read.
ggplot2–This package utilizes the
“Grammar of Graphics”
to provide an easy way to visualize our data.
readr–This package makes available a
faster and more streamlined method of importing rectangular data, such
as csv files.
tibble–This package provides us
access to a reconceptualization of data frames that are easier to work
with and print.

If you have not already done so, you should install and load the
“tidyverse” before beginning. In addition, make sure that you have the
latest version of R and the latest version
of RStudio
installed for your respective platform.

Copy the following code into RStudio. To run it, you need to highlight
the lines and press Ctrl+Enter (Command+Enter on Mac OS):

# Install tidyverse libraries and load it
# Do not worry if this takes a while
install.packages("tidyverse")
library(tidyverse)

An Example of dplyr in Action

Let’s go through an example to see how dplyr can aid us as historians by
inputting U.S. decennial census data from 1790 to 2010. Download the
data by clicking
here
and place it in the folder that you will use to work through the examples
in this tutorial.

Since the data is in a csv file, we are going to use the read_csv()
command in tidyverse’s
readr
package.

The read_csv() function takes the path of the file we want to import as
its first argument, so make sure that you have set the path correctly.

# Import CSV File and save to us_state_populations_import
# Make sure you set the path of the file correctly
us_state_populations_import <- read_csv("introductory_state_example.csv")

After you import the data, you will notice that there are three columns:
one for the population, one for the year, and one for the state. This
data is already in a tidy format, giving us a multitude of options for
further exploration.

For this example, let’s visualize the population growth of California and
New York to gain a better understanding of Western migration. We will
use dplyr to filter our data so that it only contains information about
the states we are interested in, and we will use ggplot2 to visualize
this information. This exercise is just to provide you a taste of what
dplyr can do, so don’t worry if you don’t understand the code at this
time.

# Filter to California and New York states only
california_and_new_york_state_populations <- us_state_populations_import %>%
  filter(state %in% c("California", "New York"))
# Plot California and New York State Populations
ggplot(data = california_and_new_york_state_populations, aes(x = year, y = population, color = state)) +
  geom_line() +
  geom_point()

Graph of California and New York population

As we can see, the population of California has grown considerably
compared to New York. While this particular example may seem obvious
given the history of U.S. migration, the code itself provides us a
foundation that we can build on to ask a multitude of similar questions.
For instance, with a quick change of code, we can create a similar graph
with two different states such as Mississippi and Virginia.

Quickly making changes to our code and reanalyzing our data is a
fundamental part of exploratory data analysis (EDA). Rather than trying
to “prove” a hypothesis, exploratory data analysis helps us understand
our data better and ask questions about it. For historians, EDA provides
an easy means of knowing when to dig deeper into a subject and when to
step back, and it is an area where R excels.

Pipe Operator

Before looking at dplyr, we need to go over the pipe operator (%>%)
in R since we will often run into it in our examples. As mentioned
earlier, the pipe operator is part of the
magrittr
package created by Stefan Milton Bache and
Hadley Wickham and is included in the tidyverse.
Its name is an homage to the surrealist painter René Magritte, whose “The
Treachery of Images” famously depicted a pipe with the French words “Ceci
n’est pas une pipe” (“this is not a pipe”) underneath.

The pipe operator allows you to pass what is on the left of the pipe as
the first argument to the function on the right. Although it may
seem strange at first, once you learn it, you will find that it makes
your code more readable by avoiding nested statements. Don’t worry if
all this is a little confusing right now. It will become more clear as
we go through the examples.

Let’s say that we are interested in getting the square root of each
population value and then summing all the square roots before getting
the mean. Obviously, this isn’t a useful measurement, but it demonstrates
just how quickly R code can become difficult to read. Normally, we would
nest such statements:
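The nested version of this calculation, reconstructed here to match the step-by-step code below, would look something like this:

```r
# All three steps nested on a single line: sqrt() runs first,
# then sum(), then mean() (read from the inside out)
mean(sum(sqrt(us_state_populations_import$population)))
```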

As you can see, with enough nested commands, it is hard to keep track of
how many parentheses you need, and the code becomes awkward to read. To mitigate
this, some people may create temporary vectors in between each function
call.

# Get the square root of all the state populations
sqrt_state_populations_vector <- sqrt(us_state_populations_import$population)
# Get the sum of all the square roots in the temporary variable
sum_sqrt_state_populations_vector <- sum(sqrt_state_populations_vector)
# Get the mean of the temporary variable
mean_sum_sqrt_state_populations_vector <- mean(sum_sqrt_state_populations_vector)
# Display the mean
mean_sum_sqrt_state_populations_vector
## [1] 1256925

Although you get the same answer, this is a lot more readable. However,
it can quickly clutter your workspace if you forget to delete the
temporary vectors. The pipe operator does all this for you. Here is the
same code using the pipe operator.
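Assuming the same us_state_populations_import data frame, the piped version reads top to bottom rather than inside out:

```r
# Each result is passed to the next function as its first argument
us_state_populations_import$population %>%
  sqrt() %>%
  sum() %>%
  mean()
## [1] 1256925
```

No temporary variables are created, and the order of the functions on the page matches the order of the operations.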

We Need a New Dataset

Now that we have an understanding of the pipe operator, we are ready to
begin looking at and wrangling with some data. Unfortunately, for
historians, there are only a few easily available datasets–perhaps you
can help change this by making yours available to the public! We are
going to rely on the historydata
package created by Lincoln Mullen.

This package contains samples of historical datasets–the earlier U.S.
Census data sample was taken from this package. Throughout this
tutorial, we are specifically going to work with the early_colleges
dataset, which contains data about colleges founded before 1848. Let’s
start by loading the data and viewing it.
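A minimal sketch of loading the package and printing the dataset (it assumes you have installed historydata):

```r
# Install the package first if you have not already done so:
# install.packages("historydata")
library(historydata)

# Typing the name of the dataset prints it to the console
early_colleges
```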

As you can observe, this dataset contains the current name of the college, its
original name, the city and state where it was founded, when the college was
established, and its sponsorship. As we discussed earlier, before we can
work with a dataset, it is important to think about how to organize the
data. Let’s see if any of our data is not in a “tidy” format. Do you see
any cells that do not match the three criteria for tidy data?

If you guessed the sponsorship of Harvard, you are correct. In addition
to noting the original sponsorship, it also mentions that it changed
sponsorship in 1805. Usually, you want to keep as much information about
your data that you can, but for the purposes of this tutorial, we are
going to change the column to only have the original sponsorship.
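One way to make this change (a sketch; it assumes Harvard is the first row of the dataset, as it is when the data is ordered by founding year, and that its original sponsorship was Congregational):

```r
# Overwrite Harvard's sponsorship cell, keeping only the original sponsorship
early_colleges[1, "sponsorship"] <- "Congregational"
early_colleges
```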

Now that we have our data in a tidy format, we can shape it through the
dplyr package.

What is Dplyr?

Dplyr
is another part of the tidyverse that provides functions for
manipulating and transforming your data. Because we are keeping our data
“tidy,” we only need a small set of tools to explore our data. Compared
to base R, using dplyr is often faster, and guarantees that if our
input is tidy then our output will also be tidy. Perhaps most importantly,
dplyr makes our code easier to read and utilizes “verbs” that are, in
most cases, intuitive. Each function in dplyr corresponds to these verbs,
with the five key ones being filter, select, arrange, mutate, and
summarise–dplyr uses the British spelling. Let’s go through each of them
individually to see how they work in practice.

Select

If we look at the early_colleges data, we can observe that there are a
lot of NA’s in the original names column. NA signifies that the data is
not available, and we may want to view our data with this column
removed. dplyr’s select() function gives us the ability to do this. It
takes the data frame you want to manipulate as the first argument, followed by a
list signifying which columns you would like to keep:
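For example, to drop original_name and keep the remaining columns (a sketch assuming the column names shown when early_colleges is printed):

```r
# Keep every column except original_name
early_colleges %>%
  select(college, city, state, established, sponsorship)
```

dplyr also provides a shorthand for dropping a column by name: select(-original_name).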

Filter

The filter() function does for rows what the select() function does for
columns: rather than choosing columns by name, we use it to keep only
the rows that pass a logical test. For instance, we can view all the
colleges that existed before the turn of the century.
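Reading “the turn of the century” as the year 1800, the test might look like this:

```r
# Keep only the rows whose established year is 1800 or earlier
early_colleges %>%
  filter(established <= 1800)
```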

Mutate

The mutate() function allows you to add a column to your data frame. Right
now, we have the city and state in two separate columns. We can use the
paste() function to combine two strings and specify a separator. Let’s place
them in a single column called “location.”
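A sketch of combining the two columns with paste(), using a comma and a space as the separator:

```r
# Add a new "location" column built from city and state
early_colleges %>%
  mutate(location = paste(city, state, sep = ", "))
```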

Keep in mind that dplyr does not modify the original data. Instead, it
creates a new temporary data frame at each step. If you want to keep the
result, you need to assign it to a variable.
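To keep the result, assign it to a new variable (the name here is only illustrative):

```r
# Save the mutated data frame under a new (hypothetical) name
early_colleges_with_location <- early_colleges %>%
  mutate(location = paste(city, state, sep = ", "))
```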

Arrange

The arrange() function allows us to reorder the rows of our data frame.
Currently, the colleges are sorted by year of establishment in ascending
order. Let’s place them in descending order instead, starting with the
colleges founded closest to the end of the Mexican-American War.
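A sketch of the descending sort:

```r
# Sort from the most recently founded college to the earliest
early_colleges %>%
  arrange(desc(established))
```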

Summarise

The last key function in dplyr is summarise()–note the British
spelling. summarise() takes a function or operation and is usually used
to create a data frame that contains summary statistics for plotting. We
will use it to calculate the average founding year of the colleges
established before 1848.
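Since every college in the dataset was founded before 1848, a sketch of the calculation is:

```r
# Average founding year across all colleges in the dataset
early_colleges %>%
  summarise(mean(established))
```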

Putting it All Together

Now that we have gone through the five main verbs for dplyr, we can use
them to create a quick visualization of our data. Let’s go ahead and
create a bar graph showing the number of secular and non-secular
colleges founded before the U.S. War of 1812:
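One possible way to put the verbs together (a sketch; it assumes that secular colleges are recorded with the value "Secular" in the sponsorship column, and the variable name is only illustrative):

```r
# Label each college founded before 1812 as secular or not,
# then count the two groups in a bar graph
secular_colleges_before_1812 <- early_colleges %>%
  filter(established < 1812) %>%
  mutate(is_secular = ifelse(sponsorship == "Secular", "Secular", "Not Secular"))

ggplot(secular_colleges_before_1812) +
  geom_bar(aes(x = is_secular, fill = is_secular)) +
  labs(x = "Is the college secular?")
```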

Conclusion

This tutorial should put you well on the way to thinking about how to
organize and manipulate your data in R. Later, you will probably want to
graph your data in some way. I recommend that you begin looking at the
ggplot2 package for a set of tools that work
well with dplyr. In addition, you may want to examine some of the
other functions that come with dplyr to hone your skills. Either way,
this should provide a good foundation to build on and cover a lot of
the common problems you will encounter.

About the author

Nabeel Siddiqui is a doctoral candidate in American Studies at the College of William and Mary, where he is also an inaugural fellow of the Digital Equality Lab. His research focuses broadly on the relationship between the humanities and computation, with a particular expertise on the cultural history of media, new media studies, digital humanities, information studies, and the history of science and technology. Currently, he is completing his dissertation entitled "Byting Out the Public," which looks at the historical relationship between the personal computer and the private sphere.