A new data processing workflow for R: dplyr, magrittr, tidyr, ggplot2

Over the last year I have changed my data processing and manipulation workflow in R dramatically. Thanks to some great new packages like dplyr, tidyr and magrittr (as well as the less-new ggplot2) I've been able to streamline code and speed up processing. Up until 2014, I had used essentially the same R workflow (aggregate, merge, apply/tapply, reshape and so on) for more than 10 years. I added a few improvements over the years in the form of functions from the packages doBy, reshape2 and plyr, and I also flirted with data.table, which I found much faster for big datasets but difficult to work with because of its syntax. Still, the basic flow remained remarkably similar. Until now…

Given how much I've enjoyed the speed and clarity of the new workflow, I thought I would share a quick demonstration.

In this example, I am going to grab data from a sample SQL database provided by Google via Google BigQuery and then give examples of manipulation using dplyr, magrittr and tidyr (and ggplot2 for visualization).

0) Install the packages!

In the examples, I am using the most up-to-date development versions of the packages, all of which come from GitHub. The links here provide instructions: dplyr (v0.4), magrittr (v1.5) and tidyr (v0.2). I also installed the latest version of bigrquery (v0.1). You should have ggplot2 installed as well (I'm using v1.0). For the most part, the versions on CRAN will work fine, but a few new features will not be available to you.

1) Grab sample data from Google’s BigQuery

The package dplyr has several great functions for connecting directly to external databases; we discuss this functionality in our previous post on dplyr. One of them, src_bigquery, is designed to extract data from Google's BigQuery and uses code from the package bigrquery. I found that running the raw bigrquery functions (query_exec in particular) worked more smoothly, so that's what I use below. Google has several nice sample tables to choose from. We will use the word index for the works of Shakespeare.

If you would rather skip the step of extracting data from Google, I've placed the full dataset on GitHub (warning: it's relatively big, ~6 MB). If you want to follow along exactly as you see below, you will need an account with Google; you can find details on how to set one up (it takes 5 minutes) here.

In this example, I write the SQL query as usual and execute it using the function query_exec. The 'dark-mark-818' is the name of my project on Google; your project name will be different.
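A minimal sketch of that step is below. The table publicdata:samples.shakespeare is Google's public sample; the query_exec call is shown commented out because it requires the bigrquery package plus a Google account with BigQuery access, and the exact argument order has shifted between bigrquery versions.

```r
# SQL against Google's public Shakespeare word-index sample table
sql <- "SELECT word, word_count, corpus, corpus_date
        FROM [publicdata:samples.shakespeare]"

# Requires bigrquery and Google credentials; 'dark-mark-818' is my
# project ID -- substitute your own before running:
# library(bigrquery)
# shakespeare <- query_exec(sql, project = "dark-mark-818")
```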

When you run the query_exec() function you will be asked whether you want to cache your Google authorization credentials. Choose “Yes” by typing “1”. Your browser will open and you will need to click the “Accept” button, after which the query will be processed. In my case, this took 34.7 seconds and returned a table with the columns word, word_count, corpus and corpus_date.

You can see that this is a relatively simple table. Now we have the data we're ready to use the magic of dplyr, tidyr and magrittr.

2) dplyr: tools for working with data frames

There are multiple instances of words due to differences in case (lower, upper and proper case) and this gives us a very un-sexy introduction to the use of dplyr. Let's take a quick look at the repetition using one word which we know occurs frequently:

Here we use the filter verb to extract records and, yes, words are repeated; we need to do something about that. We will aggregate by lower-case word, corpus and corpus date, summing all the instances of each word using dplyr's functions.
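A sketch of that aggregation with mutate, group_by and summarize (again with toy data standing in for the real table):

```r
library(dplyr)

shakespeare <- data.frame(
  word        = c("Henry", "henry", "HENRY", "the"),
  word_count  = c(5, 2, 1, 100),
  corpus      = c("kinghenryv", "kinghenryv", "kinghenryv", "hamlet"),
  corpus_date = c(1599, 1599, 1599, 1600),
  stringsAsFactors = FALSE
)

# lower-case the words, then sum counts within word/corpus/corpus_date
shakespeare <- mutate(shakespeare, word = tolower(word))
grp         <- group_by(shakespeare, word, corpus, corpus_date)
shakespeare <- summarize(grp, word_count = sum(word_count))
```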

Better: each occurrence of “henry” is now in a different work of Shakespeare (no repetition). Now let's use dplyr to take a quick look at the most and least popular words by computing the total number of times each word occurs across all of Shakespeare's works. We will use dplyr at its most basic to start, grouping by word, summing occurrences (total), counting the number of Shakespeare works each word occurs in (count) and arranging by total occurrences (in descending order).
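Step by step, with interim objects, that might look like the sketch below (the toy table stands in for the cleaned shakespeare data):

```r
library(dplyr)

# toy stand-in for the de-duplicated shakespeare table
shakespeare <- data.frame(
  word       = c("henry", "henry", "the", "the", "the"),
  word_count = c(8, 3, 100, 80, 90),
  corpus     = c("kinghenryv", "kinghenryiv", "hamlet", "macbeth", "othello"),
  stringsAsFactors = FALSE
)

grp        <- group_by(shakespeare, word)          # group by word
cnts       <- summarize(grp,
                        count = n(),               # works the word occurs in
                        total = sum(word_count))   # total occurrences
word.count <- arrange(cnts, desc(total))           # most common words first
```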

OK, this worked like a charm and, in my opinion, is cleaner than using base functions (getting variable names right with aggregate, for example, is a pain), but there are a lot of unnecessary keystrokes and we create some unnecessary interim objects (grp, cnts). Let's try a different way.

3a) magrittr: streamline your code with basic piping

Piping is built into dplyr. Originally the author of dplyr (Hadley Wickham) used piping of the form %.% but, starting with version 0.2, he adopted piping using magrittr (%>%). Note, though, that the new pipe %<>% is not in version 0.4 of dplyr so I load the package magrittr separately below. We can use pipes to re-write the code like this:
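The same grouping, summarizing and sorting, re-written as one pipeline, might look like this (toy data again stands in for the real table):

```r
library(dplyr)  # dplyr re-exports magrittr's %>% pipe

shakespeare <- data.frame(
  word       = c("henry", "henry", "the", "the", "the"),
  word_count = c(8, 3, 100, 80, 90),
  corpus     = c("kinghenryv", "kinghenryiv", "hamlet", "macbeth", "othello"),
  stringsAsFactors = FALSE
)

# each step's result is piped straight into the next verb -- no interim
# objects, no repeated table names
word.count <- shakespeare %>%
  group_by(word) %>%
  summarize(count = n(), total = sum(word_count)) %>%
  arrange(desc(total))
```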

Much simpler! The basic idea is that the result from one operation gets piped to the next operation. This saves us the pain of creating interim objects and even saves us from having to give the interim pieces a name. See above where instead of:

summarize(grp, count = n(), total = sum(word_count))

we simply write

summarize(count = n(), total = sum(word_count))

We can do this because the pipe passes the interim object straight to the summarize function.

3b) magrittr: streamline more using the compound assignment pipe

You see above that the most common words in Shakespeare are dull (e.g., the, I, and, to, of). So let's see if the words might be more interesting if we limit to words with more than 4 characters and, crucially, to words that do NOT occur in all of Shakespeare's works (eliminating 'the', 'I', etc.).

Using the dplyr syntax we've already used we might write code like:

word.count <- filter(word.count, nchar(word) > 4, count < 42)

This works fine with one exception that has always bothered me: we are repeating the name of the table. Thanks to Stefan Milton Bache, the magrittr author, we can now drop these keystrokes using what he calls the compound assignment operator (%<>%). The code can be re-written as:
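A sketch of the compound assignment version (the toy word.count table and its values are hypothetical; 42 is the number of works in the Shakespeare corpus):

```r
library(dplyr)
library(magrittr)  # provides the compound assignment pipe %<>%

# toy stand-in for the word.count summary table
word.count <- data.frame(
  word  = c("the", "henry", "love"),
  count = c(42, 12, 40),
  total = c(27378, 120, 2000),
  stringsAsFactors = FALSE
)

# pipe word.count into filter() and assign the result back to word.count
word.count %<>% filter(nchar(word) > 4, count < 42)
```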

This essentially says “take the word.count object and pipe it to filter, then pipe the result back and write over the original object”. Less code! The words are more interesting now. They don't rank with some of Shakespeare's most interesting words (zounds, zwaggered), but better…

4) tidyr: tidy and re-structure your data

On my computer monitor I have a tiny bit of example code that reminds me how to use the base function reshape. I can never seem to remember how to specify the various arguments. The packages reshape and reshape2 by Hadley Wickham were designed to more simply restructure data. Hadley now has a new package, tidyr, designed to help re-arrange data and, importantly, to integrate with dplyr and magrittr.

In this mini-example, let's go back and filter the original data to just the top 8 words in our new list (and we will drop the corpus_date column).
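With tidyr's spread function this restructuring is one pipeline: filter to the words we want, drop the date column, then spread the words into their own columns. In the sketch below, toy data stands in for the real table and a two-word list stands in for the actual top-8 list.

```r
library(dplyr)
library(tidyr)

shakespeare <- data.frame(
  word        = c("henry", "henry", "love", "love"),
  word_count  = c(8, 3, 20, 15),
  corpus      = c("kinghenryv", "kinghenryiv", "kinghenryv", "hamlet"),
  corpus_date = c(1599, 1598, 1599, 1600),
  stringsAsFactors = FALSE
)

top.words <- c("henry", "love")  # stand-in for the real top-8 list

# one row per corpus, one column per word
wide <- shakespeare %>%
  filter(word %in% top.words) %>%
  select(-corpus_date) %>%
  spread(word, word_count)
```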

Note that there is a nice argument called fill which can be used to assign a value to missing combinations. In this case we should not have NAs; 0 would be more appropriate, since the values are word-occurrence counts:
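A compact sketch of spread with fill (toy data again):

```r
library(tidyr)

shakespeare <- data.frame(
  word       = c("henry", "henry", "love"),
  word_count = c(8, 3, 15),
  corpus     = c("kinghenryv", "kinghenryiv", "hamlet"),
  stringsAsFactors = FALSE
)

# fill = 0 replaces missing word/corpus combinations with 0 instead of NA
wide <- spread(shakespeare, word, word_count, fill = 0)
```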

For comparison, doing the same restructuring with the base function reshape is not only confusing and much more code, but it leaves us extra work to rename the columns using gsub.

5) ggplot2: visualize your data

The package ggplot2 is not nearly as new as dplyr, magrittr and tidyr but it completes the list of components in my new workflow. We have a much more extensive post on ggplot2 (which happens to be our most popular post) and we will use some of those tricks to take a look at the Shakespeare data. We will focus on the total number of word occurrences against the number of works they occur in.

library(ggplot2)
ggplot(word.count, aes(count, total)) + geom_point(color = "firebrick") +
  labs(x = "Number of works a word appears in",
       y = "Total number of times word appears")

Given that we have some words that occur much, much more than others, let's use a log scale.

ggplot(word.count, aes(count, total)) + geom_point(color = "firebrick") +
  labs(x = "Number of works a word appears in",
       y = "Total number of times word appears") +
  scale_y_log10()

Much nicer. This plot, though, is very deceptive! If you draw a smooth through the points you might be surprised by what it does at the low end:

ggplot(word.count, aes(count, total)) + geom_point(color = "firebrick") +
  labs(x = "Number of works a word appears in",
       y = "Total number of times word appears") +
  scale_y_log10() + stat_smooth()

This is because words that appear in a limited number of works also occur much less often overall. Let's try a couple of ways to see this, starting with sizing the circles based on how many times each count/total combination occurs. We need to compute these stats and join them back (using dplyr's inner_join):
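A sketch of that compute-and-join step (toy word.count data; the column name bin is my choice here):

```r
library(dplyr)
library(ggplot2)

# toy word.count summary (word, works it appears in, total occurrences)
word.count <- data.frame(
  word  = c("alpha", "beta", "gamma", "delta"),
  count = c(1, 1, 2, 3),
  total = c(10, 10, 30, 200),
  stringsAsFactors = FALSE
)

# how many words share each count/total combination
bins <- word.count %>%
  group_by(count, total) %>%
  summarize(bin = n())

# attach the bin size to each word, then size the points by it
word.count <- inner_join(word.count, bins, by = c("count", "total"))

p <- ggplot(word.count, aes(count, total, size = bin)) +
  geom_point(color = "firebrick") +
  scale_y_log10() +
  labs(x = "Number of works a word appears in",
       y = "Total number of times word appears")
```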

It is now clear why the smooth dipped so much toward 1/1. As a final step, let's label the word with the maximum occurrences in each bin. Plus, I'm curious what that word is that has very low total occurrences but occurs in most of Shakespeare's works.
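One way to sketch the labeling: keep, for each value of count, the word with the highest total, and feed that subset to geom_text (toy data; vjust and size values are cosmetic choices):

```r
library(dplyr)
library(ggplot2)

word.count <- data.frame(
  word  = c("the", "henry", "love", "time"),
  count = c(3, 1, 3, 2),
  total = c(270, 11, 35, 60),
  stringsAsFactors = FALSE
)

# the word with the most total occurrences within each count bin
labels.dat <- word.count %>%
  group_by(count) %>%
  filter(total == max(total))

p <- ggplot(word.count, aes(count, total)) +
  geom_point(color = "firebrick") +
  geom_text(data = labels.dat, aes(label = word), vjust = -1, size = 3) +
  scale_y_log10() +
  labs(x = "Number of works a word appears in",
       y = "Total number of times word appears")
```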

Summary

Up until last year my R workflow was not dramatically different from when I started using R more than 10 years ago. Thanks to several R package authors, most notably Hadley Wickham, my workflow has changed for the better using dplyr, magrittr, tidyr and ggplot2.


13 responses

Fantastic post, thanks! Great to see integration through the Hadley projects for a very efficient, clean workflow. I’ve been using dplyr and ggplot2 but haven’t incorporated tidyr yet. I am a big fan of the piping and hadn’t seen the compound assignment operator. That seems really convenient because I do just overwrite the data frame 90% of the time when doing dplyr operations.

However, when I try to do that, the counts are completely off. “the”, for instance, is said to appear in 101 works, “and” in 90, etc., which makes no sense at all to me. I'm sure I messed up the syntax somewhere, but can't figure out where. Thanks for any tips.

Yes, you can add mutate to the pipeline. Offhand, I’m not seeing the issue in your code but perhaps start with a “fresh” version of the shakespeare data.frame and run these two chunks (in order) and see if you still have problems.

Absolutely! Sorry, in my haste to respond to you I didn't look at the question carefully enough. You can add the “mutate” piece of the code to the piping, but you also have to add the other bits from the initial code (this includes two group_by statements), so here it is with mutate:
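A sketch of what that combined pipeline might look like (toy data stands in for the raw table; the first group_by/summarize collapses the case variants within each work, the second computes the per-word totals):

```r
library(dplyr)

# toy raw data: case variants of the same word within a single work
shakespeare <- data.frame(
  word       = c("The", "the", "THE", "Henry"),
  word_count = c(10, 5, 1, 3),
  corpus     = c("hamlet", "hamlet", "macbeth", "kinghenryv"),
  stringsAsFactors = FALSE
)

word.count <- shakespeare %>%
  mutate(word = tolower(word)) %>%             # fold case variants together
  group_by(word, corpus) %>%
  summarize(word_count = sum(word_count)) %>%  # one row per word per work
  group_by(word) %>%
  summarize(count = n(), total = sum(word_count)) %>%
  arrange(desc(total))
```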

Thanks. That looks like the missing piece. I'll have to check whether using count = n_distinct(corpus) gives an equivalent answer (the first lines are at least the same). Using n_distinct for the corpus count seems more transparent to me than using two group_by statements.

Thanks for the great tutorial. I’m running into some differences between the execution of the code on the site and my results. I have loaded all of the packages and when I execute the following bits of code:

I updated to the newest dplyr (0.5.0) and bigrquery (0.3.0) and the code still works. I don’t know of any changes. Perhaps start over with my exact code and see what happens. One thing to keep an eye out for is whether shakespeare is already a grouped table (use class(shakespeare)).