Hands-on dplyr tutorial for faster data manipulation in R

I love dplyr. It's my "go-to" package in R for data exploration, data manipulation, and feature engineering. I use dplyr because it saves me time: its performance is blazing fast on data frames, but even more importantly, I can write dplyr code faster than base R code. Its syntax is intuitive and its functions are well-named, so dplyr code is easy to read even if you didn't write it.

dplyr is the "next iteration" of the plyr package (focusing on data frames, hence the "d"), and version 0.1 was released in January 2014. It's being developed by Hadley Wickham (author of plyr, ggplot, devtools, stringr, and many other R packages), so you know it's a well-written, well-documented package.

Teaching dplyr using an R Markdown document

As one of the instructors for General Assembly's 11-week Data Science course in Washington, DC, I had 30 minutes in class last week to talk about data manipulation in R, and chose to focus exclusively on dplyr. When putting together my presentation, I had a lot of great material to draw from.

Using the hflights dataset (available on CRAN), I demonstrate the five basic dplyr "verbs," the chaining syntax, some of the more advanced functionality (such as window functions), a few of the new convenience functions that I find most useful (such as glimpse and summarise_each), and how to query a database using dplyr. I also compare many of the dplyr commands to the equivalent commands in base R. (Thanks to Hadley, because many of the examples I use are ones he wrote!)
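To give a flavor of the five verbs and the chaining syntax, here is a minimal sketch. It is not taken from the presentation itself: to keep it self-contained, it uses a tiny made-up data frame (`flights`, with hypothetical columns `carrier`, `dep_delay`, `distance`, and `air_time`) rather than the real hflights dataset, and it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for hflights (NOT the real dataset)
flights <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "DL"),
  dep_delay = c(10, -2, 35, 5, NA),
  distance  = c(224, 224, 1400, 1400, 689),
  air_time  = c(40, 45, 180, 175, 95),
  stringsAsFactors = FALSE
)

# The five basic verbs, chained with %>% so the steps read top to bottom
result <- flights %>%
  filter(!is.na(dep_delay)) %>%                         # filter: keep rows matching a condition
  select(carrier, dep_delay, distance, air_time) %>%    # select: pick columns by name
  mutate(speed = distance / air_time * 60) %>%          # mutate: add a derived column (miles per hour)
  group_by(carrier) %>%                                 # group rows before summarising
  summarise(mean_delay = mean(dep_delay),               # summarise: one row per group
            mean_speed = mean(speed),
            n_flights  = n()) %>%
  arrange(desc(mean_delay))                             # arrange: sort rows, largest delay first

print(result)

# For comparison, the filter step alone in base R:
base_filtered <- flights[!is.na(flights$dep_delay), ]
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes the chain possible. The base R line at the end does the same job as `filter()` but needs the data frame's name repeated inside the brackets.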

Watch the dplyr tutorial on YouTube

After presenting, I recorded the entire presentation as a YouTube video (embedded below), since I know it can be helpful to hear someone explaining code that is unfamiliar to you. It runs 39 minutes, but if you only want to watch a particular section, simply click the topic below and it will skip to that point in the video.