A case against pipes in R and what to do instead

Pipes (%>%) are great for improving the readability of lengthy data
processing scripts, but I’m beginning to learn they have some
weaknesses when it comes to large and complex data processing.

We are running a number of projects at the moment that require managing
and wrangling large and complex datasets. We have numerous scripts we
use to document our workflow and the data wrangling steps. This has
turned out to be very helpful, because when we identify bugs in the end
product, we can go back and fix them.

But I’m starting to see a pattern. Most of the really insidious bugs
occur in sections of code that use dplyr tools and pipes. These are
always the kind of bugs that don’t throw an error: you still get a
result, it just turns out to be wrong. They are the worst kind of bug,
and the hardest to detect and fix.

So we are now moving away from using pipes in complex scripts. For
simple scripts I intend to keep using them, since they are so fast and
easy. Here’s what we’re trying instead.

The problem with pipes

So here’s some made-up data that mimics the kind of fish survey data
we often have:

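Something like the following, a minimal sketch with site-level
temperatures and transect-level fish counts (the exact values here are
invented for illustration):

sites <- data.frame(
  site = letters[1:5],
  temp = c(21.2, 22.5, 23.1, 20.8, 24.0)
)

dat <- data.frame(
  site = rep(letters[1:5], each = 4),
  transect = rep(1:4, times = 5),
  fish = c(10, 9, 11, 9,
           8, 12, 7, 10,
           15, 6, 9, 11,
           5, 14, 10, 8,
           12, 7, 13, 6)
)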
sites

So we have site-level data with a covariate, temp, and transect-level
data with fish counts.

Now say we have an error and one of our sites has capitals instead of
lower case, so let’s introduce that bug:

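One way this could happen, continuing the sketch data from above:

sites$site[sites$site == "a"] <- "A"

sites
##   site temp
## 1    A 21.2
## 2    b 22.5
## 3    c 23.1
## 4    d 20.8
## 5    e 24.0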
Obvious enough here, but issues like that are much harder to detect in
very large datasets.

Unit testing

The solution of course is to code in ‘unit tests’ to make sure each
operation is doing what you expect. For small data you can just look,
but for big datasets it’s not so easy.

For long pipes with multiple steps we’d usually do this debugging and
testing interactively. So I’d write the first line (the join), save
the output to a new variable, check it worked OK, then move on to
write the next step of the pipe.

Now here’s the catch. In complex projects it’s common for the data
that goes into your pipe (in this case the dat or sites dataframes) to
change. For instance, in our current project new data comes in all the
time.

New data presents new issues. So a pipe that worked the first time may
no longer work the second time.

This is why it is crucial to have unit tests built into your code.

There are lots of sophisticated R packages for unit testing, including
ones that work with pipes. But given many of us are just learning
tools like dplyr, it’s not wise to add extra tools on top. So here
I’ll show some simple unit tests with base R.

Unit testing an example

Joins often cause problems, due to mis-matching (e.g. if site names
are spelt differently in different datasets, which is a very common
human data entry error!).

So it’s wise to check the join has worked. Here are some examples:

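First, the join itself. With the sketch data above and dplyr’s
inner_join (an assumption on my part; the point is that inner joins
silently drop rows that don’t match):

library(dplyr)

dat2 <- inner_join(dat, sites, by = "site")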
dat2

Now compare the number of rows:

nrow(dat2)
## [1] 16
nrow(dat)
## [1] 20

Obviously the join has lost data in this case.

We can do better than that in a complex script, though. We’d like to
get an error if the number of rows changes. We can do this:

nrow(dat2) == nrow(dat)
## [1] FALSE

Which tells us TRUE/FALSE if the condition is met. To get an error,
use stopifnot:

stopifnot(nrow(dat2) == nrow(dat))

Common unit tests for data wrangling

Off the top of my head, here are a few of my most commonly used unit
tests. To check the number of sites has stayed the same, use
length(unique(...)) to get the number of unique cases:
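For example, with the dat and dat2 objects from the sketch above:

length(unique(dat2$site)) == length(unique(dat$site))
## [1] FALSE

# or, to stop the script immediately:
# stopifnot(length(unique(dat2$site)) == length(unique(dat$site)))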

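The duplicate-rows scenario described below can be sketched the same
way. Assume the capitalisation bug has been fixed, and sites_new (an
invented name) is a new version of the sites table that arrives with
an accidentally duplicated row:

sites$site[sites$site == "A"] <- "a"  # fix the earlier capitalisation bug

sites_new <- rbind(sites, sites[1, ]) # new table with a duplicate row

# recent dplyr versions will also warn about a many-to-many join here,
# which is itself a useful red flag
dat3 <- inner_join(dat, sites_new, by = "site")

tapply(dat3$fish, dat3$site, sum)     # total fish per site
##  a  b  c  d  e
## 78 37 41 37 38

# a row-count test would have caught this straight away:
# stopifnot(nrow(dat3) == nrow(dat))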
It looks like site a has twice as many fish as it really does (78,
when it should have 39). So imagine you had a sites dataframe you were
happy with, then your collaborator sent you a new one to use, but it
had duplicate rows. If you didn’t have the unit test in place to check
your join, you might never know about this doubling of data error.