dplyr 0.7 Made Simpler

I have been writing a lot (too much) on the R topics dplyr/rlang/tidyeval lately. The reason is: major changes were recently announced. If you are going to use dplyr well and correctly going forward you may need to understand some of the new issues (if you don’t use dplyr you can safely skip all of this). I am trying to work out (publicly) how to best incorporate the new methods into:

real world analyses,

reusable packages,

and teaching materials.

I think some of the apparent discomfort on my part comes from my feeling that dplyr never really gave standard evaluation (SE) a fair chance. In my opinion: dplyr is based strongly on non-standard evaluation (NSE, originally through lazyeval and now through rlang/tidyeval) more by the taste and choice than by actual analyst benefit or need. dplyr isn’t my package, so it isn’t my choice to make; but I can still have an informed opinion, which I will discuss below.

dplyr itself is a very powerful collection of useful data analysis methods or "verbs." In some sense it is a fairly pure expression of how you organize data transformations in functional programming terms. (By the way: data.table is probably an equally fundamental powerful formation in object oriented terms.)

In my opinion there are only two places where dplyr truly benefits from or actually needs the (often complicated and confusing) full power of non-standard evaluation: in the dplyr::mutate() and dplyr::summarize() verbs.

I admit: a system that can’t accept an arbitrary functions or expressions from the user lacks expressive power. However, the only place you truly need this power is when creating a new derived column in a data.frame. If you can do this then you can drive all of the other important data wrangling functions (row selection, row ordering, grouping, joining, and so on).

When I teach R, I teach you are going to have to copy your data at some point. You are fighting the R language if you try to completely avoid copying as you would in other more reference oriented languages. This is likely one of the reasons Nathan Stephens and Garrett Grolemund define "Big Data" as:

Big Data ~ ≥ 1/3 RAM.

Once you accept you are going to make copies (which is not part of all systems, but in my opinion is a part of R) then you should take advantage of the fact you are going to make copies. In particular you should land, materialize, or reify the results of complicated user expressions as actual data columns (i.e., propagate data forward, not propagate code forward). Doing this wastes some space, but can actually be easier to parallelize, potentially faster, easier to document, and much easier to debug.

The first form doesn’t waste much space (it adds a single new column among many) and is much easier to characterize and debug. By landing our filter criteria in a column it becomes data. Data is something we can reason about and process:

To help demonstrate and explore the expressive power of standard evaluation interfaces I am distributing a new small R package called seplyr (standard evaluation dplyr). seplyr is based on dplyr/rlang/tidyeval and is a thin wrapper that exposes equivalent standard evaluation interfaces for some of the more fundamental dplyr verbs ( group_by(), arrange(), rename(), select(), and distinct() ) and adds some of its own advanced verbs. It is similar to dplyr‘s now-deprecated "SE verbs", but with a more array and list oriented interface (de-emphasizing use of "..." in function arguments).

This standard evaluation interface isn’t so much a "more limited" version of dplyr, but a "more disciplined" approach to working withdplyr. We are using rlang/tidyeval, but that doesn’t mean the user has to see the rlang/tidyeval internals at all times.

For the most part we are passing work to dplyr using very small (and clear) functions. You can see how to use the new dplyr/rlang/tidyeval methods by printing the source code (for example: print(group_by_se)).

Also, in the development version of seplyr we are building up some exciting "complex standard evaluation verbs" such as add_group_indices() and add_group_sub_indices() which are best explained through their own documentation or an example: