New R package: replyr (get a grip on remote dplyr data services)

It is a bit of a shock when Rdplyr users switch from using a tbl implementation based on R in-memory data.frames to one based on a remote database or service. A lot of the power and convenience of the dplyr notation is hard to maintain with these more restricted data service providers. Things that work locally can’t always be used remotely at scale. It is emphatically not yet the case that one can practice with dplyr in one modality and hope to move to another back-end without significant debugging and work-arounds. replyr attempts to provide a few helpful work-arounds.

Our new package replyr supplies methods to get a grip on working with remote tbl sources (SQL databases, Spark) through dplyr. The idea is to add convenience functions to make such tasks more like working with an in-memory data.frame. Results still do depend on which dplyr service you use, but with replyr you have fairly uniform access to some useful functions.

Example: the following should work across more than one dplyr back-end (such as RMySQL or RPostgreSQL).

Data types, capabilities, and row-orders all vary a lot as we switch remote data services. But the point of replyr is to provide at least some convenient version of typical functions such as: summary, nrow, unique values, and filter rows by values in a set.

This is a very new package with no guarantees or claims of fitness for purpose. Some implemented operations are going to be slow and expensive (part of why they are not exposed in dplyr itself).

We will probably only ever cover:

Native data.frames (and tbl/tibble)

RMySQL

RPostgreSQL

SQLite

sparklyr 2.0.0

The main useful functions we supply are replyr::replyr_filter and replyr::replyr_inTest which are designed to subset data based on a columns values being in a given set. These allow selection of rows by testing membership in a set (very useful for partitioning data). Example below:

I would like this to become a bit of a "stone soup" project. If you have a neat function you want to add please contribute a pull request with your attribution and assignment of ownership to Win-Vector LLC (so Win-Vector LLC can control the code, which we are currently distributing under a GPL3 license) in the code comments.

There are a few (somewhat incompatible) goals for replyr:

Providing missing convenience functions that work well over all common dplyr service providers. Examples include replyr_summary, replyr_filter, and replyr_nrow.

Providing a basis for "row number free" data analysis. SQL back-ends don’t commonly supply row number indexing (or even deterministic order of rows), so a lot of tasks you could do in memory by adjoining columns have to be done through formal key-based joins.

Good code should fill one important gap and work on a variety of dplyr back ends (you can test RMySQL, and RPostgreSQL using docker as mentioned here and here; sparklyr can be tried in local mode as described here). I am especially interested in clever "you wouldn’t thing this was efficiently possible, but" solutions (which give us an expanded grammar of useful operators), and replacing current hacks with more efficient general solutions. Targets of interest include sample_n (which isn’t currently implemented for tbl_sqlite), cumsum, and quantile.

Right now we have an expensive implementation of quantile based on binary search.

replyr_quantile(dRemote,'x')
# 0 0.25 0.5 0.75 1 # 1 1 2 2 2

Some primitives of interest include:

cumsum or row numbering (interestingly enough if you have row numbering you can implement cumulative sum in log-n rounds using joins to implement pointer chasing/jumping ideas, but that is unlikely to be practical, lag is enough to generate next pointers, which can be boosted to row-numberings).

Random row sampling (like dplyr::sample_n, but working with more service providers).

Inserting random values (or even better random unique values) in a remote column. Most service providers have a pseudo-random source you can use.