replyr: Get a Grip on Big Data in R

replyr is an R package that contains extensions, adaptions, and work-arounds to make remote dplyr data sources (including big data systems such as Spark) behave more like local data. This allows the analyst to more easily develop and debug procedures that work simultaneously across a variety of data services (in-memory data.frame, SQLite, PostgreSQL, and Spark 2 currently being the primary supported platforms).

Example

Suppose we had a large data set hosted on a Spark cluster that we wished to work with using dplyr and sparklyr (for this article we will simulate such using data loaded into Spark from the nycflights13 package).
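A setup along these lines might look as follows; the connection settings and the table handle name (`flts`) are illustrative, not taken from the original article:

```R
library("sparklyr")
library("dplyr")
library("nycflights13")

# Connect to a local Spark instance (a real cluster would use a
# different master URL) and copy the flights data in as a remote table.
sc <- spark_connect(master = "local")
flts <- copy_to(sc, flights, name = "flights")
```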

We will work a trivial example: taking a quick peek at your data. The analyst should always be able, and willing, to look at the data.

It is easy to look at the top of the data, or any specific set of rows of the data.
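For example, with a remote table handle such as the `flts` used here for illustration, the top of the data and specific rows can be inspected with ordinary dplyr verbs (the work is pushed to the remote service):

```R
# First rows of the remote table, computed on the Spark side.
head(flts)

# A specific set of rows, selected by a filter condition.
flts %>%
  filter(month == 1, day == 1)
```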

One way is through print(), which is much safer with tbl_df-derived classes than with base data.frame (a tbl prints only its first few rows, instead of attempting to pull the whole table).

As we see, replyr's summary returns its results in a data frame, and can deal with multiple column types.
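Assuming the illustrative `flts` handle from above, the call is a one-liner returning a local data frame with one row per column of the remote table:

```R
# Summarize a remote table; the result comes back as a small
# local data.frame describing each column.
replyr::replyr_summary(flts)
```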

Note: the above summary has problems with NA values in character columns under Spark, and is thus mis-reporting the NA count in the tailnum column. We are working on the issue. This is also one of the advantages of taking your work-arounds from a package: when they improve, you can easily bring the improvements into your own work by a mere package update.

We could also use dplyr::summarize_each for the task, but it has the minor downside of returning the data in a wide form.
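A sketch of that alternative, using the dplyr API of the time (summarize_each() and funs() have since been superseded by across()); the columns chosen are illustrative:

```R
# One row out, with one column per variable/statistic pair --
# the "wide form" downside mentioned above.
flts %>%
  summarize_each(funs(min, max, mean, sd), dep_delay, arr_delay)
```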

Special code for remote data is needed because none of the obvious "one liner" candidates (base::summary() or broom::glance()) is currently (as of March 4, 2017) intended to work with remote data sources.