Running Spark on R with SparkR

R is still one of the most powerful languages for data scientists, and the bar was raised further at the end of January 2014, when UC Berkeley’s AMPLab announced a developer preview of SparkR, a new project for using Apache Spark natively from R.

A Big Data framework for in-memory data processing at scale, Apache Spark has been gaining a lot of traction lately as big companies like Cloudera throw their weight behind the project. Cloudera recently announced that Spark is now officially supported in its Cloudera Distribution for Hadoop (CDH) from version 4.4.0 onwards. This includes Spark 0.9, the most recent release, which came out in February and is a prerequisite for SparkR. SparkR comes at the right time: CDH is one of the most popular Hadoop distributions, so this should help drive Spark adoption among data scientists, who may be more familiar with R than with Java or Scala, as shown by a recent O'Reilly survey of data scientists.

SparkR should be seen as a lightweight frontend for using Spark from R. Its API will not be as extensive as the Scala or Java bindings, but it will be sufficient to run Spark jobs from R and manipulate data. One of its key features is the ability to serialize closures, so that any variables a computation needs are transparently copied to the Spark cluster. SparkR also integrates with other R packages via a built-in function that tells the Spark cluster to load a particular package needed for a computation; unlike closure variables, though, these packages must be specified manually. More details on the technical capabilities of SparkR can be found in this summary. SparkR can also use Spark's EC2 scripts to be set up easily on EC2, and instructions for doing so can be found on GitHub.
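To make the idea of closure serialization concrete, here is a minimal sketch in plain Python of what "copying captured variables along with a function" can look like. It is not SparkR's implementation and does not use Spark or R; the helper names `serialize_closure`, `deserialize_closure`, and `make_scaler` are hypothetical and exist only for this illustration. The sketch assumes Python 3.8 or later and that the "driver" and "worker" run the same Python version.

```python
import builtins
import marshal
import pickle
import types


def serialize_closure(fn):
    """Pack a function's bytecode together with the values it captured.

    Hypothetical helper for illustration only; not part of any Spark API.
    """
    # Each cell in fn.__closure__ holds one variable captured from the
    # enclosing scope, in the same order as fn.__code__.co_freevars.
    captured = [cell.cell_contents for cell in (fn.__closure__ or ())]
    return pickle.dumps({
        "code": marshal.dumps(fn.__code__),
        "name": fn.__name__,
        "captured": captured,
    })


def deserialize_closure(payload):
    """Rebuild the function on the receiving side from the shipped payload."""
    data = pickle.loads(payload)
    code = marshal.loads(data["code"])
    # Re-create one cell per captured value so the free variables resolve.
    cells = tuple(types.CellType(value) for value in data["captured"])
    # The rebuilt function gets a fresh, nearly empty global namespace.
    fresh_globals = {"__builtins__": builtins}
    return types.FunctionType(code, fresh_globals, data["name"], None, cells)


def make_scaler(factor):
    # `factor` is a free variable captured by the inner function.
    def scale(x):
        return x * factor
    return scale


payload = serialize_closure(make_scaler(3))   # "driver" side
worker_fn = deserialize_closure(payload)      # "worker" side
results = [worker_fn(x) for x in [1, 2, 3]]
```

Only the captured variable travels with the function in this sketch; any module the function relies on would still have to be made available on the receiving side separately, which loosely mirrors why SparkR asks users to declare required packages by hand.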

The data science crowd has been pretty vocal about SparkR, and Twitter in particular saw many messages of support for the project. Alex Pinto, lead at MLSecProject, for example, tweeted the following:

This is very promising: SparkR by @amplab. Puts together my favorite things for data analysis.

The project is on GitHub and already has a fairly active community, with close to 100 stars. Considering that the project is barely a month old, this is significant growth. There are also several open issues, a sign that the community is actively involved in this new open-source project.

The AMPLab team has expressed interest in eventually integrating SparkR with Spark's MLlib machine learning library, so that algorithms can be parallelized seamlessly without having to specify manually which parts can run in parallel. MLlib is one of the components of a larger machine learning project called MLBase, which also includes higher-level abstractions and an optimizer. MLlib is one of the fastest-growing machine learning libraries, with more than 137 contributors, so making it usable from R makes a lot of sense for AMPLab as a way to encourage contributions to MLlib from R users.