SparkR: Distributed data frames with Spark and R

R is now integrated with Apache Spark, the open-source cluster computing framework. The Databricks blog announced this week that Spark 1.4, released yesterday, includes SparkR, "an R package that allows data scientists to analyze large datasets and interactively run jobs on them from the R shell". The SparkR 1.4 announcement led with the news:

Spark 1.4 introduces SparkR, an R API for Spark and Spark’s first new language API since PySpark was added in 2012. SparkR is based on Spark’s parallel DataFrame abstraction. Users can create SparkR DataFrames from “local” R data frames, or from any Spark data source such as Hive, HDFS, Parquet or JSON. SparkR DataFrames support all Spark DataFrame operations including aggregation, filtering, grouping, summary statistics, and other analytical functions. They also support mixing in SQL queries, and converting query results to and from DataFrames. Because SparkR uses Spark’s parallel engine underneath, operations take advantage of multiple cores or multiple machines, and can scale to data sizes much larger than standalone R programs.
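
A minimal sketch of that workflow, using the Spark 1.4-era SparkR API (this assumes a local Spark installation; `sparkR.init`, `sparkRSQL.init`, `createDataFrame`, `registerTempTable` and `sql` are the SparkR 1.4 entry points, and the built-in `faithful` data frame stands in for real data):

```r
library(SparkR)

# Start a local Spark context and a SQL context (SparkR 1.4 API)
sc <- sparkR.init(master = "local[*]", appName = "sparkr-demo")
sqlContext <- sparkRSQL.init(sc)

# Create a distributed SparkR DataFrame from a local R data frame
df <- createDataFrame(sqlContext, faithful)

# Mix in SQL: register the DataFrame as a table and query it
registerTempTable(df, "faithful")
long_waits <- sql(sqlContext, "SELECT * FROM faithful WHERE waiting > 80")

# head() brings the first rows back as an ordinary local R data frame
head(long_waits)
```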

This announcement is great news if you'd like to use the flexibility of the R language to perform fast computations on very large data sets. Unlike MapReduce in Hadoop, Spark is a distributed framework designed specifically for interactive queries and iterative algorithms. The Spark DataFrame abstraction is a tabular data object similar to R's native data.frame, but stored in the cluster environment. This conceptual similarity lends itself to elegant processing from R, using a syntax that will feel familiar to dplyr users.
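
To sketch what that dplyr-like style looks like in SparkR 1.4 (again assuming a local Spark installation, with the `faithful` dataset as an illustrative stand-in; `filter`, `groupBy`, `summarize` and `n` are SparkR's DataFrame verbs):

```r
library(SparkR)
sc <- sparkR.init(master = "local[*]")
sqlContext <- sparkRSQL.init(sc)

df <- createDataFrame(sqlContext, faithful)

# Verbs chain much as in dplyr, but execute on the distributed DataFrame
waiting <- filter(df, df$waiting > 70)
by_eruptions <- groupBy(waiting, waiting$eruptions)
head(summarize(by_eruptions, count = n(waiting$eruptions)))
```

Nothing is computed until a result is requested (here via `head`), so the filtering and aggregation run inside Spark's engine rather than in the R session.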

This new R package will give R users access to many of the benefits of the Spark framework, including the ability to import data from many sources into Spark, where it can be analyzed using the optimized Spark DataFrame system. Spark computations are automatically distributed across all the cores and machines available on the Spark cluster, so this package can be used to analyze terabytes of data using R.
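
As a sketch of that import path (the JSON file path below is illustrative; `read.df` is the SparkR 1.4 data source reader):

```r
library(SparkR)
sc <- sparkR.init(master = "local[*]")
sqlContext <- sparkRSQL.init(sc)

# Read a JSON data source directly into a distributed DataFrame
# (Parquet, Hive and HDFS sources work the same way via read.df)
people <- read.df(sqlContext, "examples/src/main/resources/people.json", "json")
printSchema(people)

# Row count, computed across the cores/machines of the cluster
count(people)
```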