New Features in 9.1: Microsoft R Server with sparklyr Interoperability

Introduction

With the launch of Microsoft R Server 9.1, many optimizations and new features were delivered to our users. One key feature is interoperability between Microsoft R Server and sparklyr.

sparklyr, a package by RStudio, is an R interface to Apache Spark. It allows users to use Spark as the backend for dplyr, one of the most popular data manipulation packages (to learn more about dplyr, please visit its CRAN site here). sparklyr also provides interfaces to Spark packages, allows users to query data in Spark using SQL, and lets them develop extensions in R by creating an interface to the full Spark API. Another key feature is that it lets users call Spark's integrated machine learning algorithms directly from within R. For H2O users, the Microsoft R Server sparklyr interop can be used to convert sparklyr data frames to H2O data frames. This allows data imported from Microsoft R Server to be used with H2O modelling and data partitioning algorithms via the rsparkling package.

Microsoft R Server and sparklyr can now be used in tandem within a single Spark session. With this, data scientists and solution engineers can apply Microsoft R Server's advanced machine learning algorithms to data prepared using the dplyr grammar.

Using Microsoft R Server with sparklyr

Prerequisites:

A Hadoop cluster with Spark and a valid installation of Microsoft R Server

Microsoft R Server is configured for use with Hadoop and Spark (Instructions here)

Microsoft R Server SampleData loaded into HDFS

gcc and g++ installed on the edge node that the example will be run on

Write permissions to the R Library Directory

Read/Write permissions to HDFS directory /user/RevoShare

An internet connection or the ability to download and manually install sparklyr and h2o

Note: if you are unfamiliar with using Microsoft R Server with Spark, please see here.

Note: To load SampleData into HDFS, please use one of the following ways:

Microsoft R Server Functions:

Shell Script:

Installation of sparklyr package

If you are on HDI, please install the sparklyr package in the following way:

Example One: Load and Partition Data in sparklyr, Train and Predict in MRS

In this example, we will:

Create a connection to Spark using rxSparkConnect(), specifying sparklyr interop so that sparklyr and its interfaces are used to connect to Spark.

Call rxGetSparklyrConnection() on the compute context to get a sparklyr connection object.

Use dplyr to load mtcars into a Spark DataFrame via the sparklyr connection object.

Partition the data in Spark into a training set and a test set using dplyr.

Register the training set DataFrame in Spark as a Hive table.

Train a model in ScaleR using rxLinMod() on an RxHiveData() object.

With that trained model, run a toy prediction using rxPredict() on the test partition.

After prediction, take the root mean square error (RMSE) of the predictions to determine accuracy.

Note: We will use the standard R dataset mtcars for this example. For more information, please see here.
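The partition, train, predict, and score steps listed above can be tried on a laptop before moving to a cluster. The following is only a minimal local sketch in base R, not the Spark workflow itself: lm() and predict() stand in for rxLinMod() and rxPredict(), a random index split stands in for the in-Spark partition, and the seed, the 50/50 split, and the formula mpg ~ wt + cyl are assumptions chosen for illustration.

```r
# Local stand-in for the Spark workflow: base R replaces Spark, Hive, and ScaleR.
set.seed(1099)  # hypothetical seed, only to make the split reproducible

# Split mtcars in half into a training and a test partition.
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.5 * n))
training <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# lm() stands in for rxLinMod(); the formula is an illustrative assumption.
model <- lm(mpg ~ wt + cyl, data = training)

# predict() stands in for rxPredict() on the test partition.
predicted <- predict(model, newdata = test)

# Root mean square error: square the residuals, average them, take the square root.
rmse <- sqrt(mean((test$mpg - predicted)^2))
print(rmse)
```

A lower RMSE means the predicted mpg values sit closer to the observed ones; the value is in the same units as mpg.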

Sample Code

Sample Output, Comments Removed

Example Two: Load Data with MRS, Partition and Train a Model with sparklyr

In this example, we will:

Create a connection to Spark using rxSparkConnect(), specifying sparklyr interop so that sparklyr and its interfaces are used to connect to Spark.

Call rxGetSparklyrConnection() on the compute context to get a sparklyr connection object.

Use Microsoft R Server to load data from many sources

Partition the data in Spark into a training set and a test set using dplyr.

Register the training set DataFrame in Spark as a Hive table.

Train a model using sparklyr to call Spark ML algorithms

Take a summary of the trained model to see estimates and errors
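To show what "estimates and errors" refers to in the last step above, here is a small local sketch in base R. It is an assumption-laden stand-in: lm() replaces the Spark ML call, mtcars replaces the data loaded by Microsoft R Server, and the formula mpg ~ wt + hp is hypothetical.

```r
# lm() is a local stand-in for a Spark ML regression; the formula is an assumption.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# summary() returns a coefficient table with one row per model term.
coef_table <- summary(fit)$coefficients

# Keep the two columns the step refers to: point estimates and their standard errors.
print(coef_table[, c("Estimate", "Std. Error")])
```

Each row pairs a coefficient estimate with its standard error; a small error relative to the estimate suggests the term is estimated precisely.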

Sample Code

Sample Output, Comments Removed

Example Three: Connect to Spark and Load Data with MRS, Cache Data with dplyr, Train a Model and Predict with H2O, Gather Metrics with MRS

In this example, we will:

Create a connection to Spark using rxSparkConnect(), specifying sparklyr interop so that sparklyr and its interfaces are used to connect to Spark.

Call rxGetSparklyrConnection() on the compute context to get a sparklyr connection object.

Use Microsoft R Server to load training and test data from HDFS

Represent the data as Hive tables, and cache the tables in Spark

Cast the data to H2O data frames for analysis

Train a model using H2O's built-in GLM algorithm

Print the model data

Run a prediction on the test data with h2o.predict

Compute the ROC curve and the area under the curve (AUC) to see how the model performed

Note: In the call to rxSparkConnect, we define numExecutors, executorCores, and executorMem. The values used are the minimum required to run this example. Allocating less memory to the Spark app may cause a hang on the call to as_h2o_frame().
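The final metric step in the list above, the ROC curve and its AUC, can be illustrated without a cluster. The sketch below uses base R only and is not the Microsoft R Server call used in the example; the labels and scores are made-up values, and roc_points() and auc_rank() are hypothetical helper names introduced here.

```r
# Made-up binary labels (1 = positive) and model scores, for illustration only.
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
score  <- c(0.9, 0.3, 0.8, 0.4, 0.5, 0.2, 0.7, 0.6)

# ROC points: walk down the scores from highest to lowest and track the
# cumulative true positive rate and false positive rate.
roc_points <- function(actual, score) {
  a <- actual[order(score, decreasing = TRUE)]
  data.frame(
    fpr = c(0, cumsum(a == 0) / sum(a == 0)),
    tpr = c(0, cumsum(a == 1) / sum(a == 1))
  )
}

# AUC via the rank (Mann-Whitney) formula: the share of positive/negative
# pairs in which the positive case has the higher score.
auc_rank <- function(actual, score) {
  r <- rank(score)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

roc <- roc_points(actual, score)
auc <- auc_rank(actual, score)

# Cross-check: trapezoidal area under the ROC points.
auc_trapezoid <- sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)

print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes, so values closer to 1.0 indicate a better model.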

Sample Code

Sample Output, Comments Removed

Conclusion

The ability to use both Microsoft R Server and sparklyr within one Spark session allows Microsoft R Server users to quickly and seamlessly use features provided by sparklyr in their solutions.

-----

Authors: Kirill Glushko, Premal Shah

For a comprehensive view of all the capabilities in Microsoft R Server 9.1, refer to this blog.