SparkR just got better

In this document I'll give a quick review of some exciting new features in SparkR from the new 1.5 release, which officially came out on September 9th, 2015.

RStudio provisioned

Starting a fully provisioned SparkR service on AWS just got a whole lot easier. It takes about 15 minutes and only requires you to do three simple things.

Step 1: Command line

If you have your Amazon .pem files in order on your machine, as well as your AWS credentials in your .bash_profile, you can use the command line app in the spark-ec2 folder to start up a cluster on Amazon.

The example below will start a cluster with 4 slave nodes of type c3.xlarge in the eu-west-1 region; the name of the cluster is 'my-spark-cluster'. If you want a bigger cluster, this is where you set your preferences.
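
Assuming your key pair is called my-spark-key and the .pem file lives at a path of your choosing (both are placeholders, swap in your own), the command looks something like this:

./spark-ec2 \
  --key-pair=my-spark-key \
  --identity-file=/path/to/my-spark-key.pem \
  --region=eu-west-1 \
  --slaves=4 \
  --instance-type=c3.xlarge \
  launch my-spark-cluster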

Step 2: SSH + set password

The same command line app allows you to SSH into the master machine of the Spark cluster.
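
With the same placeholder key pair as before, logging in looks something like this:

./spark-ec2 --key-pair=my-spark-key --identity-file=/path/to/my-spark-key.pem --region=eu-west-1 login my-spark-cluster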

Once you are logged in, you'll be logged in as the root user. For security reasons it is preferable to have a separate user for RStudio. This user has already been added to the system; the only thing you need to do is assign a password to it.

passwd rstudio

Step 3: Login

First, check the public IP address of the master server.

curl icanhazip.com

Then point your favorite browser at that address on port 8787 (the default RStudio Server port) and log in with the rstudio user and the password that you've set.

You can use the startSpark.R script to get started quickly. It creates a sparkContext and a sqlContext, which you can then use to create distributed dataframes on your cluster.
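
I won't reproduce the script here, but a minimal sketch of it, assuming the stock Spark 1.5 SparkR API and a placeholder master address, would look something like this:

# load the SparkR package that ships with Spark (assuming it is on the library path)
library(SparkR)

# connect to the cluster; replace the placeholder with the master's address
sc <- sparkR.init(master = "spark://<master-ip>:7077", appName = "rstudio")

# the sqlContext is what you use to create distributed dataframes
sqlContext <- sparkRSQL.init(sc)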

When you are done

Once you are done with the cluster and have saved all your files back to S3, you don't need to pay for the EC2 machines anymore. The cluster can then safely be destroyed via the same command line app as before.
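
Using the same placeholder names as before:

./spark-ec2 --region=eu-west-1 destroy my-spark-cluster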

Distributed glm

In this new SparkR shell you will notice that the glm method now comes from multiple packages.

?glm
Help on topic ‘glm’ was found in the following packages:

  Package   Library
  SparkR    /Users/code/Downloads/spark-1.5.0-bin-hadoop2.6/R/lib
  stats     /Library/Frameworks/R.framework/Versions/3.2/Resources/library

You can fit the model and save it in a variable, just like in normal R.
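
The formula below suggests R's built-in ChickWeight dataset; assuming that is indeed the source, the distributed dataframe ddf can be created like this:

# turn a local R data.frame into a distributed dataframe
ddf <- createDataFrame(sqlContext, ChickWeight)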

dist_mod <- glm(weight ~ Time + Diet, data = ddf, family = "gaussian")

To view the characteristics of the regression, you only need to run it through the summary function (a pattern common to many things in R).
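
For example (local_mod is a made-up name for a base R fit on the same data, added here for the comparison below):

# summary of the distributed model
summary(dist_mod)

# the same model fitted locally in base R, for comparison
local_mod <- glm(weight ~ Time + Diet, data = ChickWeight, family = "gaussian")
summary(local_mod)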

Compare the output of the local model with the distributed glm output. You might be slightly startled at this point: the two models give very different output!

Don't be scared just yet. The way Spark has implemented machine learning is different, but it is still doing a proper regression.

The main difference is that R's glm absorbs the Diet1 level into the intercept (it is the reference category), whereas SparkR's glm uses Diet4 as the reference. The two parameterisations differ only by a linear transformation, so the predicted outcomes should still be the same. Another way to confirm this is to notice that the difference between the Diet2 and Diet3 coefficients is the same in both models, and the parameter for Time is identical too.

The logistic version of glm unfortunately doesn't give a nice summary output at the moment. This is a known missing feature that should become available as of Spark 1.6. Another problem is that SparkR currently does not seem to support string labels, which means that all classification targets need to be cast to integers manually.
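
As a sketch of the workaround (the heavy column and the 150 gram cutoff are made up for this example):

# build an integer 0/1 label by hand, since string labels are not supported yet
ddf$heavy <- cast(ddf$weight > 150, "integer")
logit_mod <- glm(heavy ~ Time + Diet, data = ddf, family = "binomial")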

For most dataframe operations this is nice to start playing with, but don't expect the full flexibility of base R just yet. Right now dates and datetimes cannot be summarized, nor can they be used in models. This is a known issue in the Spark Jira that people are working on.

Levenshtein

This might fall in the special use case category, but the new levenshtein function can be surprisingly handy when trying to find similar names in a large databank.
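
A small sketch, using a made-up dataframe with two name columns:

# levenshtein computes the edit distance between two string columns
names_df <- createDataFrame(sqlContext, data.frame(
  a = c("John Smith", "Jon Smith"),
  b = c("John Smit", "John Smith"),
  stringsAsFactors = FALSE))
head(select(names_df, names_df$a, names_df$b, levenshtein(names_df$a, names_df$b)))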


The future

Spark could use some more features, but the recent additions already enable many use cases. Spark is a project with enormous traction and a new release every three months, so you can expect more to come.

Being able to work with a distributed dataframe in a dplyr-like syntax really opens doors for R users who want to handle larger datasets. The fact that all of this can run cheaply on Amazon adds to the benefit.