sparklyr 0.7

Kevin Kuo

2018-01-29

We are excited to share that sparklyr 0.7 is now available on CRAN! sparklyr provides an R interface to Apache Spark. It supports dplyr syntax for working with Spark DataFrames and exposes the full range of machine learning algorithms available in Spark. You can learn more about Apache Spark and sparklyr at spark.rstudio.com and in our new webinar series on Apache Spark. Features in this release:

Adds support for ML Pipelines which provide a uniform set of high-level APIs to help create, tune, and deploy machine learning pipelines at scale.

Enhances Machine Learning capabilities by supporting the full range of ML algorithms and feature transformers.

Improves Data Serialization, specifically by adding support for date columns.

In this blog post, we highlight Pipelines, new ML functions, and enhanced support for data serialization. To follow along in the examples below, you can upgrade to the latest stable version from CRAN with:

install.packages("sparklyr")

ML Pipelines

The ML Pipelines API is a high-level interface for building ML workflows in Spark. Pipelines provide a uniform approach to composing feature transformers and ML routines, and are interoperable across the different Spark APIs (R/sparklyr, Scala, and Python).

First, a quick overview of terminology. A Pipeline consists of a sequence of stages—PipelineStages—that act on some data in order. A PipelineStage can be either a Transformer or an Estimator. A Transformer takes a data frame and returns a transformed data frame, whereas an Estimator takes a data frame and returns a Transformer. You can think of an Estimator as an algorithm that can be fit to some data, e.g. the ordinary least squares (OLS) method, and a Transformer as the fitted model, e.g. the linear formula that results from OLS. A Pipeline is itself a PipelineStage and can be an element in another Pipeline. Lastly, a Pipeline is always an Estimator, and its fitted form is called a PipelineModel, which is a Transformer.

Let’s look at some examples of creating pipelines. We establish a connection and copy some data to Spark:
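A minimal sketch of that setup, assuming a local Spark installation (the table name is our choice):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark cluster
sc <- spark_connect(master = "local")

# Copy the iris data set into Spark; note that Spark replaces the dots
# in column names with underscores, e.g. Sepal.Length -> Sepal_Length
iris_tbl <- sdf_copy_to(sc, iris, name = "iris", overwrite = TRUE)
```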

Then, we can create a new Pipeline with ml_pipeline() and add stages to it via the %>% operator. Here we also define a transformer using dplyr transformations using the newly available ft_dplyr_transformer().

Under the hood, ft_dplyr_transformer() extracts the SQL statements associated with the input and creates a Spark SQLTransformer, which can then be applied to new datasets with the appropriate columns. We now fit the Pipeline with ml_fit() then transform some data using the resulting PipelineModel with ml_transform().

A predictive modeling pipeline

Now, let’s try to build a classification pipeline on the iris dataset.

Spark ML algorithms require that the label column be encoded as numeric and predictor columns be encoded as one vector column. We’ll build on the pipeline we created in the previous section, where we have already included a StringIndexer stage to encode the label column.

Model persistence

Another benefit of pipelines is reusability across programming languages and easy deployment to production. We can save a pipeline from R as follows:

ml_save(prediction_model, "path/to/prediction_model")

When you call ml_save() on a Pipeline or PipelineModel object, all of the information required to recreate it will be saved to disk. You can then load it in the future to, in the case of a PipelineModel, make predictions or, in the case of a Pipeline, retrain on new data.
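As a sketch, a saved PipelineModel can be reloaded (here in a new session, using the same path as above) and used to score data:

```r
# Reload the persisted PipelineModel and transform new data with it
reloaded_model <- ml_load(sc, "path/to/prediction_model")
predictions <- ml_transform(reloaded_model, iris_tbl)
```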

Machine learning

sparklyr 0.7 introduces more than 20 new feature transformation and machine learning functions, covering the full set of Spark ML algorithms. We highlight just a couple here.

Bisecting K-means

Bisecting k-means is a variant of k-means that can sometimes be much faster to train. Here we show how to use ml_bisecting_kmeans() with the iris data.
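A sketch of a call, clustering the petal measurements into three groups (the choice of features, `k`, and `seed` is illustrative):

```r
# Fit a bisecting k-means model on two petal columns
model <- ml_bisecting_kmeans(
  iris_tbl,
  formula = ~ Petal_Length + Petal_Width,
  k = 3,
  seed = 123
)

# Attach cluster assignments to the data
assignments <- ml_predict(model, iris_tbl)
```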