Spark for Big Data Analytics [Part 3]

Next up in the Oceans of Data series covering Apache Spark, I will explore analyzing big data with SparkR, MLlib and H2O for data science and machine learning. In previous articles, I introduced the rise of Apache Spark in Part 1 and began analyzing data in Part 2 with Spark SQL.

Data Scientist Workbench

For analytics enthusiasts who love to learn new technologies in a “hands-on” manner, I found a wonderful free data scientist workbench online. I totally love it and warn you that it is addictive! The data scientist workbench includes OpenRefine for data preparation, Jupyter and Zeppelin analytics notebooks, the RStudio IDE pre-cooked into an Apache Spark cluster, Seahorse for visual Spark machine learning, and text analytics. There are free step-by-step tutorials and free classes. If you have time over the holiday break, enjoy these awesome analytics gifts.

Introduction to SparkR

SparkR is an R package that provides a light-weight front-end to Apache Spark, a cluster computing framework for large-scale data processing. Spark is best known for its ability to cache large data sets in memory, and its default API is simpler than MapReduce.

SparkR is multi-threaded and can be executed across many machines. SparkR uses Spark’s distributed computation engine to allow us to run large-scale data analysis from an R shell. In Spark 1.6.1, SparkR’s distributed dataframe implementation is a key component. Concurrently, RStudio embraced Spark in its latest release. Now we see multiple vendors cooking RStudio directly into big data offerings, allowing our R skills to carry over into the big data analytics world with a minimal learning curve.

There are several ways to work with SparkR. Spark-land sessions start life with a context. You can use the Spark shell to launch a SparkR shell; the difference between the two is that the SparkR shell starts with a SparkContext and SQLContext already created. You can also use the RStudio integrated development environment or analytics notebooks, including Jupyter and Zeppelin.
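Outside the SparkR shell, you create those contexts yourself. A minimal sketch, assuming a local Spark 1.6-era installation with SPARK_HOME set (the app name is illustrative):

```r
# Load the SparkR package shipped with the Spark installation.
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

# Create the SparkContext and SQLContext that the SparkR shell
# would otherwise provide automatically.
sc <- sparkR.init(master = "local[*]", appName = "SparkRIntro")
sqlContext <- sparkRSQL.init(sc)
```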

Dataframes are a fundamental data structure used for data processing in R and SparkR. They are conceptually similar to tables in a relational database. Dataframes use the distributed parallel capabilities offered by Spark RDDs, described in the previous Spark series article.

SparkR’s distributed dataframe implementation supports R operations like selection, filtering, aggregation, and so on for large data set analysis. Dataframes can also be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing local R data frames.

There are several ways to create a dataframe in SparkR.

Convert a local R dataframe into a SparkR dataframe

Create one from raw data, such as a list of lists
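The first approach is a one-liner. A sketch using the built-in mtcars data set, assuming the sqlContext created earlier:

```r
# Convert the local R data frame mtcars into a distributed SparkR dataframe.
cars <- createDataFrame(sqlContext, mtcars)

printSchema(cars)  # schema is inferred from the R column types
head(cars)         # pulls the first rows back to the driver for display
```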

Creating a dataframe requires a schema structure. One method of creating dataframes from data sources is read.df. This method takes in the SQLContext, the path of the file to load, and the type of data source. SparkR supports reading JSON natively, and through Spark packages you can find data source connectors for popular file formats like CSV.
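A sketch of both cases; the file paths are placeholders, and the CSV reader assumes the spark-csv package has been added to the cluster (e.g. via --packages):

```r
# JSON is supported natively.
jsonDF <- read.df(sqlContext, "cars.json", source = "json")

# CSV via the com.databricks:spark-csv Spark package.
csvDF <- read.df(sqlContext, "cars.csv",
                 source = "com.databricks.spark.csv",
                 header = "true", inferSchema = "true")
```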

When using SparkR, you may run into method name conflicts with other loaded packages. To work around them, fully qualify method names using the :: operator.
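For example, base R and dplyr both define filter, so an unqualified call can silently dispatch to the wrong package:

```r
# Qualify with SparkR:: so the SparkR methods are used, not dplyr's.
fast <- SparkR::filter(cars, cars$mpg > 25)
counts <- SparkR::count(SparkR::groupBy(cars, "cyl"))
```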

To select columns in SparkR you can use select to get a list of column names or selectExpr to run a string containing a SQL expression. SparkR also provides a number of functions that can directly be applied to columns in data processing and during aggregation. To specify a column use the R $ syntax i.e. cars$mpg.
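A sketch of these operations, assuming the cars dataframe created from mtcars (the unit conversion in the expression is only illustrative):

```r
head(select(cars, "mpg", "cyl"))                      # select by column name
head(selectExpr(cars, "mpg", "mpg * 0.425 as kpl"))   # SQL expression string
head(filter(cars, cars$mpg > 20))                     # $ syntax for a column
```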

You can also perform the selections and filtering by using standard SQL queries with a Spark SQL dataframe. To do that, we register the dataframe as a table and use the SparkR sql function with the SQLContext. For example, SELECT mpg FROM cars WHERE cyl > 4.
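A sketch of the register-then-query pattern, again assuming the cars dataframe and sqlContext from earlier:

```r
# Register the dataframe as a temporary table, then query it with SQL.
registerTempTable(cars, "cars")
bigger <- sql(sqlContext, "SELECT mpg, cyl FROM cars WHERE cyl > 4")
head(bigger)
```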

SQL queries should be comfortable for most analytics professionals. If not, stop reading this article and start learning how to query data with SQL. SparkR defines the following SQL aggregation functions that you can apply to dataframe columns: avg, min, max, sum, count, countDistinct, and sumDistinct. User-defined functions are also available as of SparkR 2.0.
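A sketch of a grouped aggregation with these functions, computing average mpg per cylinder count on the cars dataframe:

```r
# Group by cylinder count and aggregate, then sort the result.
agg <- summarize(groupBy(cars, "cyl"), avg_mpg = avg(cars$mpg))
head(arrange(agg, "cyl"))
```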

R projects such as dplyr for data preparation and shaping have also been optimized for complex data manipulation tasks on SparkR dataframes. SparkR also supports distributed machine learning using MLlib.

SparkR for Machine Learning

Machine learning identifies or infers patterns in data sets. A machine learning algorithm takes data as input and it produces a model as output. The model can be used to make predictions and score data for reporting or app integration.

Within SparkR for machine learning, there is Spark ML, Spark MLlib, and other projects such as H2O Sparkling Water. Spark MLlib is Spark’s original machine learning library, providing algorithms on top of Spark’s RDD abstraction. Spark ML is a module that provides distributed machine learning algorithms on top of Spark dataframes.

Spark MLlib contains the original API built on top of RDDs

Spark ML is a higher-level API built on top of dataframes

When developing machine learning with Spark, you will create a pipeline, a sequence of stages where each stage is either a transformer or an estimator. A transformer is an algorithm that can transform one dataframe into another dataframe; for example, an ML model is a transformer that turns a dataframe of features into a dataframe with predictions. An estimator is an algorithm that fits on a dataframe to produce a transformer. We also have parameters: all transformers and estimators share a common API for specifying parameters, or algorithm settings.
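SparkR exposes this through an R-style modeling interface. A sketch fitting a Gaussian GLM on the cars dataframe from earlier (the feature choice is illustrative); the glm call acts as the estimator, and the fitted model is a transformer that adds a prediction column:

```r
# Estimator step: fit a generalized linear model via MLlib.
model <- glm(mpg ~ wt + cyl, data = cars, family = "gaussian")
summary(model)

# Transformer step: score a dataframe, producing a prediction column.
scored <- predict(model, cars)
head(select(scored, "mpg", "prediction"))
```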

The rsparkling extension package provides bindings to H2O’s distributed machine learning algorithms via sparklyr. The rsparkling option allows you to access the machine learning routines provided by the Sparkling Water Spark package. This solution provides a few simple conversion functions that allow you to transfer data between Spark DataFrames and H2O Frames.

A typical machine learning pipeline with rsparkling includes the following steps.

Perform SQL queries via the sparklyr dplyr interface.

Use the sdf_* and ft_* functions to generate new columns, or partition your data set into training, validation and test dataframes.

Convert training, validation and/or test dataframes into H2O frames.

Choose an appropriate H2O machine learning algorithm for modeling.

Evaluate model fit, lift and apply it to make predictions with new data.
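The steps above can be sketched end to end with rsparkling; the data set, split ratios, and model choice here are purely illustrative:

```r
library(sparklyr)
library(rsparkling)
library(h2o)
library(dplyr)

sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, "cars")

# Partition into training and test Spark dataframes.
partitions <- sdf_partition(cars_tbl, training = 0.7, test = 0.3, seed = 42)

# Convert the Spark dataframes into H2O frames.
training_h2o <- as_h2o_frame(sc, partitions$training)
test_h2o     <- as_h2o_frame(sc, partitions$test)

# Fit an H2O model and evaluate it on held-out data.
fit <- h2o.glm(x = c("wt", "cyl"), y = "mpg", training_frame = training_h2o)
h2o.performance(fit, newdata = test_h2o)
```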

H2O is impressive. Check out the web site and GitHub projects. H2O was written from scratch in Java and seamlessly integrates with Apache Hadoop and Spark to solve challenging big data analytics problems. It provides an easy-to-use web interface and connects to R, Python, Java, Scala, and JSON via APIs. One awesome feature I appreciate is the visual model evaluation. Real-time data scoring is also pretty cool.

That wraps up this article on SparkR. Stay tuned for more deep dives using these incredible open source big data analytics technologies.

Jen Underwood is a Senior Director at DataRobot and founder of Impact Analytix, LLC. She has a unique blend of product management and “hands-on” experience in data warehousing, reporting, visualization, and advanced analytics. In addition to keeping a constant pulse on industry trends, she enjoys digging into oceans of data to solve complex problems with machine learning.
Over the past 20 years, Jen has held worldwide product management roles at Microsoft and served as a technical lead for system implementation firms. She has experience launching new products and turning around failed projects. Most recently she provided advisory, strategy, educational content development, and marketing services to 100+ technology vendors through her own firm. She has been mentioned by KD Nuggets, Information Management and Forbes for her work. She also has written for InformationWeek, O’Reilly Media, and numerous other tech industry publications.
Jen has a Bachelor of Business Administration – Marketing, Cum Laude from the University of Wisconsin, Milwaukee and a post-graduate certificate in Computer Science – Data Mining from the University of California, San Diego. She was also honored to be a former IBM Analytics Insider, Tableau Zen Master, and Top 10 Women Influencer.