Using Spark SQL for ETL

With big data, you deal with many different formats and large volumes of data. SQL-style queries have been around for nearly four decades. Many systems support SQL-style syntax on top of the data layers, and the Hadoop/Spark ecosystem is no exception. This allows companies to try new technologies quickly without learning a new query syntax for basic retrievals, joins, and aggregations.

Amazon EMR is a managed service for the Hadoop and Spark ecosystem that allows customers to quickly focus on the analytics they want to run, not the heavy lifting of cluster management.

In this post, we demonstrate how you can leverage big data platforms and still write queries using a SQL-style syntax over data that is in different data formats within a data lake. We first show how you can use Hue within EMR to perform SQL-style queries quickly on top of Apache Hive. Then we show you how to query the dataset much faster using the Zeppelin web interface on the Spark execution engine. Lastly, we show you how to take the result from a Spark SQL query and store it in Amazon DynamoDB.

Hive and Spark SQL history

For versions <= 1.x, Apache Hive executed native Hadoop MapReduce to run the analytics and often required the interpreter to write multiple jobs that were chained together in phases. This allowed massive datasets to be queried but was slow due to the overhead of Hadoop MapReduce jobs.

SparkSQL adds this same SQL interface to Spark, just as Hive added to the Hadoop MapReduce capabilities. SparkSQL is built on top of the Spark Core, which leverages in-memory computations and RDDs that allow it to be much faster than Hadoop MapReduce.

Spark integrates easily with many big data repositories. The following illustration shows some of these integrations.

Using SparkSQL for ETL

In the second part of this post, we walk through a basic example using data sources stored in different formats in Amazon S3. Using a SQL syntax language, we fuse and aggregate the different datasets, and finally load that data into DynamoDB as a full ETL process.

The table below summarizes the datasets used in this post.

Create a table in Hive/Hue

Hive and SparkSQL let you share a metadata catalogue. This allows you to create table definitions one time and use either query execution engine as needed. All table definitions could have been created in either tool exclusively as well.

First, launch an EMR cluster with Hive, Hue, Spark, and Zeppelin configured. It’s recommended that you run a cluster with at least four core nodes if the default instance size is m3.xlarge.

After you create the array, the genres appear in the sample data browser.

Transform the data using SparkSQL/Zeppelin

Now interact with SparkSQL through a Zeppelin UI, but re-use the table definitions you created in the Hive metadata store. You’ll create another table in SparkSQL later in this post to show how that would have been done there.

Connect to the Zeppelin UI and create a new notebook under the Notebook tab. Query to show the tables. You can see that the two tables you created in Hive are also available in SparkSQL.

%sql SHOW tables

Using SparkSQL, you can perform the same query as you did in Hive in a previous step.

This time, the query returned within a couple seconds so that analysts can quickly interact with the large data set in the RDD.

Suppose you want the same information as the previous query, but this time broken out by the top five movies for males and the top five for females. To do this, bring in the data set user-details. This data set contains information such as gender and occupation. This data set is pipe delimited.

This query combines two queries in a union statement. The first query gets the five top-rated movies for males using all three datasets and then combines the results with the five top-rated movies for females:

Because the ratings table is still cached in the SparkContext, the query happens quickly (in this case, four seconds).

Load the transformed data into DynamoDB

Next, create a new DynamoDB table that saves the number of ratings that users voted on, per genre and rating number. To query this, you first need to figure out which movies were voted on. Combine that information with the movie details data and figure out the movie’s genres to know how are users voting per genre.

The following SQL statement queries for that information and returns the counts:

The query result is stored in a Spark DataFrame that you can use in your code.

After you have the DataFrame, perform a transformation to have an RDD that matches the types that the DynamoDB custom output format knows how to write. The custom output format expects a tuple containing the Text and DynamoDBItemWritable types.

This last call uses the job configuration that defines the EMR-DDB connector to write out the new RDD you created in the expected format:

ddbInsertFormattedRDD.saveAsHadoopDataset(ddbConf)

Conclusion

EMR makes it easy to run SQL-style analytics in both Spark and Hive. As this post has shown, connectors within EMR and the open source community let you easily talk to many data sources, including DynamoDB. Rather than focusing on standing up the software and managing the cluster, with EMR you can quickly process and analyze your data and store the results in destinations such as NoSQL repositories and data warehouses.