Random stuff from a software developer

Combining data from multiple sources with Spark and Zeppelin

I’ve been doing a lot with Spark lately, and I love how easy it is to pull in data from various locations, in various formats, and have be able to query/manipulate it with a unified interface.

Last week, I had a situation at work where I needed to get a list of users (from our mySQL database), filtered with ancillary data from JSON files in S3 and a CSV file that had been sent to me and was sitting on my local machine. Using Spark and Zeppelin, I was able to do this in just a few minutes – analyzing a few GBs of data from multiple sources in multiple formats from my local machine took only a few minutes to execute, too (this approach would work with much larger data also, you just would want to run it on a cluster..).

I wanted to share a simplified, adapted version of that with others to show the power of Spark and Zeppelin. This is a really trivial example with a tiny amount of data, but hopefully it gives an idea of what is possible – and while it may seem kind of gimmicky, I’ve already used something very similar once for an actual business use case, and I can see lots of other situations where it could come in handy as well.

The Task

Marketing has sent us a list of users – some of whom they sent an email, some they didn’t. They want to know how many pages the users who got sent our email viewed versus users who did not. Unfortunately, our analytics data only has the user id, and marketing only has their email address – we’ll have to use the data in our database to bridge them.

Requirements

Zeppelin
mySQL
AWS S3 (You can obviously change this to a local file, if you want)

Files

I’ve uploaded samples for all of the assets you need, as well as the Zeppelin notebook itself:

SQL file for creating the users table and the users in itJSON file with analytics data to be loaded from S3Email list with all of the users that marketing emailed in CSV formatZeppelin file that has all of the code

I’ve posted screenshots in this post, if you’d like the code, grab the Zeppelin file.

Step 1: Imports

We need to import a mySQL driver, the AWS SDK (so we can load data from S3), and the databricks library that can create RDDs from CSV files.

Step 2: Load users from mySQL

We want to create a DataFrame from our users table in our database. Note that it will parse the schema and turn it into a DataFrame with similar column names as are in the table. We now have all of our users and their ids/names/emails available to us.

Step 3: Load JSON file from S3

Spark is really awesome at loading JSON files and making them queryable. In this case, we’re doing a little extra work to load it from S3 – just give it your access key, secret key, and then point it at the right bucket and it will download it and turn it into a DataFrame based on the JSON structure.

Now we have a record of all events from all of our users.

Step 4: Load the email list CSV from our local filesystem

We have a simple CSV file in the format of <email address>,<promo email sent: true|false>. Let’s read it in now, too, so we know who got an email and who didn’t. In this case, our CSV has a header file that Spark will use to create the DataFrame columns, and it will attempt to infer the schema based on the data in the file – we can also pass in an explicit schema if we need to.

Step 5: Query

Now, let’s write a query that joins them all together and tells us who got an email and how many events they produced. As you can see, this looks like any other SQL query, only it is pulling data in from a mySQL database, a JSON file we pulled from S3, and a local CSV file.

Step 6: Graph

We can even use the %sql interpreter in Zeppelin to write a query without any Scala code and graph the results:

Conclusion

As you can see, the combination of Spark and Zeppelin is incredibly powerful. Data exploration and complex reporting – even with terabytes of data spread across multiple data sources – is fairly easy to do.