Spotify's Scio Library and Google Cloud Dataflow

I recently attended DataEngConf in New York City, and one of my favorite talks was from Spotify. They spoke about their data engineering workflow using Scio, a library they recently open-sourced. Scio is a Scala API for Apache Beam and Google Cloud Dataflow, the data pipeline service in Google's suite of cloud-based tools. Spotify made news in the last year as a big customer defecting from AWS to Google Cloud. I have had good luck with another Spotify open source library, Luigi, so I wanted to give Scio a whirl.

The timing was good: a friend had a 60-day Google Cloud free trial that was in its last week, so I wanted to take advantage of the opportunity. I have used certain AWS products but never evaluated Google Cloud, so this was a chance to kill two birds with one stone.

After cloning the Scio GitHub repo locally and getting my Google Cloud developer account set up, I configured the credentials to access the project, called hip-apricot-143923. I then logged into Cloud Storage, created a bucket, and loaded a 355 MB text file containing weather data.

Being a data engineer, I had heard of BigQuery, Google's data warehouse, so I created a dataset called sitesweather, created a table called sites, connected it to the file in the bucket, and manually defined the schema to map the fields. One piece of advice in the Scio wiki is to leverage BigQuery whenever you can, because of the optimizations built into the product. You can also type a query directly into the BigQuery interface to produce datasets. An example of the BigQuery interface is shown below, with results of the Scio application run.

On my laptop I got the Scio jar file and started up the REPL using this command:

This uploads the Scio jar files and associated code to a bucket specified in the setup, spins up a Compute Engine instance, runs the specified query against the BigQuery table, and returns the result set to the REPL to be modified (or in this case, printed out). In the screenshot of the BigQuery interface below, you can see the results of the various Scio runs, which create temporary tables as part of the query/compute process. The selected table has two fields, LocationID and MinTemp.
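To make the shape of that computation concrete, here is a minimal sketch in plain Scala collections, not Scio itself. It assumes the query computed the minimum temperature per location, which is what the LocationID and MinTemp fields suggest. The Reading case class, its field names, and the sample values are all hypothetical.

```scala
// Hypothetical record shape: the real table's columns are not listed in this
// post, apart from the LocationID and MinTemp fields of the result.
final case class Reading(locationId: String, temp: Double)

object MinTempSketch {
  // Group readings by location and keep the lowest temperature for each one.
  // This is an in-memory analogue of a MIN(...) ... GROUP BY LocationID query.
  def minTempByLocation(readings: Seq[Reading]): Map[String, Double] =
    readings
      .groupBy(_.locationId)
      .map { case (id, rs) => id -> rs.map(_.temp).min }

  def main(args: Array[String]): Unit = {
    // Made-up sample data, for illustration only.
    val sample = Seq(
      Reading("site-a", 12.5),
      Reading("site-a", -3.0),
      Reading("site-b", 7.25)
    )
    // Print one tab-separated line per location, sorted by location id.
    minTempByLocation(sample).toSeq.sortBy(_._1).foreach {
      case (id, minTemp) => println(s"$id\t$minTemp")
    }
  }
}
```

Running main prints one line per location. A Scio pipeline would express the same grouping over a distributed collection rather than an in-memory Seq, with BigQuery doing the heavy lifting where possible.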

I wish I had more time to explore Scio, but alas, my friend's Google Cloud trial expired before I had a chance to go into much depth.