One of the priorities for Snowplow in 2017 is extending our software to run on public clouds other than AWS, such as Google Cloud Platform and Microsoft Azure, and to run on-premise using Apache Kafka. Two major goals for this are:

Broadening Snowplow adoption - we want Snowplow to be available to companies on whichever cloud or on-premise architecture they have adopted

Offering “pipeline portability” - investing in Snowplow shouldn’t lock you into a specific cloud platform indefinitely. You should be able to bring the core capabilities - and data - of your Snowplow pipeline with you when you migrate from one cloud to another

When we are exploring a new technology or platform, we like to start by open-sourcing a simple example project (like our Spark Example Project or our AWS Lambda Node.js Example Project). Tasked with learning more about Google Cloud Platform, we set out to create an example project using some of the GCP APIs and services that are most likely to be used by a Snowplow port. That way, both our team and our community can become familiar with aspects of that ecosystem, specifically:

Google Cloud Dataflow

Google Cloud Dataflow is both the name of a unified model for batch and streaming data processing and a GCP managed service for executing data processing jobs using that model.

This example project is a simple analytics-on-write job - the kind of event analytics that could occur in a real-life event data pipeline.
Our Dataflow job reads a Pub/Sub topic containing events in a JSON format:

Our job counts the events by type and aggregates these counts into 1 minute buckets. The job then takes these aggregates and stores them into a table in Google Cloud Bigtable.

There are two major steps in our job:

Downsampling: where we reduce the event’s ISO 8601 timestamp down to minute precision, so for instance “2015-06-05T12:54:43.064528” becomes “2015-06-05T12:54:00.000000”. The downsampled timestamp gives us a fast key for bucketing or aggregating events.

Bucketing: an aggregation technique that builds buckets, where each bucket is associated with a downsampled timestamp key and an event type criterion. By the end of the aggregation process, we’ll end up with a list of buckets - each one with a countable set of events that “belong” to it.
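To make the two steps concrete, here is a minimal sketch in plain Python. It is not the Dataflow job itself, only an illustration of the logic; the event field names "timestamp" and "type" are assumptions made for this example.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout used in the example above (microsecond precision).
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def downsample(iso_timestamp):
    """Truncate an ISO 8601 timestamp to minute precision."""
    parsed = datetime.strptime(iso_timestamp, ISO_FORMAT)
    truncated = parsed.replace(second=0, microsecond=0)
    return truncated.strftime(ISO_FORMAT)


def bucket(events):
    """Count events per (downsampled timestamp, event type) pair.

    Each event is a dict; the "timestamp" and "type" keys are
    hypothetical names chosen for this sketch.
    """
    counts = Counter()
    for event in events:
        key = (downsample(event["timestamp"]), event["type"])
        counts[key] += 1
    return counts
```

Calling downsample("2015-06-05T12:54:43.064528") returns "2015-06-05T12:54:00.000000", matching the example above. In the real job the same grouping runs continuously over a stream rather than over an in-memory list.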

These will prompt links you’ll have to follow on your host OS, which will allow you to use your browser to authenticate.

Step 3: Create your Pub/Sub topic

Next, we’re going to set up the Pub/Sub topic that the job will read from. Use the following command to create a topic named “test-topic”:

$ inv create_pubsub_topic --topic-name=test-topic

Step 4: Create a Bigtable table for storing our aggregates

We’re using “test-table” as the table name, “test-instance” as the instance name and “cf1” as the column family. Bigtable is a clustered NoSQL database. It has the concept of column families: groups of related columns that are likely to be accessed at roughly the same time. In the context of this project, all the columns will live in the same column family.
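As a rough mental model, a Bigtable row can be pictured as a row key that maps "family:qualifier" column names to values. The sketch below lays out minute-bucket counts that way in plain Python. The choice of the downsampled timestamp as row key and the event type as column qualifier is an assumption for illustration, not necessarily the layout the job uses.

```python
from collections import defaultdict

COLUMN_FAMILY = "cf1"  # the single column family used in this tutorial


def to_rows(counts):
    """Lay out {(minute, event_type): count} as row key -> column -> value.

    Assumed layout: one row per minute bucket, one column per event type,
    all columns living in the same column family.
    """
    rows = defaultdict(dict)
    for (minute, event_type), count in counts.items():
        rows[minute][f"{COLUMN_FAMILY}:{event_type}"] = count
    return dict(rows)


example_counts = {
    ("2015-06-05T12:54:00.000000", "Green"): 2,
    ("2015-06-05T12:54:00.000000", "Red"): 1,
}
example_rows = to_rows(example_counts)
```

Here example_rows holds a single row keyed by the minute bucket, with the columns "cf1:Green" and "cf1:Red".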

Step 5: Create a staging location for Dataflow

Dataflow needs a place to store the JAR file that it will use to run our job. This is called the staging location.

To create one for your project, go to https://console.cloud.google.com/storage/browser?project=YOUR-PROJECT-ID and click “Create Bucket”. Fill in the appropriate details. As this is an example project, we suggest you pick the cheapest option:

After creating your bucket, you might want to create a folder inside it to serve as your staging location; you can also just use the bucket’s root. To create a folder, select the bucket in the list and then click “Create Folder”.

Your staging location will be:

gs://your-bucket-name/your-folder

Step 6: Submit your application to Dataflow

Before you can run your application, you’ll need to set up a proper config file. There’s an example in config/config.hocon.sample. If you’ve been using the same names as we did, you’ll only need to make a few changes, specifically:

Updating the project ID

Updating the Pub/Sub topic’s full name

Updating the staging location in Cloud Storage to upload the fat jar to
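If you are unsure what these values should look like, the sketch below derives them in Python. A Pub/Sub topic's full name follows the pattern projects/PROJECT-ID/topics/TOPIC-NAME, and the staging location is the gs:// path described in Step 5. The helper names are hypothetical and are not part of the project.

```python
def topic_full_name(project_id, topic_name):
    """Build the fully qualified Pub/Sub topic name."""
    return f"projects/{project_id}/topics/{topic_name}"


def staging_location(bucket_name, folder=""):
    """Build the Cloud Storage staging path; folder is optional."""
    folder = folder.strip("/")
    if folder:
        return f"gs://{bucket_name}/{folder}"
    return f"gs://{bucket_name}"
```

For example, topic_full_name("YOUR-PROJECT-ID", "test-topic") gives "projects/YOUR-PROJECT-ID/topics/test-topic".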

Then, assuming you’ve built the project as instructed (alternatively, you can download it from Bintray), run the following inside your Vagrant box:

If you leave out the --nr-events parameter, the event generator will run in an infinite loop (press Ctrl+C to kill it).
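The bounded versus unbounded behaviour can be sketched as a Python generator. This is an illustration only, not the project's actual event generator: the event type values and field names are assumptions, and publishing to Pub/Sub is left out.

```python
import itertools
import json
import random
from datetime import datetime, timezone

EVENT_TYPES = ["Red", "Green", "Blue"]  # hypothetical event types


def generate_events(nr_events=None):
    """Yield JSON-encoded events.

    With nr_events=None the generator never stops, mirroring what
    happens when --nr-events is left out.
    """
    steps = itertools.count() if nr_events is None else range(nr_events)
    for _ in steps:
        event = {
            "timestamp": datetime.now(timezone.utc).strftime(
                "%Y-%m-%dT%H:%M:%S.%f"
            ),
            "type": random.choice(EVENT_TYPES),
        }
        yield json.dumps(event)
```

A bounded run looks like list(generate_events(3)). For the unbounded case, wrap the generator in itertools.islice or interrupt it, just as you would with Ctrl+C on the command line.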

Step 8: Monitor your job in the Dataflow UI

Now there are two places we want to check to see if everything is running smoothly.

The first place is Dataflow’s web interface. Go to https://console.cloud.google.com/dataflow?project=YOUR-PROJECT-ID and select the job you just submitted. You should then see a graph of the transforms that make up our data pipeline:

You can click on each transform to get specific information about it, such as its throughput. If something is not working properly, you’ll get warnings under “Logs”. You can also check the central log at https://console.cloud.google.com/logs?project=YOUR-PROJECT-ID. Remember to change YOUR-PROJECT-ID to the appropriate value in these URLs.

Step 9: Review your data in Bigtable

The second place you’ll want to check is your Bigtable table. To do so, you’ll need to use HBase, and we suggest you do it inside your Google Cloud Shell.

To access your Google Cloud Shell, go to your Google Cloud Dashboard (https://console.cloud.google.com/home/dashboard?project=YOUR-PROJECT-ID) and click on the little terminal symbol on the top right corner. That will boot your shell. Wait for it to become available, then:

This will boot your HBase shell, which you’ll use to interact with your Bigtable table. At this point, you can run (assuming your table is named “test-table”):

hbase-shell> scan "test-table"

This will print the rows in your table, where you can see the counts for each bucket and type:

Step 10: Shut everything down

Remember to shut off:

Python event generator

Bring down all the resources used on GCP. To do this you can either delete your project (recommended) or you can manually:

Halt your Dataflow job

Delete your Pub/Sub topic

Delete your Bigtable table

Delete your staging location and bucket

Exit your Vagrant guest

vagrant halt

vagrant destroy

And that’s it! You have successfully run your first streaming analytics job on Google Cloud Platform. Along the way you gained some familiarity with GCP in general, and the Cloud Dataflow, Cloud Pub/Sub and Cloud Bigtable services in particular.

Our experiments with GCP to date have left us impressed by the platform and the various data services that it offers; we are very excited about the opportunities for running Snowplow on Google Cloud Platform, making idiomatic use of these services.

Our next step is to put together a Request for Comments on our Discourse, to set out our proposal for porting Snowplow to GCP. Stay tuned! And if in the meantime you have any questions about this tutorial, do please let us know.

Guilherme Grijó Pires

Gui was a data engineering intern at Snowplow, responsible for our initial work porting Snowplow to Google Cloud Platform. You can find him on GitHub.