Proc Out: A Guide on Utilizing Talend with Google Cloud Dataproc

Mark Balkenende is a Sales Solution Architects Manager at Talend. Prior to joining Talend, Mark had a long career mastering and integrating data at a number of companies, including Motorola, Abbott Labs and Walgreens. Mark holds an Information Systems Management degree and is also an extreme cycling enthusiast.

We recently updated connectivity across all of our Google components and introduced new functionality for Google Cloud Platform. To start with, we now support Google Dataproc and Pub/Sub in addition to all of Google's storage options.

In this blog, I want to show how impressive Dataproc, Google's big data processing platform, is and how you can use Talend's Big Data Platform to quickly build and schedule processing pipelines in Google Cloud Platform.

I recently built a process that queries live Twitter data into Google Storage, pushes that data to Dataproc to process it and pull out keywords, and loads the key insights into BigQuery. With Talend, I can also build the pipeline to dynamically start and stop the Dataproc clusters. I no longer need a long-running cluster: I can now provision a full Apache Spark cluster (plus other Apache services) in a matter of 60-90 seconds, do the needed processing, and stop the cluster once it completes. YES, you read that right: I can start a fully functioning Spark cluster in Google Dataproc in seconds (and no, Google is not paying me to say nice things). I have used many other services to launch Hadoop-based clusters and fully functional Spark clusters, and never have I seen one go so FAST!
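The start, process, stop pattern described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Talend implements it: `FakeClusterClient`, `ephemeral_cluster`, `run_pipeline` and the cluster name `tweets-cluster` are all hypothetical names invented for this example, and the fake client simply records calls instead of talking to Google Cloud.

```python
from contextlib import contextmanager


class FakeClusterClient:
    """Hypothetical stand-in for a cluster API; it only records the calls made."""

    def __init__(self):
        self.calls = []

    def create(self, name, workers):
        self.calls.append(("create", name, workers))

    def delete(self, name):
        self.calls.append(("delete", name))


@contextmanager
def ephemeral_cluster(client, name, workers=2):
    """Create a cluster, hand its name to the caller, and always delete it afterwards."""
    client.create(name, workers)
    try:
        yield name
    finally:
        # Runs even if the processing step raises, so no cluster is left running.
        client.delete(name)


def run_pipeline(client, job):
    """Run one processing job on a short-lived cluster and return its result."""
    with ephemeral_cluster(client, "tweets-cluster") as cluster:
        return job(cluster)


client = FakeClusterClient()
result = run_pipeline(client, lambda cluster: f"processed on {cluster}")
print(result)
print(client.calls)
```

The point of the `try`/`finally` is that the cluster is torn down whether the processing succeeds or fails, which is what makes a short-lived cluster safe to use in a scheduled pipeline.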

Let’s take a look at my process in a little more detail. Below is the high-level Talend job that runs all the sub-steps, which we sometimes refer to as the Master Job or the Processing Pipeline.

If you are thinking, “That looks too simple, what can it be doing?”, well, as I said earlier, the first component queries Twitter for the keywords or hashtags that I’m interested in analyzing. The full text of the tweets goes into a Google Storage bucket. Then the Dataproc cluster is started. Talend provides out-of-the-box components to start and provision Dataproc clusters, and they are quite simple to use. As you can see in the image below, you just need to enter the basic cluster information and provide the correct Google security credentials, and the cluster is up and running.

Machine Learning to the Rescue

Once the cluster starts, we use a Spark machine learning process to parse the tweets. Once the tweets are parsed into JSON, a second Spark process reads the JSON output and loads the keywords into a BigQuery table that can be easily ingested into a tool such as Tableau or Qlik.
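To make the two stages a little more concrete, here is a rough sketch of the data flow in plain Python. It is not the Spark machine learning job itself: a simple stopword filter stands in for the real parsing, and the function names, the `STOPWORDS` set and the `keyword`/`count` row layout are assumptions made up for this example.

```python
import json
import re
from collections import Counter

# Assumed, deliberately tiny stopword list for the illustration.
STOPWORDS = {"the", "a", "an", "and", "is", "to", "of", "in", "on", "for", "with"}


def extract_keywords(text):
    """Lowercase a tweet and keep hashtags and words that are not stopwords."""
    tokens = re.findall(r"#?\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]


def tweets_to_json_lines(tweets):
    """Stage 1: parse each tweet into one JSON record with its keywords."""
    return [json.dumps({"text": t, "keywords": extract_keywords(t)}) for t in tweets]


def keyword_rows(json_lines):
    """Stage 2: read the JSON records and build table-style rows of keyword counts."""
    counts = Counter()
    for line in json_lines:
        counts.update(json.loads(line)["keywords"])
    return [{"keyword": k, "count": c} for k, c in counts.most_common()]


tweets = ["Spark on #Dataproc is fast", "Loving #Dataproc and Talend"]
rows = keyword_rows(tweets_to_json_lines(tweets))
for row in rows:
    print(row)
```

Keeping JSON as the hand-off between the two stages mirrors the job described above: the first process only has to produce records, and the second only has to read them and shape rows for the table.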

At the time of writing, Talend supports these types of machine learning and processing jobs on the latest Google Dataproc version, which supports Apache Spark 2.1, for any Spark processing supported on Talend Real-time Big Data Platform. This enables you to handle streaming use cases on the Google Pub/Sub architecture as well as work with all of Google's big data storage types, such as Google Storage and BigQuery. I hope this will inspire you to take a closer look at how you can quickly build data processing pipelines with Google Cloud Platform and Talend today!