A First for Apache Beam

When he joined Talend, Laurent brought 17 years of software experience, including management and executive roles in customer support and product development. Most recently, Laurent was CTO at Axway, where he was responsible for R&D, innovation, and product management. He also spent more than nine years in Silicon Valley, working for Business Objects and then SAP. Laurent holds an engineering degree in mathematics and computer science from EISTI.

At Talend, we like to be first. Back in 2014, we made a bet on Apache Spark for our Talend Data Fabric platform, and it paid off beyond our expectations. Since then, most of our competitors have tried to catch up…

Last year we announced that we were joining forces with Google, PayPal, DataTorrent, data Artisans, and Cloudera to work on Apache Beam, which has since become an Apache Top-Level Project.

On January 23, 2017, we released Winter '17, our latest integration platform release, which included Talend Data Preparation on Big Data. In this blog, I'd like to drill a bit deeper into the technology and architecture behind it, and into how we are leveraging Apache Beam for scale and runtime agility.

1) Architecture

a) Overview

Figure 1 below shows the high-level architecture of Talend Data Preparation on Big Data, covering both the application layer and the backend server side.

You’ll notice the Beam JobServer and, more specifically, the Beam Compiler, which generates an Apache Beam pipeline from the JSON document, as well as the Beam runners, where we specify the set of properties for the target Apache Beam runner (Spark, Flink, Apex, or Google Dataflow).

Note that in our Winter '17 version, the only Apache Beam runner we support for the full run is Spark.
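To make the runner-selection idea concrete, here is a minimal, hypothetical sketch in plain Python of how per-runner properties could be turned into Beam-style `--key=value` pipeline options. The runner class names (`SparkRunner`, `FlinkRunner`, `DataflowRunner`) are real Apache Beam runner identifiers; everything else (the property keys, the `runner_options` helper) is illustrative and not Talend's actual implementation.

```python
# Hypothetical sketch: per-runner properties for a Beam pipeline.
# The runner names are real Beam runners; the property keys and this
# helper are illustrative assumptions, not Talend's API.
RUNNER_PROPERTIES = {
    "spark":    {"runner": "SparkRunner", "spark.master": "yarn"},
    "flink":    {"runner": "FlinkRunner", "flink.master": "localhost:8081"},
    "apex":     {"runner": "ApexRunner"},
    "dataflow": {"runner": "DataflowRunner", "project": "my-gcp-project"},
}

def runner_options(target):
    """Turn a property map into Beam-style --key=value pipeline options."""
    props = RUNNER_PROPERTIES[target]
    return ["--{}={}".format(key, value) for key, value in props.items()]

print(runner_options("spark"))
# -> ['--runner=SparkRunner', '--spark.master=yarn']
```

Keeping runner properties in one table like this is what makes the backend runtime-agnostic: switching from Spark to another runner is a configuration change rather than a code change.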

Figure 1. Talend Data Preparation with Apache Beam runtime

b) Workflow

Figure 2. From preparation DSL to Apache Beam pipeline

The Beam Compiler is invoked to transform the DSL into an optimized Beam Pipeline where the source, sink, and various actions are defined.
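The core idea of the Beam Compiler can be sketched as follows: take an ordered list of actions from a JSON recipe and compose them into a single pipeline function applied to each record. This is a plain-Python illustration under stated assumptions; the action names, the recipe shape, and the `compile_recipe` helper are all hypothetical, and the real compiler targets the Apache Beam SDK rather than bare callables.

```python
import json

# Illustrative sketch of a "recipe compiler": each recipe step maps to a
# record-level transform, and the steps are applied in order. This is NOT
# Talend's compiler nor the Beam API -- just the compilation idea.
ACTIONS = {
    "trim":      lambda row, col: {**row, col: row[col].strip()},
    "uppercase": lambda row, col: {**row, col: row[col].upper()},
}

def compile_recipe(recipe_json):
    """Compile a JSON recipe into a function that transforms one record."""
    steps = json.loads(recipe_json)["actions"]
    def pipeline(row):
        for step in steps:
            row = ACTIONS[step["action"]](row, step["column"])
        return row
    return pipeline

recipe = ('{"actions": [{"action": "trim", "column": "name"},'
          ' {"action": "uppercase", "column": "name"}]}')
run = compile_recipe(recipe)
print(run({"name": "  alice ", "city": "Paris"}))
# -> {'name': 'ALICE', 'city': 'Paris'}
```

In the real system, each compiled step becomes a transform in the Beam pipeline, so the same recipe can execute on any supported runner.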

2) Details: What Gets Generated

a) Preparation DSL: JSON Document

Figure 3. Build your Data Preparation Recipe

As you apply your cleaning and enrichment steps, Talend Data Preparation generates a recipe which then gets transformed into a JSON document.

In the JSON example below, the input is a .csv file stored in HDFS. The file contains only two string columns.
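The original JSON document is not reproduced here, so the fragment below only illustrates the general shape such a preparation document might take: an input definition, a schema, and an ordered list of actions. Every field name, path, and action in it is an assumption for illustration, not Talend's actual DSL.

```json
{
  "input": {
    "type": "hdfs",
    "path": "hdfs:///data/customers.csv",
    "format": "csv"
  },
  "columns": [
    {"name": "firstname", "type": "string"},
    {"name": "lastname", "type": "string"}
  ],
  "actions": [
    {"action": "uppercase", "column": "lastname"}
  ]
}
```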

Talend Data Preparation is the first Talend Big Data application to leverage the portability and richness of Apache Beam. Moving forward, Apache Beam's footprint will continue to grow as part of Talend's technology strategy, and the backend presented in this blog will be reused by other applications in both batch and streaming contexts, where Apache Beam and its runners will be used to their full extent. Stay tuned for more information!