Apache Beam in Action: Same Code, Several Execution Engines

In my previous article, I provided a quick introduction to Apache Beam, a new distributed processing tool that's currently being incubated at the ASF. Apache Beam provides an abstraction layer allowing developers to focus on their Beam code, using the Beam programming model. Thanks to Apache Beam, an implementation is agnostic to the runtime technology being used, meaning you can switch runtimes quickly and easily.

Now that Apache Beam 0.2.0-incubating has just been released, it's a perfect time to dig into the key features that the technology provides. In this article, I'll walk through a first pipeline use case, and then show you how to execute the same pipeline code on different execution engines.

Context: GDELT analyses

We are going to create a pipeline to analyze data from the GDELT project. GDELT monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages, identifying the people, locations, organizations, counts, themes, sources, emotions, quotes, images, and events driving our global society every second of every day, creating a free open platform for computing on the entire world. It publishes daily CSV files containing one line per event.

For example, an event from the CSV file looks like:

Our purpose in using Apache Beam is to process each event (each line), extract the location code (JPN), and group the events per location. Then, we will be able to simply count the number of events per location. The location of the CSV file should be an option of the pipeline.
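The extraction step itself is plain string handling. A minimal sketch, assuming tab-separated fields; the column index used below (`FIELD_INDEX`) is illustrative, not the actual position of the country code in the real GDELT schema:

```java
// Sketch: extract a location code from one GDELT event line.
// Assumption: fields are tab-separated; FIELD_INDEX is a hypothetical
// column position, not the real GDELT schema index.
public class LocationExtractor {

    static final int FIELD_INDEX = 7; // illustrative column of the location code

    static String extractLocation(String line) {
        String[] fields = line.split("\\t");
        if (fields.length > FIELD_INDEX && !fields[FIELD_INDEX].isEmpty()) {
            return fields[FIELD_INDEX];
        }
        return "NA"; // fallback when the column is missing or empty
    }

    public static void main(String[] args) {
        String sample = "a\tb\tc\td\te\tf\tg\tJPN\tmore";
        System.out.println(extractLocation(sample)); // prints "JPN"
    }
}
```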

To do this, we will need to create a “wrapper” class. It will contain the pipeline options definition, and a main method running the pipeline:
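A minimal sketch of such a wrapper, assuming the Beam Java SDK (0.2.0-incubating) and a runner are on the classpath; the class name, option name, output path, and the column index of the location code are all illustrative:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class GdeltPipeline {

  // Pipeline options: the location of the CSV file is configurable (--input=...)
  public interface Options extends PipelineOptions {
    @Description("Path of the GDELT CSV file to read from")
    String getInput();
    void setInput(String value);
  }

  public static void main(String[] args) {
    Options options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply("ReadEvents", TextIO.Read.from(options.getInput()))
        .apply("ExtractLocation", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String[] fields = c.element().split("\\t");
            // hypothetical index of the location code column
            c.output(fields.length > 7 ? fields[7] : "NA");
          }
        }))
        .apply("CountPerLocation", Count.<String>perElement())
        .apply("FormatResult", ParDo.of(new DoFn<KV<String, Long>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(c.element().getKey() + ": " + c.element().getValue());
          }
        }))
        .apply("WriteCounts", TextIO.Write.to("counts"));

    pipeline.run();
  }
}
```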

Here, we want to show one key Beam feature: the same code running on different execution runtimes. To do that, we will use Maven profiles. Each profile defines a specific runner dependency set. We can then execute our pipeline (exactly the same code) on a target runner simply by specifying a JVM argument identifying the runner.

Direct runner

Let’s start with the Direct runner. This is the preferred runner for testing: it runs the pipeline on several threads inside the local JVM.

It’s pretty easy to use as it just requires a single dependency. So, we create a Maven profile with the Direct runner dependency:


<profile>
  <id>direct-runner</id>
  <activation>
    <activeByDefault>true</activeByDefault>
  </activation>
  <dependencies>
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-runners-direct-java</artifactId>
      <version>0.2.0-incubating</version>
    </dependency>
  </dependencies>
</profile>

We can now run our pipeline on this runner. For that, we use our direct-runner profile together with the --runner=DirectRunner JVM argument:
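A sketch of the command, assuming the exec-maven-plugin is available and using an illustrative main class name and input path:

```shell
# Run the pipeline with the direct-runner profile.
# Main class and input path are illustrative.
mvn compile exec:java \
  -Pdirect-runner \
  -Dexec.mainClass=org.apache.beam.samples.GdeltPipeline \
  -Dexec.args="--runner=DirectRunner --input=/path/to/gdelt.csv"
```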

Apache Spark runner

The Spark runner requires more dependencies (due to the Apache Spark runtime). So, again, we create a Maven profile to easily define them: the Beam Spark runner, the Apache Spark engine itself, and the required Jackson dependencies.


<profile>
  <id>spark-runner</id>
  <dependencies>
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-runners-spark</artifactId>
      <version>0.2.0-incubating</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.6.2</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-core</artifactId>
      <version>2.7.2</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-annotations</artifactId>
      <version>2.7.2</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.7.2</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.module</groupId>
      <artifactId>jackson-module-scala_2.10</artifactId>
      <version>2.7.2</version>
    </dependency>
  </dependencies>
</profile>

Similar to the Direct runner, we can use the spark-runner profile together with the --runner=SparkRunner JVM argument to execute our pipeline on Apache Spark. Basically, this performs the equivalent of a spark-submit:
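The command is the same shape as for the Direct runner, only the profile and runner change; the main class name and input path below are illustrative:

```shell
# Run the exact same pipeline code, this time on Apache Spark.
mvn compile exec:java \
  -Pspark-runner \
  -Dexec.mainClass=org.apache.beam.samples.GdeltPipeline \
  -Dexec.args="--runner=SparkRunner --input=/path/to/gdelt.csv"
```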

Conclusion

In this article, we saw how to execute exactly the same code (no change at all in the pipeline definition) on different execution engines; the same approach extends to other runners such as Google Dataflow and Apache Flink. With Apache Beam, we can easily switch from one engine to another simply by changing the Maven profile and the runner argument. In a future article, we will take a deeper look at the Beam IOs: the concepts (sources, sinks, watermark, split, …) and how to use and write a custom IO.