CDAP Workflows: In Comparison with Apache Oozie

August 12, 2015

Bhooshan Mogal is a Software Engineer at Cask, where he is working on making data application development fun and simple. Before Cask, he worked on a unified storage abstraction for Hadoop at Pivotal and personalization systems at Yahoo.

Apache Oozie is a workflow scheduler system to manage Apache Hadoop™ jobs. It is one of the most popular open-source workflow scheduler systems for Hadoop. Cask Data Application Platform (CDAP) is an open-source platform to build and deploy data applications on Hadoop. CDAP provides abstractions on top of Hadoop that enable developers to rapidly build, deploy, and manage real-time as well as batch data applications. CDAP includes a built-in workflow and scheduler system. This blog post introduces CDAP workflows by way of a functional comparison with Oozie.

Architecture and Design

The Oozie server is a Java web-application that runs in a Java servlet container such as Apache Tomcat, running on a client/gateway node for a Hadoop cluster.

The CDAP workflow engine, on the other hand, runs as part of the CDAP server (the AppFabric Service that runs inside the CDAP Master). CDAP workflows run as YARN applications (submitted/managed using Apache Twill).

Packaging, Distribution, and Installation

Since Oozie is a distinct Java web-application, it needs to be installed separately from other components in a Hadoop cluster, following a detailed, often cumbersome installation procedure. The CDAP workflow system, however, is simply a sub-component of CDAP; if you have installed CDAP on your cluster, you do not need to install another component for running workflows. Similarly, setting up Oozie in a secure environment requires you to secure an extra service (the Oozie server), whereas in CDAP, you do not need to secure an extra service just for running workflows.

Backing Metadata Store

Oozie uses a relational database as its backing metadata store. It currently supports HSQLDB, Apache Derby, MySQL, Oracle, and PostgreSQL for this purpose. CDAP workflows, on the other hand, use the CDAP Metastore as their backing metadata store. The CDAP Metastore uses Apache HBase in distributed mode and LevelDB in standalone mode. The program archives for CDAP workflows are stored on the file system, for which CDAP uses HDFS (or a similar distributed file system like MapRFS) in distributed mode or the local file system in standalone mode.

Driver Program

Oozie uses a map-only MapReduce job with a single mapper as a launcher job that is spawned for each action node. CDAP workflows use a YARN application developed using Apache Twill as their driver program.

API

Oozie workflow and coordinator definitions are written in hPDL (an XML Process Definition Language similar to the JBoss jBPM jPDL). Oozie also allows users familiar with Java to configure actions using Java code by implementing the OozieActionConfigurator interface. However, the Java code must still be referenced from within the workflow XML definition. CDAP workflows, like all other CDAP abstractions, are defined through a Java programming API.
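To make the contrast with an XML definition concrete, here is a minimal, self-contained sketch of what defining a workflow in plain Java can look like. The WorkflowSketch class, its methods, and the action names are hypothetical stand-ins written for this illustration; they are not the CDAP API, whose actual classes are described in the CDAP documentation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a workflow defined in code; not a CDAP class.
final class WorkflowSketch {
  private final String name;
  private final List<String> nodes = new ArrayList<>();

  WorkflowSketch(String name) {
    this.name = name;
  }

  // Appends an action node and returns this, so calls can be chained.
  WorkflowSketch addAction(String actionName) {
    nodes.add(actionName);
    return this;
  }

  String getName() {
    return name;
  }

  // Read-only view of the nodes, in the order they were added.
  List<String> getNodes() {
    return Collections.unmodifiableList(nodes);
  }

  // Example definition; the workflow and action names are made up.
  static WorkflowSketch purchasePipeline() {
    return new WorkflowSketch("PurchasePipeline")
        .addAction("IngestPurchases")
        .addAction("AggregatePurchases")
        .addAction("PublishReport");
  }
}
```

Because the definition is ordinary code, it is compiled and packaged together with the rest of the application rather than maintained as a separate XML file.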

Application Deployment

To deploy Oozie applications, users first have to define their workflow and coordinator XML files along with a job.properties file (containing parameters, the location of the workflow/coordinator XMLs, etc.). These files have to be packaged with the libraries (.jar files) that the application requires. Oozie provides a Java Client API and a Command Line Interface (the Oozie CLI) to submit these applications to the Oozie server.

CDAP workflows, on the other hand, are Java classes that are built into a CDAP application (also Java code). The CDAP application is then submitted to the CDAP service as a single jar file. CDAP provides REST APIs, Java Clients, the CDAP CLI, or the CDAP UI to deploy applications. Once an application is deployed, users can start, stop, and manage workflows in the application like any other program in CDAP, using any of the interfaces listed previously.

Workflow Functional Specification

Action Nodes

Action nodes are nodes in a workflow that run computation or processing tasks.

Control Nodes

Control nodes define the beginning and end of a workflow. They also provide a mechanism to control the workflow execution path.

Oozie supports these control nodes:

Start control node: Defines the beginning of the workflow.

End control node: Defines the end of a workflow, indicating that the workflow has completed successfully.

Kill control node: Defines the end of a workflow, indicating that the workflow failed.

Decision control node: Enables the workflow to make a selection on which path to follow. It can be seen as a “switch-case” statement. Predicates are defined in the JSP Expression Language (EL).

Fork and Join control nodes: A fork node splits one path of execution into multiple concurrent paths of execution. A join node waits until every concurrent execution path of a previous fork node has reached it.

CDAP workflows support these control nodes:

Start node: Signals the start of a workflow.

End node: Signals the end of a workflow. CDAP workflows do not distinguish between failure nodes and success nodes. The state of the workflow after it has completed execution indicates whether the workflow succeeded or failed.

Condition node: Allows conditional execution of nodes in a workflow. A condition is defined as a named Java class that implements Predicate, whose success or failure determines the path to follow. As opposed to the “switch-case” model of condition nodes in Oozie, CDAP workflows follow an “if-else” model.

Fork and Join nodes: Similar to Oozie, a fork node splits one path of execution into multiple concurrent paths of execution, while a join node waits until all concurrent paths from the corresponding fork node have completed execution.
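The "if-else" condition model and the fork/join semantics described above can be sketched in a few lines of plain Java. This is only an illustration of the control flow: ControlNodeSketch and its method names are hypothetical and use java.util.function.Predicate and CompletableFuture from the JDK, not the CDAP or Oozie APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical model of two control nodes; not CDAP or Oozie classes.
final class ControlNodeSketch {

  // Condition node, "if-else" style: one predicate, exactly two branches.
  static <T> String chooseBranch(Predicate<T> condition, T input,
                                 String ifBranch, String elseBranch) {
    return condition.test(input) ? ifBranch : elseBranch;
  }

  // Fork: start every branch concurrently.
  // Join: block until all branches have completed.
  static <R> List<R> forkJoin(List<Supplier<R>> branches) {
    List<CompletableFuture<R>> running = new ArrayList<>();
    for (Supplier<R> branch : branches) {
      running.add(CompletableFuture.supplyAsync(branch)); // fork
    }
    List<R> results = new ArrayList<>();
    for (CompletableFuture<R> future : running) {
      results.add(future.join()); // waits for this branch to finish
    }
    return results; // same order as the branches were listed
  }
}
```

A "switch-case" decision, as in Oozie, would instead evaluate an ordered list of predicates and take the first one that matches, falling back to a default branch.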

Workflow Parameterization

Both Oozie and CDAP workflows can be parameterized. Oozie allows this using the JSP Expression Language (EL) with some built-in constants and functions, while CDAP workflows allow parameterization using Preferences and Runtime Arguments.

Workflow Jobs Recovery (or Re-Run)

Oozie provides a mechanism to restart a workflow from the node which failed in a previous run. This is especially useful when a workflow has nodes that were successful in a previous run but are too expensive to be re-run. CDAP workflows currently do not provide this feature, but it is a critical feature on the CDAP roadmap.
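The idea behind such a re-run can be sketched as follows, assuming a simple linear workflow. RerunSketch is a hypothetical helper that models only the "skip what already succeeded" logic; it does not reflect how Oozie implements recovery internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper; models only the node-skipping logic of a re-run.
final class RerunSketch {

  // Returns the nodes a re-run would execute: every node that did not
  // succeed in the previous run, kept in the original execution order.
  static List<String> nodesToRerun(List<String> orderedNodes,
                                   Set<String> succeededLastRun) {
    List<String> pending = new ArrayList<>();
    for (String node : orderedNodes) {
      if (!succeededLastRun.contains(node)) {
        pending.add(node);
      }
    }
    return pending;
  }
}
```

For a workflow Extract, Transform, Load in which only Extract succeeded, the re-run would execute Transform and Load and leave the expensive Extract step alone.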

Passing Data between Nodes

Oozie nodes have a <capture-output> element which can capture the output of a command. You can then set the captured output as an argument of a subsequent node in your workflow XML. As of version 3.1.0, CDAP Workflows include a Workflow Token that is available to every node in the workflow, and which nodes can read from and write to. Users can:

pass custom data between nodes;

use the workflow token in Condition Predicates to determine the path of execution in a workflow;

use the workflow token in Action nodes to determine the configuration of an Action;

query the workflow token for a given scope, node, or for a specified key.
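The capabilities listed above can be pictured with a small in-memory model. TokenSketch below is a hypothetical class written for this illustration, not the CDAP Workflow Token API; it leaves out scopes and only shows writes tagged with the node that made them, plus lookups by key and by key and node.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a shared workflow token; not a CDAP class.
final class TokenSketch {

  // One recorded write: which node wrote which value.
  static final class NodeValue {
    final String node;
    final String value;

    NodeValue(String node, String value) {
      this.node = node;
      this.value = value;
    }
  }

  // key -> writes, in the order they happened
  private final Map<String, List<NodeValue>> entries = new HashMap<>();

  void put(String node, String key, String value) {
    entries.computeIfAbsent(key, k -> new ArrayList<>())
        .add(new NodeValue(node, value));
  }

  // Most recent value written for the key by any node, or null if none.
  String get(String key) {
    List<NodeValue> writes = entries.get(key);
    if (writes == null || writes.isEmpty()) {
      return null;
    }
    return writes.get(writes.size() - 1).value;
  }

  // Most recent value written for the key by the given node, or null.
  String get(String key, String node) {
    List<NodeValue> writes = entries.get(key);
    if (writes == null) {
      return null;
    }
    for (int i = writes.size() - 1; i >= 0; i--) {
      if (writes.get(i).node.equals(node)) {
        return writes.get(i).value;
      }
    }
    return null;
  }
}
```

A condition predicate could then read a value such as a record count from the token and choose a branch based on it, which is how custom data written by one node can steer the execution path of later nodes.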

Scheduler Functional Specification

Both Oozie and CDAP have a built-in scheduler system to allow scheduling of workflows based on both time-intervals and data availability. Oozie provides a Coordinator for scheduling purposes. The time-based scheduler in the coordinator uses JSP Expression Language (EL) syntax. This can also be used to parameterize scheduler expressions using built-in constants and functions. The time scheduler in CDAP is a cron-expression based scheduler. Users specify a cron expression as part of the schedule definition.
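For readers unfamiliar with cron expressions, the sketch below shows how a classic five-field expression (minute, hour, day of month, month, day of week) selects points in time. CronSketch is a hypothetical, deliberately simplified matcher: it handles only "*", "*/step", single numbers, and comma-separated lists, and it requires all five fields to match. The exact cron dialect accepted by CDAP is not covered here and should be checked against the CDAP documentation.

```java
import java.time.LocalDateTime;

// Simplified matcher for 5-field cron expressions:
// minute hour day-of-month month day-of-week (0 = Sunday).
// Ranges ("1-5") and names ("MON") are intentionally not supported.
final class CronSketch {

  static boolean matches(String expression, LocalDateTime time) {
    String[] fields = expression.trim().split("\\s+");
    if (fields.length != 5) {
      throw new IllegalArgumentException("expected 5 fields: " + expression);
    }
    // java.time uses 1 = Monday .. 7 = Sunday; map Sunday to 0.
    int dayOfWeek = time.getDayOfWeek().getValue() % 7;
    return fieldMatches(fields[0], time.getMinute(), 0)
        && fieldMatches(fields[1], time.getHour(), 0)
        && fieldMatches(fields[2], time.getDayOfMonth(), 1)
        && fieldMatches(fields[3], time.getMonthValue(), 1)
        && fieldMatches(fields[4], dayOfWeek, 0);
  }

  // min is the smallest legal value of the field, so that "*/step"
  // counts from the start of the field's range.
  private static boolean fieldMatches(String field, int value, int min) {
    for (String part : field.split(",")) {
      if (part.equals("*")) {
        return true;
      }
      if (part.startsWith("*/")) {
        int step = Integer.parseInt(part.substring(2));
        if ((value - min) % step == 0) {
          return true;
        }
      } else if (Integer.parseInt(part) == value) {
        return true;
      }
    }
    return false;
  }
}
```

With this matcher, the expression "0 4 * * *" selects 04:00 every day, and "*/15 * * * *" selects every quarter of an hour.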

For data-availability based schedules, the Oozie coordinator allows users to define datasets, and then schedule workflows based on their availability. CDAP provides a Stream Size Scheduler that can trigger the execution of a workflow based on availability of data in a Stream.

Concurrent Runs of a Workflow

Both the Oozie coordinator and CDAP allow multiple concurrent runs of a workflow. The coordinator allows applications to specify the concurrency level in the <concurrency> tag of the coordinator XML. CDAP also allows concurrent runs of a workflow, with the maximum number of concurrent runs governed by the scheduler.max.thread.pool.size parameter in cdap-site.xml.
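How a thread pool size caps the number of simultaneous runs can be illustrated with a fixed-size pool from the JDK. ConcurrencySketch is a hypothetical demonstration of that general mechanism only; it does not reproduce the CDAP scheduler, and the sleep call merely stands in for real work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demonstration; not the CDAP scheduler.
final class ConcurrencySketch {

  // Submits totalRuns simulated workflow runs to a pool of maxConcurrent
  // threads and returns the highest number of runs seen executing at once.
  static int peakConcurrency(int totalRuns, int maxConcurrent) {
    ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
    AtomicInteger active = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    List<Future<?>> runs = new ArrayList<>();
    try {
      for (int i = 0; i < totalRuns; i++) {
        runs.add(pool.submit(() -> {
          int now = active.incrementAndGet();
          peak.accumulateAndGet(now, Math::max);
          try {
            Thread.sleep(20); // stand-in for real work
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          } finally {
            active.decrementAndGet();
          }
        }));
      }
      for (Future<?> run : runs) {
        run.get(); // wait for every run to finish
      }
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
    return peak.get();
  }
}
```

Submitting eight runs to a pool of three threads never yields more than three runs in flight; the remaining runs wait in the queue until a thread becomes free.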

UI

Oozie comes with a read-only UI by default. It also requires a non-Apache licensed library, which makes it difficult to install and use. An alternative UI, Hue, is catching up with more advanced features for managing workflows. However, it is a separate project that is not included with Oozie.

CDAP workflows are integrated into the CDAP UI, which exposes capabilities to deploy applications containing workflows; set preferences and runtime arguments for workflows; start, stop, suspend, resume, and view workflow runs; as well as view workflow metrics and logs. The screenshot below shows the workflow page for a run of the WikipediaPipeline example application that is bundled with CDAP:

Monitoring

The Oozie server is instrumented and can publish notifications to JMS providers. Notifications can then be consumed by any JMS consumer. CDAP workflows use the CDAP monitoring system, ‘Watchdog’, which publishes metrics and logs (for both applications and the CDAP system) via Apache Kafka. The CDAP monitoring system is well integrated with the CDAP UI.

Licensing/Governance

Both Oozie and CDAP are open-source software released under the Apache License, Version 2.0. Oozie is governed by the Apache Software Foundation, while CDAP is governed by the CDAP Community.

Summary

The following table summarizes the functional capabilities of Oozie and CDAP Workflows.

In summary, this blog post introduced CDAP workflows and compared them functionally with Oozie. Stay tuned for future posts that will discuss some of the capabilities of CDAP workflows in greater detail. In the meantime, feel free to try out our examples, reach out to us with any questions, and consider helping us develop the platform.