What’s New in CDH3b2: Oozie

Hadoop has emerged as an indispensable component of any data-intensive enterprise infrastructure. In many ways, working with large datasets on a distributed computing platform (powered by commodity hardware or cloud infrastructure) has never been easier. But because customers are running clusters consisting of hundreds or thousands of nodes, and are processing massive quantities of data from production systems every hour, the logistics of efficient platform utilization can quickly become overwhelming.

Why create a new workflow system?

You might wonder why a new workflow system is necessary for Hadoop, given that there are quite a few existing commercial and open-source systems available. While it is possible to use existing general-purpose workflow systems with Hadoop, it is anything but simple. Intricacies such as monitoring long-running jobs and interfacing with the distributed file system mean that porting a general workflow system to the Hadoop environment requires extensive work. Oozie, on the other hand, is designed specifically for the Hadoop platform and uses it as its execution environment. It has built-in support for Hadoop tasks and integrates with this environment cleanly. Oozie itself is fairly lightweight, requires minimal configuration, and scales linearly, thus offering a sustainable approach to building workflows in the Hadoop environment.

Still not convinced about Oozie? Consider these numbers for a moment. According to the Oozie presentation at the Hadoop Summit in June, there are more than 4,800 workflow applications deployed within Yahoo! at the moment, with the largest workflow containing 2,000 actions. The Yahoo! infrastructure team executed roughly 55,000 workflow jobs in the month of May 2010 alone, with some workflows running for many hours.

A Simple Use-Case

Consider the example of web log analysis. For a typical operation that deals with a few gigabytes of log data every day, analyzing it can involve many steps. First, the files have to be moved into a certain location. Next, the files are used to create new tables or partitions in Hive, which are then queried to see if certain criteria have been met. For instance, if the number of accesses for a particular resource exceeds a certain threshold, some notifications must be generated. Regardless of the outcome of this analysis, certain other queries need to be run in order to populate other tables that record rolled-up information.

While these steps are not difficult to execute, they are repetitive and time-consuming. Ideally, such steps should be automated so that operators are notified when the system discovers something interesting or when a failure of some sort occurs. That is exactly what Oozie does.

Using Oozie, all the steps outlined in this example can be modeled as a workflow which can be executed with a single command. Once the workflow takes off, you can sit back and relax while Oozie runs through each step of the flow.

Oozie Highlights

An Oozie workflow bundles the workflow definition, any libraries necessary for the execution of workflow actions, and the properties needed to resolve parameterized values in the workflow. Together, this bundle is referred to as an Oozie application or, informally, a workflow. Workflows are deployed to the Oozie server using a command-line utility. Once deployed, they can be started and manipulated as necessary using the same utility. The web console for the Oozie server can be used to monitor the progress of the various workflow jobs being managed by the server.

Scalability

Oozie is a server-based web application that uses a transactional store to manage workflow metadata and execution states. It relies on HTTP-based notifications and a polling mechanism to monitor the progress of workflows and to manage their runtime state. The Oozie server itself does not do any work other than this state management. All of the actual work is delegated to worker nodes within the cluster on which the workflow executes. This allows Oozie to scale horizontally by adding more Oozie servers pointing to the same workflow metadata store.

Resilience

When a workflow execution encounters a transient error condition, Oozie automatically attempts to execute the action again. In situations where the error requires user intervention, Oozie can suspend the workflow indefinitely, allowing the administrator to step in, take corrective action, and resume the workflow. For long-running workflows that fail, Oozie provides a mechanism by which the workflow can be restarted from the point of failure, to avoid redoing steps that have already completed.
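The retry-then-suspend behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not Oozie's implementation: the exception classes, the `execute_with_retry` function and the `max_retries` limit are hypothetical names chosen for this example, and suspending once the retries are used up is an assumption made to keep the sketch small.

```python
class TransientError(Exception):
    """Hypothetical: an error that may disappear if the action is retried."""


class NonTransientError(Exception):
    """Hypothetical: an error that needs an operator to fix something first."""


def execute_with_retry(action, max_retries=3):
    """Run `action`, retrying transient failures up to `max_retries` times.

    Returns a (state, attempts) tuple, where state is "OK" or "SUSPENDED".
    A real system would normally wait between attempts; the wait is left
    out here so the sketch runs instantly.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            action()
            return ("OK", attempts)
        except TransientError:
            # Retry automatically until the retry budget is used up.
            if attempts > max_retries:
                return ("SUSPENDED", attempts)
        except NonTransientError:
            # Needs human intervention: stop and wait for a resume.
            return ("SUSPENDED", attempts)


def make_flaky_action(failures_before_success):
    """Build an action that fails transiently a fixed number of times."""
    remaining = {"count": failures_before_success}

    def action():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TransientError("temporary failure")

    return action


def needs_operator():
    raise NonTransientError("missing input directory")


flaky_result = execute_with_retry(make_flaky_action(2))
blocked_result = execute_with_retry(needs_operator)
```

In this model, an action that fails twice and then succeeds ends in the "OK" state after three attempts, while an action that raises the non-transient error is suspended immediately.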

Simple and Intuitive

Workflows in Oozie are expressed in a simple XML representation that is inspired by the process definition language JPDL. However, compared to the overall JPDL schema, the Oozie schema is greatly simplified and intuitive. The key concepts in an Oozie workflow are action nodes and control-flow nodes. Action nodes do the workflow tasks, such as moving files, running Map/Reduce jobs, and running Hive queries. Control-flow nodes govern the progress of the workflow from action to action, enabling things like error handling, conditional execution and branching logic.

Together, the action and control-flow nodes are arranged in a directed acyclic graph (DAG), which represents the overall workflow. This DAG is executed by the Oozie server in a controlled-dependency manner, meaning that a node is executed if and only if all of the nodes that it depends upon have been executed successfully. This is very similar to how one would manually implement the workflow: by initiating actions and starting follow-up actions when the previous ones have completed with the expected outcome.

The bottom line is that if you know the steps you need to take for managing data in your Hadoop environment, you can easily express them as a workflow and hand it off to Oozie for execution.
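To make the controlled-dependency rule concrete, here is a small Python sketch that models the web log example as a graph and runs each step only after everything it depends on has succeeded. It is a toy model in plain Python, not Oozie's XML format or its execution engine; the step names, the `run_workflow` function and the "SKIPPED" state are assumptions made for illustration.

```python
def run_workflow(actions, depends_on):
    """Run every action whose dependencies have all finished successfully.

    `actions` maps a node name to a callable returning True on success.
    `depends_on` maps a node name to the list of nodes it depends on.
    Returns a dict mapping each node name to its final state.
    """
    status = {name: "PENDING" for name in actions}
    progressed = True
    while progressed:
        progressed = False
        for name in actions:
            if status[name] != "PENDING":
                continue
            deps = depends_on.get(name, [])
            if any(status[d] in ("FAILED", "SKIPPED") for d in deps):
                # A dependency did not succeed, so this node never runs.
                status[name] = "SKIPPED"
                progressed = True
            elif all(status[d] == "OK" for d in deps):
                status[name] = "OK" if actions[name]() else "FAILED"
                progressed = True
    return status


executed = []  # records the order in which steps actually ran


def step(name, succeeds=True):
    """Build a stand-in action that only records that it ran."""
    def action():
        executed.append(name)
        return succeeds
    return action


# The log-analysis steps from the use case, with hypothetical names.
depends_on = {
    "move_files": [],
    "create_partitions": ["move_files"],
    "check_threshold": ["create_partitions"],
    "rollup_queries": ["create_partitions"],
}
actions = {name: step(name) for name in depends_on}
result = run_workflow(actions, depends_on)

# Same graph, but the Hive partition step fails this time.
failing = dict(actions, create_partitions=step("create_partitions", succeeds=False))
failed_result = run_workflow(failing, depends_on)
```

In the second run, the two steps that depend on `create_partitions` are never started, which is the "if and only if" rule at work.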

Rich Set of Features

The core feature set of Oozie is designed to cover the most commonly used functionality of the Hadoop platform. The key objective behind these features is to ensure that anything done manually can be implemented as a workflow task, down to the last detail. The following list, while far from exhaustive, describes some of the many features that Oozie has.

1. Parameterization: With parameterization support, workflows can be written once and executed many times with different parameter bindings. This allows reuse of workflows in a manner that promotes ease of maintenance and management.

2. Fine-Grain and Coarse-Grain Notification Support: Workflows in Oozie can be configured to notify external systems at varying degrees of granularity. Notifications can be raised when a workflow changes its overall state, or when an individual action within it changes state. These notifications are implemented as HTTP GET requests, which can pass extra information, such as the job identifier, to the receiver. Using this mechanism, external systems can be integrated at various stages of the workflow as necessary.

3. User Propagation: The user and group information associated with the workflow job is propagated by Oozie to the underlying action execution and cannot be overwritten. This allows Oozie to work together with Hadoop security to ensure that actions are authorized to access and manipulate data where applicable.

4. Java Client API: Oozie provides a programmatic client API that external systems can call to integrate more closely with Oozie. This API provides functionality equivalent to that of the command-line client utility.

5. Web Services API: Oozie also provides a rich REST/JSON API for web-services integration. Clients that prefer this interface can use it to directly access and manipulate workflows running on the Oozie server.

6. Built-in actions: The default actions provided by Oozie cover the vast majority of use cases for workflows, including:

Map/Reduce action: Allows you to model Map/Reduce jobs.

Streaming Map/Reduce action: Allows you to specify executable mappers and reducers that can be plugged in using the streaming support.

Pig action: Allows you to run custom Pig scripts and tasks.

FS action: Allows you to manipulate the Hadoop file system as necessary.

Sub-Workflow action: Allows a workflow to be executed within another workflow.

Java action: Allows you to plug in any Java program that has a main method for direct execution.

Hive action: Allows you to run Hive commands from within the workflow. This action is contributed by Cloudera.

Sqoop action: Allows you to run Sqoop commands from within the workflow. This action is contributed by Cloudera.

7. Built-in control-flow nodes: The control-flow nodes provided by Oozie allow you to create sophisticated workflow graphs. Together with built-in support for JSP Expression Language functions, the control-flow nodes can be used for creating conditional executions where necessary. These include the following:

Fork and Join node: A fork node splits a workflow execution path into multiple concurrent paths of execution. A join node merges the branches created by a fork node back into a single execution path. Fork and join nodes must be used in pairs, and they allow efficient parallel execution of tasks that do not directly depend on one another.

8. Custom Extensions: Oozie provides an extension mechanism that allows the implementation of custom actions. This mechanism should be used when the extension cannot be modeled as a regular Java action.
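The parameterization feature in item 1 of the list above can be illustrated with a short Python sketch: one template is written once and resolved against different parameter bindings for each run. This is a simplified stand-in rather than Oozie's own resolution logic; the `resolve` function, the `${name}` placeholder handling shown here, and the parameter names `inputDir`, `outputDir` and `runDate` are all assumptions made for the example.

```python
import re

# Matches placeholders of the form ${name}.
_PARAM = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_.]*)\}")


def resolve(template, properties):
    """Replace every ${name} in `template` with its value from `properties`.

    Raises KeyError if the template uses a parameter that has no binding,
    so that a misconfigured run fails early instead of running with gaps.
    """
    def substitute(match):
        name = match.group(1)
        if name not in properties:
            raise KeyError(f"unbound workflow parameter: {name}")
        return properties[name]

    return _PARAM.sub(substitute, template)


# One definition, written once...
template = "input=${inputDir} output=${outputDir}/${runDate}"

# ...and executed with two different sets of bindings.
monday = resolve(template, {
    "inputDir": "/logs/2010-07-05",
    "outputDir": "/reports",
    "runDate": "2010-07-05",
})
tuesday = resolve(template, {
    "inputDir": "/logs/2010-07-06",
    "outputDir": "/reports",
    "runDate": "2010-07-06",
})
```

Keeping the bindings outside the definition is what allows the same workflow to be reused across days, datasets or environments without editing it.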

Get up and running!

Now that you have a good feel for what Oozie is and how simple it is to use, it is time to get your own instance of Oozie installed and configured. Follow our Quick Start Guide to get up and running with Oozie in a matter of minutes. You can reach the Oozie user group by sending an email to oozie-users@yahoogroups.com.