
The internals of Oozie’s ShareLib have changed recently (reflected in CDH 5.0.0). Here’s what you need to know.

About a year ago, I explained in a blog post how to use the Apache Oozie ShareLib in CDH 4. Since then, the ShareLib has changed in CDH 5 (particularly its directory structure), so some of that information is now obsolete. (These changes went upstream under OOZIE-1619.)

One of the best new Apache Oozie features in CDH 5, Cloudera’s software distribution, is the ability to use cron-like syntax for coordinator frequencies. Previously, frequencies had to be fixed intervals (every hour or every two days, for example), which made anything more complicated (such as every hour from 9am to 5pm on weekdays, or the second-to-last day of every month) difficult to schedule.

Oozie’s new HA qualities help cluster operators sleep well at night. Here’s how it works.

One of the big new features in CDH 5 for Apache Oozie is High Availability (HA). In designing this feature, the Oozie team at Cloudera had two main goals: 1) don’t change the API or usage patterns, and 2) the user shouldn’t even have to know that HA is enabled. In other words, we wanted Oozie HA to be as easy and transparent as possible.

While XML is very good for standardizing the way Apache Oozie workflows are written, it’s also known for being very verbose. Unfortunately, that means that for workflows that have many actions, your workflow.xml can easily become quite long and difficult to manage and read. Cloudera is constantly making improvements to address this issue, and in this how-to, you’ll get a quick look at some of the current features and tricks that you can use to help shorten your Oozie workflow definitions.

The Sub-Workflow Action

One of the more interesting action types in Oozie is the Sub-Workflow Action, which lets you run another workflow from within your workflow. Suppose you’d like to use the same action multiple times in a workflow. This is not usually allowed, because Oozie workflows are Directed Acyclic Graphs (DAGs), so an action cannot be executed more than once as part of a workflow. However, if you put that action into its own workflow, you can call it multiple times from the same parent workflow by using the Sub-Workflow Action. So, instead of copying and pasting the same action each time you need it (and taking up a lot of extra space), you can use the Sub-Workflow Action, which can be shorter. It is also easier to maintain: if you ever want to change that action, you only have to do it in one place. You also get the advantage of being able to use that action in other workflows. Of course, you can still put multiple actions in your sub-workflow.

We’re always looking for new ways to improve the usability of Oozie and of the workflow format.

When building complex workflows in Apache Oozie, it is often useful to parameterize them so that they can be reused, driven from a script, and maintained more easily. The most common method is via ${VAR} variables. For example, instead of specifying the same NameNode for all of your actions in a given workflow, you can specify something like ${myNameNode}, and then define it in your job.properties file as myNameNode=hdfs://localhost:8020.

One of the advantages of that approach is that if you want to change the variable (the NameNode in this example), you only have to change it in one place and subsequently all the actions will use the new value. This can be particularly useful when testing in a dev or staging environment where you can simply change a few variables instead of editing the workflow itself.
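To illustrate the substitution described above, here is a minimal Python sketch. It is not how Oozie resolves variables internally; it only mimics the ${VAR} lookup using the standard library's string.Template. The property myNameNode comes from the example above, while inputDir, its path, and the helper function names are hypothetical.

```python
from string import Template


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def resolve(value, props):
    """Replace each ${VAR} in value with its entry from props."""
    return Template(value).substitute(props)


# Hypothetical job.properties content; only myNameNode is from the text above.
JOB_PROPERTIES = """
myNameNode=hdfs://localhost:8020
inputDir=/user/example/input
"""

props = parse_properties(JOB_PROPERTIES)
# A value as it might appear inside a workflow action (hypothetical path).
print(resolve("${myNameNode}${inputDir}", props))
```

Changing myNameNode in the properties text changes every resolved value that refers to it, which is the "change it in one place" benefit in miniature.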

Apache Oozie has a Java client and a Java API for submitting and monitoring jobs, but what if you want to use Oozie from another language or a non-Java system? Oozie provides a Web Services API, which is an HTTP REST API. That is, you can do anything with Oozie simply by making requests to the Oozie server over HTTP. In fact, this is how the Oozie client and Oozie Java API themselves talk to the Oozie server.
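As a rough sketch of what "making requests over HTTP" can look like from a non-Java language, the Python below builds two request URLs and defines a small helper for fetching JSON. The paths follow the v1 endpoints of the Oozie Web Services API as I understand them; the host, port, and function names are assumptions to adapt to your own setup.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumed server location; adjust for your cluster.
OOZIE_URL = "http://localhost:11000/oozie"


def status_url(base=OOZIE_URL):
    """URL that reports the server's system mode."""
    return f"{base}/v1/admin/status"


def job_info_url(job_id, base=OOZIE_URL):
    """URL that returns information about a single job."""
    return f"{base}/v1/job/{quote(job_id)}?{urlencode({'show': 'info'})}"


def get_json(url, timeout=10):
    """Send a GET request and decode the JSON response body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Building the URLs needs no server; fetching them does.
print(status_url())
print(job_info_url("my-job-id"))  # placeholder job ID
```

Against a reachable Oozie server, get_json(status_url()) would return the decoded status document. The helper is deliberately not called here, so the sketch runs without a cluster.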

In this how-to, I’ll explain how the REST API works.

What is REST?

In this installment of “Meet the Project Founder”, meet Apache Oozie PMC member (and ASF member) Alejandro Abdelnur, the Cloudera software engineer who founded what eventually became the Apache Oozie project in 2011. Alejandro is also on the PMC of Apache Hadoop.

We’re very happy to announce the 2.3 release of Hue, the open source Web UI that makes Apache Hadoop easier to use.

Hue 2.3 comes only two months after 2.2 but contains more than 100 improvements and fixes. In particular, two new apps were added (including an Apache Pig editor) and the query editors are now easier to use.

Hue is an open-source web interface for Apache Hadoop packaged with CDH that focuses on improving the overall experience for the average user. The Apache Oozie application in Hue provides an easy-to-use interface to build workflows and coordinators. Basic management of workflows and coordinators is available through the dashboards with operations such as killing, suspending, or resuming a job.

Prior to Hue 2.2 (included in CDH 4.2), there was no way to manage workflows in Hue that had been created outside of it. As of Hue 2.2, you can import a pre-existing Oozie workflow by its XML definition.

How to import a workflow

Apache Oozie, the workflow coordinator for Apache Hadoop, has actions for running MapReduce, Apache Hive, Apache Pig, Apache Sqoop, and DistCp jobs; it also has a Shell action and a Java action. These last two actions allow us to execute arbitrary shell commands or Java code, respectively.

In this blog post, we’ll look at an example use case and see how to use both the Shell and Java actions in more detail. Please follow along below; you can get a copy of the full project at Cloudera’s GitHub as well. This how-to assumes some basic familiarity with Oozie.