Oozie by Example

In our previous article [Introduction to Oozie] we described the Oozie workflow server and presented an example of a very simple workflow. We also described the deployment and configuration of a workflow for Oozie, and the tools for starting, stopping and monitoring Oozie workflows.

In this article we will describe a more complex Oozie example, which will allow us to discuss more Oozie features and demonstrate how to use them.

Defining process

The workflow described here implements ingestion of vehicle GPS probe data. Probe data is delivered hourly to a specific HDFS directory[1] as a file containing all probes for that hour. Ingestion is done daily, over all 24 files for the day. If the directory contains 24 files, the ingestion process should start. Otherwise:

For the current day do nothing

For previous days (up to 7 days old), send a reminder to the probes provider

If the age of the directory is 7 days, ingest all available probes files.

The overall implementation of the process is presented in Figure 1.

Figure 1: Process diagram

Here the main process (the ingestion process) first calculates the directory names for the current day and the 6 previous days, and then starts (forks) 7 directory subprocesses (subflows). Once all subprocesses reach the end state, the join step transfers control to the end state.
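The directory-name calculation can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class name `ProbeDirectories`, the root path `/data/probes` and the `yyyy-MM-dd` directory naming are hypothetical, since the article does not state the actual naming scheme.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

class ProbeDirectories {
    // Hypothetical layout: <root>/<yyyy-MM-dd>. The real naming scheme is not given in the article.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ISO_LOCAL_DATE;

    /** Returns the directory for today followed by one directory per previous day, newest first. */
    static List<String> directoriesFor(String root, LocalDate today, int previousDays) {
        List<String> dirs = new ArrayList<>();
        for (int age = 0; age <= previousDays; age++) {
            dirs.add(root + "/" + today.minusDays(age).format(DAY));
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Current day plus 6 previous days gives the 7 directories the main process forks on.
        for (String dir : directoriesFor("/data/probes", LocalDate.now(), 6)) {
            System.out.println(dir);
        }
    }
}
```

Each resulting name would then be passed as the input parameter of one forked directory subprocess.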

The subprocess starts by getting information about the directory: its age and the number of files it contains. Based on this information, it decides whether to ingest and archive the data, send a reminder email, or do nothing.

Directory subprocess implementation

The workhorse of our implementation is the directory subprocess (Listing 1).

This class gets the directory name as an input parameter and first checks whether the directory exists. If it does not, the class returns -1 for both the age and the number of files; otherwise, it returns the actual age and number of files to the subprocess.
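A class of this kind might look roughly like the sketch below. It is not the authors' listing, and it makes several assumptions: it uses `java.nio.file` on the local filesystem as a stand-in for the HDFS API so that it stays self-contained, it assumes the last path element is the day in `yyyy-MM-dd` form so that the age can be computed, and the property keys `dir.age` and `dir.num-files` are hypothetical. The `oozie.action.output.properties` system property is the file location Oozie supplies to a java action that declares `capture-output`.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Properties;
import java.util.stream.Stream;

class DirectoryInfo {
    /** Returns the age in days and the file count; both are -1 when the directory is missing. */
    static Properties inspect(Path dir, LocalDate today) throws IOException {
        Properties props = new Properties();
        if (!Files.isDirectory(dir)) {
            props.setProperty("dir.age", "-1");
            props.setProperty("dir.num-files", "-1");
            return props;
        }
        // Assumption: the last path element is the day in ISO form, e.g. 2011-03-01.
        LocalDate day = LocalDate.parse(dir.getFileName().toString());
        long age = ChronoUnit.DAYS.between(day, today);
        long files;
        try (Stream<Path> entries = Files.list(dir)) {
            files = entries.filter(Files::isRegularFile).count();
        }
        props.setProperty("dir.age", Long.toString(age));
        props.setProperty("dir.num-files", Long.toString(files));
        return props;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: DirectoryInfo <directory>");
            return;
        }
        Properties props = inspect(Paths.get(args[0]), LocalDate.now());
        // With capture-output, Oozie names the file to write in this system property.
        String out = System.getProperty("oozie.action.output.properties");
        if (out == null) {
            props.store(System.out, null); // running outside Oozie: just print the values
        } else {
            try (OutputStream os = Files.newOutputStream(Paths.get(out))) {
                props.store(os, null);
            }
        }
    }
}
```

Writing the values as a properties file is what makes them available to later workflow nodes, such as the decision that follows.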

The next step in the subprocess is a switch (decision) statement, which decides how to process the directory. If the directory does not exist (number of files < 0), or it is current (directory age < 1) and the number of files is less than 24 (number of files < 24), the subprocess transitions directly to the end. If all the files are in the directory (number of files > 23) or the directory is at least 7 days old (directory age > 6), the following will occur:

Additional configuration on action nodes

Prepare - The prepare element, if present, indicates a list of paths to delete before starting the job. This should be used exclusively for directory cleanup. The delete operation will be performed in the fs.default.name filesystem.

Configuration - The configuration element, if present, contains JobConf properties for the Map/Reduce job. It can be used not only in a map/reduce action, but also in a java action that starts a map/reduce job.

If neither of the above cases is true, the subprocess sends a reminder email and exits. The email is implemented as another Java main class (Listing 3).
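The three-way decision described in this section can be summarized as a small pure function. In the workflow itself this logic lives in an Oozie decision node, so the Java below is only an illustrative restatement of the thresholds given in the text; the class, enum and constant names are hypothetical.

```java
class DirectoryDecision {
    enum Action { SKIP, INGEST, REMIND }

    static final int FILES_PER_DAY = 24; // one probe file per hour
    static final int MAX_AGE_DAYS = 7;   // ingest whatever is available at this age

    /** Mirrors the decision order in the text: skip first, then ingest, otherwise remind. */
    static Action decide(int age, int numFiles) {
        boolean missing = numFiles < 0;
        boolean currentAndIncomplete = age < 1 && numFiles < FILES_PER_DAY;
        if (missing || currentAndIncomplete) {
            return Action.SKIP;
        }
        if (numFiles >= FILES_PER_DAY || age >= MAX_AGE_DAYS) {
            return Action.INGEST;
        }
        return Action.REMIND;
    }

    public static void main(String[] args) {
        // A few representative (age, numFiles) pairs.
        int[][] samples = { {-1, -1}, {0, 10}, {0, 24}, {3, 10}, {7, 10} };
        for (int[] s : samples) {
            System.out.println("age=" + s[0] + " files=" + s[1] + " -> " + decide(s[0], s[1]));
        }
    }
}
```

Keeping the rule in one function of two integers makes the boundary cases (exactly 24 files, exactly 7 days) easy to check in isolation.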

Conclusion

In this article we have shown a more complex end-to-end workflow example, which allowed us to demonstrate additional Oozie features and their usage. In the next article we will discuss building a library of reusable Oozie components and extending Oozie with custom nodes.

Acknowledgements

The authors are thankful to our NAVTEQ colleague Gregory Titievsky for implementing the majority of the code.

About the Authors

Boris Lublinsky is a principal architect at NAVTEQ, where he works on defining the architecture vision for large data management and processing and SOA, and on implementing various NAVTEQ projects. He is also an SOA editor for InfoQ and a participant in the SOA RA working group at OASIS. Boris is an author and frequent speaker; his most recent book is "Applied SOA".

Michael Segel has spent the past 20+ years working with customers identifying and solving their business problems. Michael has worked in multiple roles, in multiple industries. He is an independent consultant who is always looking to solve any challenging problems. Michael has a Software Engineering degree from the Ohio State University.

What are the different ways to set the OOZIE_ACTION_OUTPUT_PROPERTIES property?

Specifically, when I am running Oozie as LocalOozie in my testing environment, different actions in my workflow set different values for the OOZIE_ACTION_OUTPUT_PROPERTIES property. Looking at the generated path, I suspect this is because different mappers are instantiated for different actions.

How can I make all actions in my workflow use the same value for the OOZIE_ACTION_OUTPUT_PROPERTIES property?
