Writing Hadoop Programs That Work Across Releases

In a fast-moving project like Apache Hadoop, there are always exciting new features introduced in each release. While it is tempting to make the most of these new features by upgrading to the latest release, users are often concerned about their code continuing to run.

In this post, you’ll get an overview of the Hadoop API annotations and compatibility policies. Hadoop annotates specific APIs as safe for use by end users. By sticking to these APIs, users can ensure their code works across a set of releases and know which releases it might not work against.

Hadoop Releases and Compatibility

Before discussing the API-specific aspects, let us quickly look at the types of releases and the kinds of changes allowed in them. Hadoop releases (CDH or stock Apache Hadoop) belong to one of three categories, and adopt an x.y.z form of release numbering. The changes in a release determine the release type. The releases can be roughly characterized as follows (see Hadoop Roadmap for details):

Major release (x): Major releases are vehicles to ship new features that might require API/wire-incompatible changes. These releases are infrequent and shipped as needed.

Minor release (y): Minor releases are more frequent (every few months) and ship features and bug fixes that are API/wire-compatible with previous minor versions in the same major version.

Point release (z): Point releases are meant to ship critical bug fixes and should not introduce any incompatibilities.

For example, after the initial CDH 4.0.0 release, the subsequent CDH 4.y.z releases introduced several features, all of which maintain API and wire compatibility with other CDH 4.y.z releases. Similarly, Apache Hadoop 1.y.z releases are API-compatible with each other within the Hadoop-1 major release line.
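The release categories above can be sketched in a few lines of Java. This is only an illustration: ReleaseVersion and UpgradeType are hypothetical names, not part of Hadoop or CDH, and the parser assumes a plain x.y.z string with no suffix such as "-alpha".

```java
// Hypothetical helper (not a Hadoop class): classifies the release boundary
// crossed when moving between two versions written in plain x.y.z form.
enum UpgradeType { MAJOR, MINOR, POINT, SAME }

final class ReleaseVersion {
    final int major;
    final int minor;
    final int point;

    ReleaseVersion(int major, int minor, int point) {
        this.major = major;
        this.minor = minor;
        this.point = point;
    }

    // Parses "x.y.z"; suffixes such as "-alpha" are deliberately not handled.
    static ReleaseVersion parse(String text) {
        String[] parts = text.trim().split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected x.y.z, got: " + text);
        }
        return new ReleaseVersion(
                Integer.parseInt(parts[0]),
                Integer.parseInt(parts[1]),
                Integer.parseInt(parts[2]));
    }

    // The most significant component that differs decides the upgrade type.
    UpgradeType upgradeTo(ReleaseVersion target) {
        if (target.major != major) return UpgradeType.MAJOR;
        if (target.minor != minor) return UpgradeType.MINOR;
        if (target.point != point) return UpgradeType.POINT;
        return UpgradeType.SAME;
    }

    // Per the policy described above, API/wire compatibility is only
    // expected between releases that share a major version.
    boolean expectsCompatibilityWith(ReleaseVersion target) {
        return target.major == major;
    }
}
```

With this sketch, moving from 4.0.0 to 4.2.1 is classified as a MINOR upgrade for which compatibility is expected, while moving from 1.2.1 to 2.2.0 is a MAJOR upgrade for which it is not.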

Recently, the Hadoop community has formulated a compatibility guide that describes the various types of compatibility, including API source and binary compatibility, between releases. Hadoop developers (committers) follow the guide to determine the changes that can be included in a release. For end-users, the guide outlines potential incompatibilities to expect when upgrading to a release.

Java API Annotations and Compatibility

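Hadoop interfaces and classes are annotated to describe the intended audience and stability in order to maintain compatibility with previous releases. (See Hadoop Interface Classification for details.) Each annotation pairs an audience (Public, LimitedPrivate, or Private) with a stability level (Stable, Evolving, or Unstable), and the two are usually written together, as in Public-Stable. Two labels deserve particular caution:

Unstable – interfaces that might change incompatibly across any release, including point releases

Deprecated – interfaces that might be removed in a later release, typically with information on alternative APIs to use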

As you can infer from the above, end-users are expected to use Public-Stable APIs. According to the compatibility guide, a Public-Stable API must be deprecated in one major release before it can be removed in a subsequent major release. For instance, an API that is annotated “Public-Stable” in a CDH 4.x release needs to be deprecated in CDH 5.x before being removed in a CDH 6.x release. Note that major releases are years apart, and even these deprecations and removals are expected to be rare. In effect, by sticking to Public-Stable APIs, users can expect their programs to keep working for several years.
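The deprecate-then-remove rule is simple arithmetic on major version numbers. Here is a minimal sketch; DeprecationPolicy and its methods are hypothetical names used for illustration only, and the numbers follow the CDH 4.x / 5.x / 6.x example above.

```java
// Hypothetical helper (not a Hadoop class) encoding the rule that a
// Public-Stable API must be deprecated in one major release before it
// can be removed in a later major release.
final class DeprecationPolicy {

    // Earliest major release that may remove an API that is Public-Stable
    // (and not yet deprecated) in stableInMajor: one major release to
    // deprecate it, and the following one to remove it.
    static int earliestRemovalMajor(int stableInMajor) {
        int earliestDeprecationMajor = stableInMajor + 1;
        return earliestDeprecationMajor + 1;
    }

    // A removal is allowed only in a major release strictly after the
    // major release that deprecated the API.
    static boolean mayRemove(int deprecatedInMajor, int removalMajor) {
        return removalMajor > deprecatedInMajor;
    }
}
```

Under these assumptions, an API that is Public-Stable in major release 4 cannot disappear before major release 6.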

Here are some examples:

The org.apache.hadoop.mapred.JobClient class, which is used to submit MapReduce jobs, is annotated Public-Stable. It is safe to use and has to be deprecated in one major release before being incompatibly changed in the next major release.

The org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler class (the scheduler implementation for fair sharing of resources) is annotated LimitedPrivate(“yarn”) and is not intended to be used outside of YARN.

The org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.SchedulingPolicy class (the policy to use with a particular queue in the FairScheduler) is annotated Public-Evolving. Hence, it is available for public use, but can change in an incompatible way in the future.
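To make the three examples concrete, here is a minimal sketch of how such annotations can be read with reflection. To keep it self-contained, it declares local stand-ins named InterfaceAudience and InterfaceStability; in Hadoop the real annotations live in the org.apache.hadoop.classification package. The classes JobClientLike, FairSchedulerLike, and SchedulingPolicyLike, as well as ApiAudit, are hypothetical placeholders, not the actual Hadoop classes.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Local stand-ins for Hadoop's audience annotations (assumption: defined
// here only so the example compiles without Hadoop on the classpath).
final class InterfaceAudience {
    @Retention(RetentionPolicy.RUNTIME) @interface Public {}
    @Retention(RetentionPolicy.RUNTIME) @interface LimitedPrivate { String[] value(); }
    @Retention(RetentionPolicy.RUNTIME) @interface Private {}
}

// Local stand-ins for Hadoop's stability annotations.
final class InterfaceStability {
    @Retention(RetentionPolicy.RUNTIME) @interface Stable {}
    @Retention(RetentionPolicy.RUNTIME) @interface Evolving {}
    @Retention(RetentionPolicy.RUNTIME) @interface Unstable {}
}

// Hypothetical placeholders annotated like the three examples above.
@InterfaceAudience.Public @InterfaceStability.Stable
class JobClientLike {}

@InterfaceAudience.LimitedPrivate("yarn")
class FairSchedulerLike {}

@InterfaceAudience.Public @InterfaceStability.Evolving
class SchedulingPolicyLike {}

final class ApiAudit {
    // True only for classes carrying both the Public and Stable annotations,
    // which is the combination end-users are expected to rely on.
    static boolean isPublicStable(Class<?> type) {
        return type.isAnnotationPresent(InterfaceAudience.Public.class)
                && type.isAnnotationPresent(InterfaceStability.Stable.class);
    }

    // Returns the projects named by LimitedPrivate, or an empty array
    // when the class does not carry that annotation.
    static String[] limitedTo(Class<?> type) {
        InterfaceAudience.LimitedPrivate limited =
                type.getAnnotation(InterfaceAudience.LimitedPrivate.class);
        return limited == null ? new String[0] : limited.value();
    }
}
```

A check like ApiAudit.isPublicStable could be run over the Hadoop classes a program imports to flag dependencies on anything other than Public-Stable APIs. That use is a suggestion rather than something Hadoop ships.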

API Semantics

The Hadoop community strives to ensure that the behavior of APIs remains consistent across versions, although fixes for correctness may change behavior. Tests and javadocs specify an API’s behavior. The implementation may be changed to fix incorrect behavior; in such a case, the change should be accompanied by updates to the existing buggy tests, or by new tests in cases where there were none prior to the change.

REST APIs

Hadoop REST APIs are specifically meant for stable use by end-users across releases, even major releases. WebHDFS is a good example of such a REST API. The YARN-specific REST APIs (including MapReduce on YARN) are expected to be stable starting with the GA release of Hadoop 2 (and the corresponding CDH release).
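As a small illustration of what coding against a REST API looks like, the sketch below builds a WebHDFS directory-listing URL. The /webhdfs/v1 prefix and the LISTSTATUS operation are part of WebHDFS; the WebHdfsUrls class, the host name, and the port used in the usage note are assumptions made for this example, and no network call is made.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper (not a Hadoop class) that only constructs the URL;
// sending the HTTP GET request is left to the caller.
final class WebHdfsUrls {

    // Builds http://<host>:<port>/webhdfs/v1<hdfsPath>?op=LISTSTATUS.
    // hdfsPath is expected to be absolute, for example "/user/alice".
    static URI listStatus(String host, int port, String hdfsPath) {
        if (!hdfsPath.startsWith("/")) {
            throw new IllegalArgumentException("HDFS path must be absolute: " + hdfsPath);
        }
        try {
            // The multi-argument URI constructor quotes characters that are
            // not legal in a path, such as spaces.
            return new URI("http", null, host, port,
                    "/webhdfs/v1" + hdfsPath, "op=LISTSTATUS", null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("cannot build WebHDFS URL", e);
        }
    }
}
```

For instance, WebHdfsUrls.listStatus("namenode.example.com", 50070, "/user/alice") yields http://namenode.example.com:50070/webhdfs/v1/user/alice?op=LISTSTATUS. Because the client depends only on the URL and the JSON response, it does not need to be recompiled when the cluster is upgraded.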

Apache Hadoop 2.2.0 GA Release

The recent Apache Hadoop 2.2.0 release is the result of a significant amount of API stabilization work by the community. Starting with this release, the community aims to preserve API compatibility in the Hadoop 2 releases to follow. We encourage end-users and downstream projects to try out the latest APIs and report any issues encountered (through JIRAs or the cdh-user/hadoop-user mailing lists).

Karthik Kambatla is a Software Engineer at Cloudera in the scheduling and resource management team and works primarily on MapReduce and YARN.