Update on Apache Bigtop (incubating)

Introduction

Ever since Cloudera decided to contribute the code and resources for what would later become Apache Bigtop (incubating), we’ve been answering a very basic question: what exactly is Bigtop and why should you or anyone in the Apache (or Hadoop) community care? The earliest and the most succinct answer (the one used for the Apache Incubator proposal) simply stated that “Bigtop is a project for the development of packaging and tests of the Hadoop ecosystem”. That was a nice explanation of how Bigtop relates to the rest of the Apache Software Foundation’s (ASF) Hadoop ecosystem projects, yet it doesn’t really help you understand the aspirations of Bigtop.

The history

Cloudera was the first company to create an open source distribution that included Apache Hadoop, releasing the first version (CDH1) back in March 2009. The initial goal of CDH was to make Apache Hadoop easier to adopt by providing packaging that let users install Hadoop on popular Linux operating systems without having to compile from source.

In mid-2010 Cloudera announced a major change in CDH that eventually came to recast what defined an Apache Hadoop-based distribution. We observed that users were typically running not just Apache Hadoop but also a collection of other open source systems and components that were quickly becoming essential to a fully functioning data management system. But in order to run such a system, organizations needed to do a great deal of work, assembling and integrating sometimes as many as a dozen different components. Each open source component had its own release schedule, dependencies, interfaces and standards for quality.

CDH3 was the first release to provide a great many of these components together as an integrated system. Since that time we’ve updated the distribution on a regular quarterly schedule and recently released a new major version (CDH4).

That notion of a Hadoop distribution has become the industry’s prevailing definition:

An integrated set of open source components that make up an Apache Hadoop-based data management system

Integrated & tested to work together

Tested & packaged to work on a standardized set of platforms

Today, all providers of Apache Hadoop distributions essentially follow this model and many in fact simply choose to redistribute CDH.

The motivation

Building and supporting CDH taught us a great deal about what was required to repeatedly assemble a truly integrated, Apache Hadoop-based data management system. The build, testing and packaging cost was considerable, and we regularly observed that different projects made different design choices that made ongoing integration difficult. We also realized that more and more mission-critical workloads were running on CDH and that customer demand for stability, predictability and compatibility was increasing.

Apache Bigtop was part of our answer to these two different problems: initiate an Apache open source project focused on creating the testing and integration infrastructure of an Apache Hadoop-based distribution. With it we hoped that:

We could better collaborate within the extended Apache community to contribute to resolving test, integration & compatibility issues across projects

We could create a kind of developer-focused distribution that would be able to release frequently, unencumbered by the enterprise expectations for long-term stability and compatibility.

This would enable us to make progress faster, iterating quickly with new releases of all the projects included in the distribution without worrying about a high rate of change or compatibility breaking that would be difficult to inject into our stable, supported enterprise distribution.

We could help create a community process around the development of a distribution itself. We imagined this would be beneficial both for Apache Bigtop and ultimately for CDH.

Progress so far

It’s been nearly a year since Apache Bigtop was proposed to the incubator and we’ve been thrilled with the progress. There have been 4 releases so far, in keeping with the goal of delivering fixed-time, variable-scope “train” releases. The project started with a diverse range of contributors and this diversity has broadened over time. We’ve seen new contributions from various corporate sponsors but, more importantly, from members of related communities. Apache Hama, for example, was added to Bigtop, and a member of that community was added to the project in the process. There’s been a similar investment to add Apache Giraph (incubating) to the project.

The overall rate of activity within Apache Bigtop is accelerating. More patches are contributed each month. More individuals are joining the user and developer lists. This project comes at an important time in the evolution of the Hadoop stack. More than half a dozen new projects have recently sprung up to extend the feature set of the Apache Hadoop stack, and Bigtop represents an opportunity to integrate more of them more quickly into the context of a larger, more strategic data management system.

What Apache Bigtop means for you

If you are:

A casual user (a big data hacker): Bigtop provides a fully integrated, packaged, and validated stack of big data management software based on the Apache Hadoop ecosystem, specially tailored for your favorite version of Linux (and perhaps other OSes in the future). The packaged artifacts and the deployment experience will be very similar to CDH, but with two key exceptions:

CDH backports patches to provide long-term support on a stable release, preserving stability and compatibility, whereas a Bigtop distribution will be much more aggressive in tracking the very latest versions of Hadoop ecosystem components, even if that introduces instability or compatibility changes from release to release.

Apache Bigtop will include a wider range of systems and components, for many of which Cloudera may not provide production support (e.g. Apache Hama, Apache Giraph).

An OS vendor: Bigtop provides a readily available source of packaging, validation, and deployment code that can be used as a basis for integrating Apache Hadoop into OS bundles.

A company building its own distribution that includes Apache Hadoop: Bigtop could be a good point of departure and a treasure trove of wheels that don’t need to be reinvented.

Parting Thoughts

Apache Bigtop (incubating) is still a very young project. We have some ambitious goals in mind, but we can’t possibly achieve them without your help. We need your feedback and we need your involvement. As always, patches are welcome.