Secrets of Cloudera Support: Using OpenStack to Shorten Time-to-Resolution

Automating the creation of short-lived clusters for testing purposes frees our support engineers to spend more time on customer issues.

The first step for any support engineer is often to replicate the customer’s environment in order to identify the problem. Given the complexity of Cloudera customer environments, reproducing a specific issue is often quite difficult: a customer’s problem might only surface with specific versions of Cloudera Enterprise (CDH + Cloudera Manager), particular configuration settings, a certain number of nodes, or a particular dataset structure. Even with Cloudera Manager’s awesome setup wizards, setting up Apache Hadoop can be quite time consuming, as the software was never designed with ephemeral clusters in mind.

For these reasons, until recently, a significant amount of our support engineers’ time was spent creating virtual machines, installing specific Cloudera Manager and/or CDH versions, and setting up the services installed on the customer’s cluster. To make matters worse, engineers had to run four-node Hadoop clusters on their own laptops, which not only taxed the machine’s resources but often forced an engineer to reproduce one issue at a time, which is incredibly inefficient because support engineers are very rarely working just a single ticket. This approach required a careful balance of time management and swift case resolutions to keep things moving smoothly.

In recognition of this problem, Cloudera Support determined that the first step toward streamlining how engineers reproduce issues—and thus reducing resolution time for our customers overall—was to give them a fast, scalable way to deploy ephemeral instances, so that they could install the necessary software on any number of nodes without having to run them locally on their machines.

In the remainder of this post, we’ll explain why OpenStack was the ideal choice for a self-service tool that meets these goals.

On OpenStack

Using OpenStack, a user clicks a button and seconds later gets a brand-new virtual machine on which to install software and run tests, then tears it down when finished. Countless companies use the OpenStack core to build internal cloud infrastructure, for various reasons:

Scale-out architecture—more nodes equals more resources

Incredibly active community—contributions from hundreds of companies

API centric—anything can be scripted

Lightning-fast instance spin-up—brand-new instances in under a minute
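The “anything can be scripted” point is worth illustrating. With the python-novaclient library, booting a batch of short-lived test instances takes only a few lines. The sketch below is a hypothetical helper (not our actual tooling): it plans per-node names and metadata for a support case, with the real Compute API call shown in a comment.

```python
# Hypothetical helper for scripting instance creation against the
# OpenStack Compute API. Naming scheme and metadata keys are assumptions.

def plan_instances(engineer, case_id, count):
    """Build (name, metadata) pairs for a batch of short-lived instances."""
    instances = []
    for i in range(1, count + 1):
        name = "%s-case%s-node%d" % (engineer, case_id, i)
        meta = {"owner": engineer, "case": str(case_id), "ephemeral": "true"}
        instances.append((name, meta))
    return instances

# With python-novaclient, each planned instance could then be booted:
#
#   from novaclient import client
#   nova = client.Client("2", USER, PASSWORD, PROJECT, AUTH_URL)
#   for name, meta in plan_instances("jdoe", 12345, 4):
#       nova.servers.create(name, image=IMAGE_ID, flavor=FLAVOR_ID, meta=meta)
```

Tagging instances with owner and case metadata makes it trivial to find and reap everything tied to a closed ticket later.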

After considering all the major OpenStack distributions, we chose to use Red Hat’s RDO for the reasons listed below:

Simple yet powerful install procedure

Great documentation

100% open source core; no lock-in

Production-quality updates

Optional enterprise support

Very approachable community (for questions, comments, bugs)

Within two weeks of deploying what we now call Support Lab, we already had the majority of our support staff using it and providing great feedback and suggestions.

Adding Cloudera Manager

Very quickly we realized that we needed to ride this momentum, so we started the next phase of the project: to fully automate the deployment of instances as well as Cloudera Enterprise. The goals were simple:

Abstract the creation of instances in OpenStack

Fully bootstrap the installation of Cloudera Enterprise

Allow the user to mix-and-match CDH/Cloudera Manager versions

Only install services defined by the user (HDFS, Impala, HBase, and so on)

Make the cluster size configurable (1 node => n nodes)
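To make these goals concrete, here is a minimal sketch of how a user’s selections (versions, services, node count) might expand into a deployment plan before the OpenStack and Cloudera Manager APIs are driven. This is a hypothetical illustration, not the actual Support Lab code; the role names mirror Cloudera Manager role types, and the layout rule (masters on the first host) is a simplifying assumption.

```python
# Hypothetical deployment-plan builder illustrating the goals above.
# Role type names (NAMENODE, REGIONSERVER, ...) follow Cloudera Manager.

def build_plan(cluster_name, cm_version, cdh_version, services, num_nodes):
    """Expand user selections into hosts and per-role host assignments."""
    if num_nodes < 1:
        raise ValueError("cluster must have at least one node")
    hosts = ["%s-node%d" % (cluster_name, i) for i in range(1, num_nodes + 1)]
    plan = {
        "cm_version": cm_version,
        "cdh_version": cdh_version,
        "hosts": hosts,
        "roles": {},
    }
    # Simplification: master roles on the first host, worker roles on all
    # hosts -- a common layout for small test clusters.
    if "HDFS" in services:
        plan["roles"]["NAMENODE"] = [hosts[0]]
        plan["roles"]["DATANODE"] = hosts
    if "HBASE" in services:
        plan["roles"]["MASTER"] = [hosts[0]]
        plan["roles"]["REGIONSERVER"] = hosts
    if "IMPALA" in services:
        plan["roles"]["IMPALAD"] = hosts
    return plan
```

A plan like this maps directly onto the two client libraries: the host list feeds instance creation, and the role mapping feeds Cloudera Manager’s cluster-setup calls.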

By utilizing the Python APIs of both OpenStack and Cloudera Manager, we were able to create a simple web application to completely orchestrate all the steps an engineer would normally take to set up a cluster manually. Below is a screenshot of what this piece of automation looks like to the user.

After clicking the Deploy button, the user is brought to the following screen where they can monitor the status of the deployment. This particular deployment took only 16 minutes to stand up a fully functioning CDH 5.1 Apache HBase cluster, a process that would normally take hours of an engineer’s time. In addition to the monitoring page shown below, the user is also notified via HipChat so they can redirect their attention to something more important during the actual deployment:
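The HipChat notification in that flow boils down to a single HTTP POST. Below is a hedged sketch of building such a message; the room name and message wording are assumptions, and the field names follow HipChat’s v1 rooms/message API.

```python
# Hypothetical "deployment finished" notification payload.
# Field names follow HipChat's v1 rooms/message API (form-encoded POST).

def deployment_message(user, cluster_name, minutes):
    """Build the message payload announcing a finished deployment."""
    text = "@%s: cluster %s is ready (deployed in %d minutes)" % (
        user, cluster_name, minutes)
    return {
        "room_id": "Support Lab",   # assumed room name
        "from": "SupportLab",
        "message": text,
        "message_format": "text",
        "notify": 1,                # trigger a client-side notification
        "color": "green",
    }

# Sending it would be a single call, e.g.:
#   import requests
#   requests.post("https://api.hipchat.com/v1/rooms/message",
#                 params={"auth_token": TOKEN, "format": "json"},
#                 data=deployment_message("jdoe", "case12345", 16))
```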

And just like that, the engineer has a brand-new temporary cluster to use for testing.

Conclusion

Overall, OpenStack met all of our expectations for automating cluster deployments, and as our team grows, we’ll continue to invest in our automation infrastructure. Many thanks to Red Hat and the countless other companies that contribute to the project—we can say for sure that our support staff (and customers) appreciate it.

(We also needed a more customized approach than an off-the-shelf tool could offer: our roadmap includes items that would only be useful to people in support, such as the ability to deploy pre-broken clusters or to inject customer configurations into a deployment.)