Delivering eBay's CI Solution with Apache Mesos - Part I

Problem statement

In eBay’s existing CI model, each developer gets a personal CI/Jenkins Master instance. This Jenkins instance runs within a dedicated VM, and over time the result has been VM sprawl and poor resource utilization. We started looking at solutions to maximize our resource utilization and reduce the VM footprint while still preserving the individual CI instance model. After much deliberation, we chose Apache Mesos for a POC. This post shares the journey of how we approached this challenge and accomplished our goal.

Jenkins framework’s Mesos plugin

The Mesos plugin is Jenkins’ gateway into the world of Mesos, so it made perfect sense to bring the plugin code in sync with our requirements. This video explains the plugin. The eBay PaaS team made several pull requests to the Mesos code, adding both new features and bug fixes. We are grateful to the Twitter engineering team (especially Vinod) for their input and cooperation in quickly getting these features validated and rolled out. Here are all the contributions that we made to the recently released 0.2 Jenkins Mesos plugin version. We are adding more features as we proceed.

Mesos cluster setup

Our new Mesos cluster is set up on top of our existing OpenStack deployment. In the model we are pursuing, we would necessarily have lots of Jenkins Mesos frameworks (each Jenkins Master is essentially a Mesos framework), and we did not want to run those outside of the Mesos cluster so that we would not have to separately provision and manage them. We therefore decided to use the Marathon framework as the Mesos meta framework; we launched the Jenkins master (and the Mesos framework) in Mesos itself. We additionally wanted to collocate the Jenkins masters in a special set of VMs in the cluster, using the placement constraint feature of Marathon that leverages slave attributes. Thus we separated Mesos slave nodes into a group of Jenkins masters and another group of Jenkins slave nodes. For backup purposes, we associated special block storage with the VMs running the CI master. Special thanks to the Mesosphere.io team for quickly resolving all of our queries related to Marathon and Mesos in general.

Basic testing succeeded as expected. The Jenkins master would launch through Marathon with a preconfigured Jenkins config.xml, and it would automatically register as a Mesos framework without needing any manual configuration. Builds were correctly launched in a Jenkins slave within one of the distributed Mesos slave nodes. Configuring slave attributes allowed the Mesos plugin to pick the nodes on which slave jobs could be scheduled. Checkpointing was enabled so that build jobs were not lost if the slave process temporarily disconnected and reconnected back to the Mesos master (the slave recovery feature). In case there was a new Mesos master leader elected, the plugin used Zookeeper endpoints to locate the new master (more on this a little later).

We decided to simulate a large deployment and wrote a Jenkins load test driver (mammoth). As time progressed we started uncovering use cases that were unsuccessful. Here is a discussion of each problem and how we addressed it.

Frameworks stopped receiving offers after a while

One of the first things we noticed occurred after we used Marathon to create the initial set of CI masters. As those CI masters started registering themselves as frameworks, Marathon stopped receiving any offers from Mesos; essentially, no new CI masters could be launched. The other thing we noticed was that, of the Jenkins Frameworks that were registered, only a few would receive offers. At that point, it was evident that we needed a very thorough understanding of the resource allocation algorithm of Mesos – we had to read the code. Here is an overview on the code’s setup and the dominant resource fairness algorithm.

Let’s start with Marathon. In the DRF model, it was unfair to treat Marathon in the same bucket/role alongside hundreds of connected Jenkins frameworks. After launching all these Jenkins frameworks, Marathon had a large resource share and Mesos would aggressively offer resources to frameworks that were using little or no resources. Marathon was placed last in priority and got starved out.

We decided to define a dedicated Mesos role for Marathon and to have all of the Mesos slaves that were reserved for Jenkins master instances support that Mesos role. Jenkins frameworks were left with the default role “*”. This solved the problem – Mesos offered resources per role and hence Marathon never got starved out. A framework with a special role will get resource offers from both slaves supporting that special role and also from the default role “*”. However, since we were using placement constraints, Marathon accepted resource offers only from slaves that supported both the role and the placement constraints.

Certain Jenkins frameworks were still getting starved

Our next task was to find out why certain Jenkins frameworks were getting zero offers even when they were in the same role and not running any jobs. Also, certain Jenkins frameworks always received offers. Mesos makes offers to frameworks and frameworks have to decline them if they don’t use them.

The important point was to remember that the offer was made to the framework. Frameworks that did not receive offers, but that had equal resource share to frameworks that received and declined offers, should receive offers from Mesos. Basically, past history has to be accounted for. This situation can also arise where there are fewer resources available in the cluster. Ben Hindman quickly proposed and implemented the fix for this issue, so that fair sharing happens among all of the Jenkins frameworks.

Mesos delayed offers to frameworks

We uncovered two more situations where frameworks would get starved out for some time but not indefinitely. No developer wants to wait that long for a build to get scheduled. In the first situation, the allocation counter to remember past resource offers (refer to the fix in the previous paragraph) for older frameworks (frameworks that joined earlier) would be much greater than for the new frameworks that just joined. New frameworks would continue to receive more offers, even if they were not running jobs; when their allocation counter reached the level of older frameworks, they would be treated equally. We addressed this issue by specifying that whenever a new framework joins, the allocation counter is reset for all frameworks, thereby bringing them to a level playing field to compete for resources. We are exploring an alternative that would normalize the counter values instead of setting the counter to zero. (See this commit.)

Secondly, we found that once a framework finished running jobs/Mesos tasks, the user share – representing the current active resources used – never came down to zero. Double arithmetic led to a ridiculously small value (e.g., 4.44089e-16), which unfairly put frameworks that had just finished builds behind frameworks that had their user share at 0. As a quick fix, we used precision 0.000001 to treat those small values as 0 in the comparator. Ben Hindman suggested an alternative: once a framework has no tasks and executors, it’s safe to set the resource share explicitly to zero. We are exploring that alternative as well in the Mesos bug fix review process.

Final hurdle

After making all of the above changes, we were in reasonably good shape. However, we discussed scenarios where certain frameworks running active jobs and finishing them always get ahead of inactive frameworks (due to the allocation counter); a build started in one of those inactive frameworks would waste some time in scheduling. It didn’t make sense for a bunch of connected frameworks to be sitting idle and competing for resource offers when they had nothing to build. So we came up with a new enhancement to the Jenkins Mesos plugin: to register as a Mesos framework only when there was something in the build queue. The Jenkins framework would unregister as a Mesos framework as soon as there were no active or pending builds (see this pull request). This is an optional feature, not required in a shared CI master that’s running jobs for all developers. We also didn’t need to use the slave attribute feature any more in the plugin, as it was getting resource offers from slaves with the default role.

Our load tests were finally running with predictable results! No starvation, and quick launch of builds.

Cluster management

When we say “cluster” we mean a group of servers running the PaaS core framework services. Our cluster is built on virtual servers in our OpenStack environment. Specifically, the cluster consists of virtual servers running Apache Zookeeper, Apache Mesos (masters and slaves), and Marathon services. This combination of services was chosen to provide a fault-tolerant and high-availability (HA) redundant solution. Our cluster consists of at least three servers running each of the above services.

By design, the Mesos and Marathon services do not operate in the traditional redundant HA mode with active-active or active-passive servers. Instead, they are separate, independent servers that are all active, although for HA implementation they use the Zookeeper service as a communication bus among the servers. This bus provides a leader election mechanism to determine a single leader.

Zookeeper performs this function by keeping track of the leader within each group of servers. For example, we currently provision three Mesos masters running independently. On startup, each master connects to the Zookeeper service to first register itself and then to elect a leader among themselves. Once a leader is elected, the other two Mesos masters will redirect all client requests to the leader. In this way, we can have any number of Mesos masters for an HA setup and still have only a single leader at any one time.

The Mesos slaves do not require HA, because they are treated as workers that provide resources (CPU, memory, disk space) enabling the Mesos masters to execute tasks. Slaves can be added and removed dynamically. The Marathon servers use a similar leader election mechanism to allow any number of Marathon servers in an HA cluster. In our case we also chose to deploy three Marathon servers for our cluster. The Zookeeper service does have a built-in mechanism for HA, so we chose to deploy three Zookeeper servers in our cluster to take advantage of its HA capabilities.

Building the Zookeeper-Mesos-Marathon cluster

Of course we always have the option of building each server in the cluster manually, but that quickly becomes time-intensive and prone to inconsistencies. Instead of provisioning manually, we created a single base image with all the necessary software. We utilize the OpenStack cloud-init post-install to convert the base image into either a Zookeeper, Mesos Master, Mesos Slave, or Marathon server.

We maintain the cloud-init scripts and customization in github. Instead of using the nova client directly or the web gui to provision, we added another automation feature and wrote Python scripts to call the python-novaclient and pass in the necessary cloud-init and post-install instructions to build a new server. This combines all the necessary steps into a single command. The command provisions the VM, instructs the VM to download the cloud-init post-install script from github, activates the selected service, and joins the new VM to a cluster. As a result, we can easily add servers to an existing cluster as well as create new clusters.

Cluster management with Ansible

Ansible is a distributed systems management tool that helps to ease the management of many servers. It is not too difficult or time-consuming to make changes on one, two, or even a dozen servers, but making changes to hundreds or thousands of servers becomes a non-trivial task. Not only do such changes take a lot of time, but they have a high chance of introducing an inconsistency or error that would cause unforeseen problems.

Ansible is similar to cfengine, puppet, chef, salt, and many other systems management tools. Each tool has its own strengths and weaknesses. One of the reasons we decided to use Ansible is its ability to execute remote commands using ssh without having the need for any Ansible client to run on the servers.

Ansible can be used as a configuration management, software deployment, and do-anything-you-want kind of a tool. It employs a plug-and-play concept, where existing modules have already been written for many functions. For example, there are modules for connecting to hosts with a shell, for AWS EC2 automation, for networking, for user management, etc.

Since we have a large Mesos cluster and several of the servers are for experimentation, we use Ansible extensively to manage the cluster and make consistent changes when necessary across all of the servers.

Conclusion

Depending on the situation and use case, many different models for running CI in Mesos can be tried out. The model that we outlined above is only one. Another variation is a shared master using the plugin and temporary masters running the build directly. In Part II of this blog post, we introduce an advanced use case of running builds in Docker containers on Mesos.