Missing Building Blocks for Enterprise OpenStack: Part 1 – High Availability

It’s a great time to be an OpenStack company – you get most of the data you need for marketing and product management simply by talking to customers and partners every day. Nevertheless, the landscape is quite competitive – so both for the community and for individual vendors, it’s important to build and prioritize the feature backlog wisely, all while being clear on who wants what. I’ll play “Captain Obvious” here, but still, the needs of the Enterprise are quite different from those of a service provider, a government, or some “web-scale” IT shop.

In this blog post (and several to come) I will share some thoughts on features — “building blocks”, really — that OpenStack is still missing, but that are necessary in order for it to be successful in the Enterprise. I’ll also give you some hints as to whether anyone is actually working on bridging the gap, and if so, what solutions exist.

Missing Building Block #1: Enterprise-level HA/Fault Tolerance

HA, or “High Availability”: in the Enterprise, these are probably the two most influential letters in the virtualization/cloud context. In a nutshell, having this feature means that if a Virtual Machine fails for any reason (the operating system crashed, the entire hypervisor node went down, and so on) the datacenter/cloud management platform will bring it back to life ASAP. This may involve rapid restart on the same hypervisor host, or evacuation to another hypervisor host. The “extreme” mode for “VIP” VMs is “Fault Tolerance”, or running a pair of VMs on different hypervisors with CPU/memory state mirroring, so that there’s always a survivor to turn to in case of catastrophe.

Why Does the Enterprise Need HA?

Historically, the success of vSphere in the Enterprise was largely built on the treatment of legacy applications as “pets”. These applications have typically been under heavy development for many years, run on bare metal, and are maintained by dedicated teams. Applications of this kind are usually not “cloud ready”. They have little-to-no native failover intelligence, but they successfully solve for business needs, and have their development budget planned out years ahead of time.

In addition to consolidation on fewer physical servers, vSphere enhances the “quality of life” for these applications by helping them recover from failures, while not requiring them to have any “virtualization/cloud awareness”. To succeed, OpenStack needs to be able to fulfill this same function.

What’s the State of HA in OpenStack?

The good news is that “the bits” necessary for HA are already there, so the effort for building generic “Availability-as-a-Service” for OpenStack is lower than one might expect.

OpenStack has a number of supported shared+distributed storage backends that are feasible for live migration/evacuation (our local Mirantis favorite being Ceph), and Nova even has an implementation of “nova evacuate” – a command that triggers a sequence of API calls for VM evacuation to a different hypervisor host.
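For readers who have not used it, here is a rough sketch of how a script might assemble that command. It only builds the argument list and runs nothing. The helper name, the server name and the host name are made up for illustration, and the --on-shared-storage flag is an assumption based on the nova client of this era, so check "nova help evacuate" for your release.

```python
from typing import List


def evacuate_command(server: str, target_host: str,
                     shared_storage: bool = True) -> List[str]:
    """Build the argv for a `nova evacuate` call (hypothetical helper)."""
    argv = ["nova", "evacuate"]
    if shared_storage:
        # Assumption: instance disks live on shared storage such as Ceph,
        # so the disks are reused instead of being rebuilt from the image.
        argv.append("--on-shared-storage")
    argv += [server, target_host]
    return argv


if __name__ == "__main__":
    # "pet-db-01" and "compute-02" are placeholder names.
    print(" ".join(evacuate_command("pet-db-01", "compute-02")))
```

A real script would hand this list to something like subprocess.run, and only after confirming that the source host is really down.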

What’s missing is the management+monitoring component (and, of course, a nice UI and lots of PR 🙂 ). Some process still needs to closely monitor HA-enabled VMs on multiple levels – hypervisor availability, Nova compute sanity, VM ping response, and so on – and, upon making the “ok, it died” decision, trigger evacuation through Nova. And, of course, the system must ensure the success of the evacuation.
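To make that missing piece a bit more concrete, here is a minimal sketch of the decision logic such a component might contain. Everything in it is an assumption for illustration: the Probe fields, the three-strikes threshold, and the evacuate/restart/verify callables, which stand in for whatever would actually talk to Nova. It is not an existing OpenStack service.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Action(Enum):
    NONE = "none"
    RESTART = "restart"    # rapid restart on the same hypervisor host
    EVACUATE = "evacuate"  # bring the VM back on another host


@dataclass(frozen=True)
class Probe:
    """One monitoring sample for a VM (hypothetical structure)."""
    hypervisor_up: bool
    nova_compute_ok: bool
    vm_responds: bool


def decide(history: List[Probe], threshold: int = 3) -> Action:
    """Make the "ok, it died" call from the last `threshold` samples."""
    recent = history[-threshold:]
    if len(recent) < threshold:
        return Action.NONE  # not enough evidence yet
    if all(not (p.hypervisor_up and p.nova_compute_ok) for p in recent):
        return Action.EVACUATE  # the host itself looks dead
    if all(not p.vm_responds for p in recent):
        return Action.RESTART  # the VM is unresponsive in every sample
    return Action.NONE


def handle(vm_id: str, history: List[Probe],
           evacuate: Callable[[str], object],
           restart: Callable[[str], object],
           verify: Callable[[str], bool],
           max_attempts: int = 2) -> Tuple[Action, bool]:
    """Run the chosen recovery and check that it actually worked."""
    action = decide(history)
    if action is Action.NONE:
        return action, True
    recover = evacuate if action is Action.EVACUATE else restart
    for _ in range(max_attempts):
        recover(vm_id)
        if verify(vm_id):
            return action, True
    return action, False  # recovery failed; time to page a human
```

Requiring several consecutive bad samples is a deliberate choice here: a single missed ping should not be enough to move a “pet” to another host.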

The bad news is that the OpenStack community has been (and, to some extent, still is) inconsistent in determining OpenStack’s direction when it comes to providing application availability. Luckily, the latest Summit, in Atlanta, reinforced the sentiment of “Winning the Enterprise”, and while respecting OpenStack’s DevOps/“cloud ready” roots, many vocal community members now express support for the idea of “having a service that uses the Nova APIs to monitor services or even entire VMs and automatically take action, such as starting another instance from the last snapshot of a cinder volume, creating additional instances and the like”.

The ugly (or perhaps just unfortunate) part is that during this time of community inconsistency, some potential OpenStack adopters might have gotten the wrong message, thinking, “OpenStack will never care about HA beyond its own controller infrastructure.” I wonder if we still have time to win these people back.

So now the moment of truth comes – who will write the code, and when will it actually become a useful feature?

The interim solution

One may argue that setting up Nagios or Zabbix with aggressive polling of “pet” VMs and scripts to trigger evacuation is a solution here. That may work in a geeky DIY environment, but I believe that as far as management is concerned, that’s too cumbersome to be practical in the enterprise. Let’s not forget – IT is often still a cost center in the enterprise, so we need to make things easier for these guys, not harder. As we move forward, we can also look at leveraging Heat as a “state machine”, and Ceilometer as an alerting manager, but so far, at least, there are no consistent success stories we can point to.

The real trade-off opportunity here is to start adopting OpenStack by mixing KVM and vSphere hypervisors (assuming that the enterprise has some vSphere licenses). OpenStack can help with self-service/multi-tenancy/orchestration and hosting “cloud ready” apps on KVM, while vSphere will do what it does best – host “pet” applications and make sure that “bare-metal-like” virtualization will keep them happy.

What other features do you think OpenStack still needs to succeed in the Enterprise?

PS: Don’t forget that HA is included in vSphere starting from the Essentials Plus kit, the second least expensive VMware offering after the ESXi-only Essentials kit, but you will also need a vCenter license.