June 10, 2013

Big Data on OpenStack

Big Data systems by their very nature tend to be...Big.

Big in the amount of data, and big in the number and size of the machines behind it. Cloud-based infrastructure can be a good fit as a cost-effective infrastructure for running those Big Data systems. While this may sound obvious, many of the Big Data deployments that I've encountered run outside of a cloud environment. The main reason is the performance overhead that is often associated with running on a virtualized cloud environment. Let me explain.

The Performance Overhead of Running Big Data on the Cloud

A recent benchmark done by Petersenna across various virtualization solutions, such as Xen, VMware and Hyper-V, shows that on average the performance overhead associated with virtualization can lead to 2.4 times slower disk latency and 25% slower network I/O.

How does the performance overhead translate into cost?

Based on the various benchmarks, I think that it would be fair to assume that running I/O intensive workloads such as Big Data on a virtualized infrastructure would require 3X more resources than the bare metal equivalent.

Since we are talking about Big Data infrastructure, 3x means a lot of wasted resources. The operational costs, coupled with the complexity of running a significantly bigger system, yield a fairly substantial overhead. In addition, the network and disk overhead is not deterministic and can vary quite substantially as utilization gets higher. This leads to not just performance overhead, but also unpredictable overhead. In Big Data terms, that means that a query for particular data can take 10msec one time and 30msec another time. Quite often, running Big Data analysis requires a sequence of those operations. Therefore, if we pile this overhead up across a sequence of 10 queries, the total response time can swing anywhere from 100msec to 300msec.
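To make the 10-query example concrete, here is a minimal sketch in Python. It assumes that each query independently takes somewhere between 10msec and 30msec, drawn uniformly; the uniform distribution and the function names are illustrative assumptions, not measurements from any benchmark.

```python
import random


def latency_bounds_ms(num_queries, min_ms=10.0, max_ms=30.0):
    """Best and worst total latency for a sequential chain of queries."""
    return num_queries * min_ms, num_queries * max_ms


def simulate_chain_ms(num_queries, rng, min_ms=10.0, max_ms=30.0):
    """Total latency of one run of the chain.

    Assumption: each query's latency is uniform in [min_ms, max_ms].
    """
    return sum(rng.uniform(min_ms, max_ms) for _ in range(num_queries))


best_ms, worst_ms = latency_bounds_ms(10)

# Fixed seed so that repeated runs give the same samples.
rng = random.Random(42)
samples = [simulate_chain_ms(10, rng) for _ in range(1000)]

print("bounds: %.0f to %.0f msec" % (best_ms, worst_ms))
print("observed: %.1f to %.1f msec" % (min(samples), max(samples)))
```

The bounds grow linearly with the length of the chain, so the longer the analysis pipeline, the wider the gap between a good run and a bad run.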

While running Big Data on the cloud holds a lot of promise, the performance overhead and non-deterministic behavior limit that choice to more of a niche scenario, where the overhead is less significant than the elasticity benefit of the cloud. For example, for sporadic workloads in which we run our analysis for a certain period of time and can then release the resources, on-demand infrastructure is still the better choice, as running 3x the amount of resources for an hour is significantly cheaper than having a third of that infrastructure allocated 24/7.
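The trade-off can be worked through with a few lines of Python. This is a rough sketch that assumes a virtualized node and a bare metal node cost the same per hour, which real pricing will not match exactly; the node count of 10 and the function names are made up for illustration.

```python
def node_hours_per_day(nodes, hours_per_day):
    """Node-hours consumed per day by a cluster of the given size."""
    return nodes * hours_per_day


def break_even_hours_per_day(overhead_factor, hours_in_day=24):
    """Daily usage above which the dedicated cluster becomes cheaper.

    Assumption: the on-demand cluster needs overhead_factor times as many
    nodes, and every node costs the same per hour.
    """
    return hours_in_day / overhead_factor


bare_metal_nodes = 10   # hypothetical size of the dedicated cluster
overhead_factor = 3     # the 3x estimate used in this post

on_demand = node_hours_per_day(bare_metal_nodes * overhead_factor, 1)
dedicated = node_hours_per_day(bare_metal_nodes, 24)

print("on-demand, 1 hour a day:", on_demand, "node-hours")
print("dedicated, 24/7:", dedicated, "node-hours")
print("break-even:", break_even_hours_per_day(overhead_factor), "hours a day")
```

Under these assumptions the sporadic workload uses an eighth of the node-hours, and the on-demand option stays ahead until the analysis runs for about 8 hours a day.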

OpenStack Bare Metal Cloud

The analysis above shows that for I/O intensive workloads, virtualized infrastructure isn't such a good fit.

Cloud is often viewed as an infrastructure on top of virtualization. Therefore, by definition, a cloud-based infrastructure inherits the benefits and limitations of virtualization.

Is the coupling between cloud and virtualization really mandatory?

If we think of a cloud as an infrastructure for getting compute, storage and network resources on-demand, then there is nothing to necessitate the coupling of the cloud with virtualization. It became common to pair the two mainly because of the complexity involved with provisioning non-virtualized resources and the limitations in enabling partitioning of a given bare metal machine.

As I noted in one of my previous posts, Bare Metal Cloud/PaaS, there are ways today to provision a bare metal machine from an image and partition it just like we would do with a hypervisor-based VM. There are already cloud providers that offer a choice of bare metal machines as part of their cloud infrastructure.

A new bare metal project in the OpenStack Grizzly release takes this a step further. It allows us to use the same compute API (Nova) and allocate a bare metal machine just as we would a virtualized machine.

All we need to do to make the switch is to change the image type of our target machine, and the cloud infrastructure will know to map that request and allocate a bare metal machine instead of a virtualized instance.
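As a rough illustration of how small that switch is from the caller's side, the sketch below builds two request bodies for the Nova "create server" call and compares them. The name, imageRef and flavorRef fields are the standard fields of that request; the image and flavor IDs are hypothetical placeholders, and a particular deployment may need more than an image change, which goes beyond what this post covers.

```python
import json


def build_server_request(name, image_ref, flavor_ref):
    """Body for a Nova 'create server' request (POST .../servers)."""
    return {
        "server": {
            "name": name,
            "imageRef": image_ref,
            "flavorRef": flavor_ref,
        }
    }


# Hypothetical placeholder IDs; real values come from your own cloud.
virtual = build_server_request("bigdata-node-1", "IMAGE_ID_VIRTUALIZED", "FLAVOR_ID")
bare_metal = build_server_request("bigdata-node-1", "IMAGE_ID_BARE_METAL", "FLAVOR_ID")

# Collect the fields that differ between the two requests.
changed = sorted(
    key for key in virtual["server"]
    if virtual["server"][key] != bare_metal["server"][key]
)

print(json.dumps(bare_metal, indent=2))
print("fields that differ:", changed)
```

Everything else about the request, including the code that sends it, stays the same for both kinds of machine.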

With this option we can now run our entire Big Data workload on the cloud and not worry about switching environments depending on whether the workload happens to be sporadic or I/O intensive.

HubSpot OpenStack – Bare Metal Case Study

During the last OpenStack Summit, Jim O’Neill, CIO at HubSpot, showed how using OpenStack with a combination of a public virtualized cloud and a private bare metal cloud enabled a 4X increase in their infrastructure efficiency:

“We took this single image, picked it up from public cloud into a Rackspace-powered private cloud and saw a 4X increased efficiency running that workload.”

Moving from Existing Data Centers to the Cloud

Many existing Big Data and BI systems run in traditional data center environments. Moving those systems into an OpenStack-based environment isn't going to be a walk in the park. This is where automation frameworks, such as Cloudify in combination with Chef and Puppet, make the transition smoother.

In this approach, we can automate the deployment of our existing Big Data/BI systems in a way that will be abstracted from the underlying infrastructure. We can later use this abstraction to run Big Data systems in our existing data center, and when ready, we can use the same deployment framework on an OpenStack-based environment without re-doing any of that investment.

Learn more through hands-on experience - a Real Life Experience on HP OpenStack

As often happens in this sort of conceptual discussion, the points and arguments in this post may sound abstract and hard to grasp. To make it more down to earth, we've made NoSQL datastores such as Couchbase, MongoDB, Cassandra and ElasticSearch, as well as Big Data applications, available on-demand on HP OpenStack cloud services. You can use this reference to launch any of those applications and then use the management console to browse through the recipes and customize them to your environment as needed.

It is also worth pointing out that the recipes behind this project are available on GitHub and can be easily deployed on your cloud, in your data center or even on your desktop in pretty much the same way. If you are interested in more details, please post a comment on this post or in the Cloudify forum.

For more on Big Data on OpenStack, come check out my presentation at Cloud Expo East on Tuesday, June 11th at 8:15am in the “Cloud Computing and Big Data” Track.
