Offloading your performance and scalability concerns seems like a great idea, but it's not necessarily feasible

InfoWorld | Jul 8, 2013

Certain elements of cloud computing are inarguably beneficial. The ability to quickly provision, clone, and deploy servers to address capacity issues is a definite plus, and the fact that you can effortlessly add elements such as load balancers, big storage, and databases is equally compelling. However, there is a downside that needs to be understood: the fact that, in many cases, cloud server instances can exhibit wildly different performance metrics, measured from second to second.

Sure, SLAs and guarantees made by the cloud provider can address these issues after the fact, but when it's crunch time, there's nothing you can do other than communicate the problem and hope it gets fixed quickly. You may have visibility into what your instances are doing, but you have no idea what the underlying hardware is doing, how it's configured, or how oversubscribed it might be. That's the nature of the cloud.

There's no secret involved in how cloud providers operate. They have built tools around various virtualization platforms to allow for self-service VM provisioning, and they've tied in other value-adds, but in the end, it all comes down to the same basic virtualization frameworks that we use in-house today. Those, however, we control both inside and out. In the cloud, we're stuck on the inside.

Let's say we're deploying a service on a public cloud that will require some number of Web servers, at least one database server, and a load balancer. The load balancer is run by the cloud provider, and we have no visibility into its load or performance. As a result, if we start to see the number of incoming connections drop, we won't know whether the incoming load is dipping or if there's a capacity issue with the load balancer. Or to take another example, we might find the performance of our Web servers varying greatly, despite being identically configured -- as I recently saw with a cloud server instance.

As I was turning up a test instance with a well-known cloud provider, I ran some tests of the underlying server. Essentially, I used Apache's ab benchmarking tool to measure the performance of Nginx on the host, so I was hitting the server from itself, requesting the same PNG file 3,000 times, with 20 concurrent connections and a second rest between each test run. The Nginx configuration was set to cache these files, so it was simply pulling the image out of RAM and shipping it, no disk I/O involved.
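The test described above can be sketched as a small wrapper script. This is a hedged reconstruction, not the exact commands used: the URL and output parsing are assumptions, though the 3,000 requests, 20 concurrent connections, and one-second rest match the description, and the "Requests per second" line is standard `ab` summary output.

```python
import re
import subprocess
import time

def parse_rps(ab_output: str) -> float:
    """Extract the mean requests/sec from ab's summary output."""
    m = re.search(r"Requests per second:\s+([\d.]+)", ab_output)
    if m is None:
        raise ValueError("no 'Requests per second' line found")
    return float(m.group(1))

def run_passes(url: str, passes: int = 6) -> list:
    """Run ab repeatedly: 3,000 requests, 20 concurrent, 1s rest between runs."""
    results = []
    for _ in range(passes):
        out = subprocess.run(
            ["ab", "-n", "3000", "-c", "20", url],
            capture_output=True, text=True, check=True,
        ).stdout
        results.append(parse_rps(out))
        time.sleep(1)  # the one-second rest between test passes
    return results

if __name__ == "__main__":
    # Hypothetical URL; the article hit a cached PNG on the host's own Nginx.
    print(run_passes("http://localhost/test.png"))
```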

Over the course of six test passes, I witnessed a high of 7,749.1 requests per second and a low of 4,754.43 requests per second. The average across all tests was 6,499.19 requests per second. That's a substantial spread, and it makes forecasting scalability a real challenge.

On the other hand, running the same tests against an in-house VM with the same number of vCPUs and RAM, I saw a high of 15,176.66 requests per second and a low of 14,507.47 requests per second, with an average of 14,829.52 requests per second. These are obviously much more consistent results. They're also nearly three times higher than the results from the cloud instance.

The CPUs in use were different, which accounts for some of the disparity. (The in-house VM was using Intel Xeon E5-2670 CPUs at 2.6GHz with 20MB cache, while the cloud instance was running on AMD Opteron 4332 HE CPUs at 3.0GHz with 2MB cache.) But that's not the whole story. I should have seen equally consistent results on the cloud instance, just slower. Instead, the spread was almost 50 percent of the average result, compared to less than 5 percent of the average for the in-house VM.
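A quick back-of-the-envelope check confirms those spread figures, using the numbers reported above:

```python
def spread_pct(high: float, low: float, mean: float) -> float:
    """Range (high - low) expressed as a percentage of the mean result."""
    return (high - low) / mean * 100

cloud = spread_pct(7749.1, 4754.43, 6499.19)         # cloud instance
in_house = spread_pct(15176.66, 14507.47, 14829.52)  # in-house VM

print(f"cloud: {cloud:.1f}%, in-house: {in_house:.1f}%")
```

This prints roughly 46 percent for the cloud instance and 4.5 percent for the in-house VM, matching the "almost 50" and "less than 5" figures.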

Cloud servers are generally sold by the vCPU count, RAM, and bandwidth utilization, but clearly not all instances are created equal. Even if they were, the capacity of the underlying hardware can vary wildly depending on where the instance happens to be. The solution from a purely operational standpoint is to overbuild your cloud infrastructure to account for these large performance disparities, but that has its own pitfalls, including extra costs.

Also, even if the instances are overbuilt, performance can suffer when the load balancer tries to level incoming load across several instances that are performing at different levels, even though they're ostensibly identical in spec. If the load balancer directs traffic to the least loaded instance as measured by connection count, that instance may very well be underperforming compared to other "identical" instances that are handling more connections but actually operating faster.

There is no good solution to this, other than maintaining vigilance and pressing your cloud provider to deliver what was promised. I'd recommend running scheduled performance tests on your instances to check their performance levels over time and using these results as ammunition in discussions with the provider.
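Such a scheduled test might look like the sketch below, run from cron or a similar scheduler. The log path, URL, and benchmark parameters are assumptions for illustration; the parsing relies on standard `ab` summary output.

```python
import csv
import datetime
import re
import subprocess

LOG_PATH = "perf_log.csv"  # assumed location for the running record

def benchmark_rps(url: str) -> float:
    """One ab pass; returns the mean requests/sec from the summary line."""
    out = subprocess.run(
        ["ab", "-n", "3000", "-c", "20", url],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(re.search(r"Requests per second:\s+([\d.]+)", out).group(1))

def log_result(rps: float, path: str = LOG_PATH) -> None:
    """Append a timestamped result, building the history to show the provider."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([datetime.datetime.now().isoformat(), rps])

if __name__ == "__main__":
    # Hypothetical URL; point this at a cached static file on your instance.
    log_result(benchmark_rps("http://localhost/test.png"))
```

Run hourly or daily, the accumulated CSV becomes exactly the kind of evidence that moves a conversation with a provider from anecdote to data.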

The reasons for using the cloud are many, but cloud servers are certainly not the hands-free panacea they might seem to be. You may reduce some responsibilities, but you will gain others.