Posts Tagged 'Instances'

In December, I posted a MongoDB performance analysis that showed the quantitative benefits of using bare metal servers for MongoDB workloads. It should come as no surprise that in the wake of SoftLayer's Riak launch, we've got some similar data to share about running Riak on bare metal.

To run this test, we started by creating five-node clusters with Riak 1.3.1 on SoftLayer bare metal servers and on a popular competitor's public cloud instances. For the SoftLayer environment, we created these clusters using the Riak Solution Designer, so the nodes were all provisioned, configured and clustered for us automatically when we ordered them. For the public cloud virtual instance Riak cluster, each node was provisioned indvidually using a Riak image template and manually configured into a cluster after all had come online. To optimize for Riak performance, I made a few tweaks at the OS level of our servers (running CentOS 64-bit):

Noatime
Nodiratime
barrier=0
data=writeback
ulimit -n 65536

The common Noatime and Nodiratime settings eliminate the need for writes during reads to help performance and disk wear. The barrier and writeback settings are a little less common and may not be what you'd normally set. Although those settings present a very slight risk for loss of data on disk failure, remember that the Riak solution is deployed in five-node rings with data redundantly available across multiple nodes in the ring. With that in mind and considering each node also being deployed with a RAID10 storage array, you can see that the minor risk for data loss on the failure of a single disk in the entire solution would have no impact on the entire data set (as there are plenty of redundant copies for that data available). Given the minor risk involved, the performance increases of those two settings justify their use.

With all of the nodes tweaked and configured into clusters, we set up Basho's test harness — Basho Bench — to remotely simulate load on the deployments. Basho Bench allows you to create a configurable test plan for a Riak cluster by configuring a number of workers to utilize a driver type to generate load. It comes packaged as an Erlang application with a config file example that you can alter to create the specifics for the concurrency, data set size, and duration of your tests. The results can be viewed as CSV data, and there is an optional graphics package that allows you to generate the graphs that I am posting in this blog. A simplified graphic of our test environment would look like this:

You may notice that in the test cases that use SoftLayer "Medium" Servers, the virtual provider nodes are running 26 virtual compute units against our dual proc hex-core servers (12 cores total). In testing with Riak, memory is important to the operations than CPU resources, so we provisioned the virtual instances to align with the 36GB of memory in each of the "Medium" SoftLayer servers. In the public cloud environment, the higher level of RAM was restricted to packages with higher CPU, so while the CPU counts differ, the RAM amounts are as close to even as we could make them.

One final "housekeeping" note before we dive into the results: The graphs below are pulled directly from the optional graphics package that displays Basho Bench results. You'll notice that the scale on the left-hand side of graphs differs dramatically between the two environments, so a cursory look at the results might not tell the whole story. Click any of the graphs below for a larger version. At the end of each test case, we'll share a few observations about the operations per second and latency results from each test. When we talk about latency in the "key observation" sections, we'll talk about the 99th percentile line — 99% of the results had latency below this line. More simply you could say, "This is the highest latency we saw on this platform in this test." The primary reason we're focusing on this line is because it's much easier to read on the graphs than the mean/median lines in the bottom graphs.

The SoftLayer environment showed much more consistency in operations per second with an average throughput around 450 Op/sec. The virtual environment throughput varied significantly between about 50 operations per second to more than 600 operations per second with the trend line fluctuating slightly between about 220 Op/sec and 350 Op/sec.

Comparing the latency of get and put requests, the 99th percentile of results in the SoftLayer environment stayed around 50ms for gets and under 200ms for puts while the same metric for the virtual environment hovered around 800ms in gets and 4000ms in puts. The scale of the graphs is drastically different, so if you aren't looking closely, you don't see how significantly the performance varies between the two.

Similar to the results of Test 1, the throughput numbers from the bare metal environment are more consistent (and are consistently higher) than the throughput results from the virtual instance environment. The SoftLayer environment performed between 1500 and 1750 operations per second on average while the virtual provider environment averaged around 1200 operations per second throughout the test.

The latency of get and put requests in Test 2 also paints a similar picture to Test 1. The 99th percentile of results in the SoftLayer environment stayed below 50ms and under 400ms for puts while the same metric for the virtual environment averaged about 250ms in gets and over 1000ms in puts. Latency in a big data application can be a killer, so the results from the virtual provider might be setting off alarm bells in your head.

In Test 3, we're using the same specs in our virtual provider nodes, so the results for the virtual node environment are the same in Test 3 as they are in Test 2. In this Test, the SoftLayer environment substitutes SSD hard drives for the 15K SAS drives used in Test 2, and the throughput numbers show the impact of that improved I/O. The average throughput of the bare metal environment with SSDs is between 1750 and 2000 operations per second. Those numbers are slightly higher than the SoftLayer environment in Test 2, further distancing the bare metal results from the virtual provider results.

The latency of gets for the SoftLayer environment is very difficult to see in this graph because the latency was so low throughout the test. The 99th percentile of puts in the SoftLayer environment settled between 500ms and 625ms, which was a little higher than the bare metal results from Test 2 but still well below the latency from the virtual environment.

Summary

The results show that — similar to the majority of data-centric applications that we have tested — Riak has more consistent, better performing, and lower latency results when deployed onto bare metal instead of a cluster of public cloud instances. The stark differences in consistency of the results and the latency are noteworthy for developers looking to host their big data applications. We compared the 99th percentile of latency, but the mean/median results are worth checking out as well. Look at the mean and median results from the SoftLayer SSD Node environment: For gets, the mean latency was 2.5ms and the median was somewhere around 1ms. For puts, the mean was between 7.5ms and 11ms and the median was around 5ms. Those kinds of results are almost unbelievable (and that's why I've shared everything involved in completing this test so that you can try it yourself and see that there's no funny business going on).

It's commonly understood that local single-tenant resources that bare metal will always perform better than network storage resources, but by putting some concrete numbers on paper, the difference in performance is pretty amazing. Virtualizing on multi-tenant solutions with network attached storage often introduces latency issues, and performance will vary significantly depending on host load. These results may seem obvious, but sometimes the promise of quick and easy deployments on public cloud environments can lure even the sanest and most rational developer. Some applications are suited for public cloud, but big data isn't one of them. But when you have data-centric apps that require extreme I/O traffic to your storage medium, nothing can beat local high performance resources.

SoftLayer Private Clouds are officially live, and that means you can now order and provision your very own private cloud infrastructure on Citrix CloudPlatform quickly and easily. Chief Scientist Nathan Day introduced private clouds on the blog when it was announced at Cloud Expo East, and CTO Duke Skarda followed up with an explanation of the architecture powering SoftLayer Private Clouds. The most amazing claim: You can order a private cloud infrastructure and spin up its first virtual machines in a matter of hours rather than days, weeks or months.

If you've ever looked at building your own private cloud in the past, the "days, weeks or months" timeline isn't very surprising — you have to get the hardware provisioned, the software installed and the network configured ... and it all has to work together. Hearing that SoftLayer Private Clouds can be provisioned in "hours" probably seems too good to be true to administrators who have tried building a private cloud in the past, so I thought I'd put it to the test by ordering a private cloud and documenting the experience.

At 9:30am, I walked over to Phil Jackson's desk and asked him if he would be interested in helping me out with the project. By 9:35am, I had him convinced (proof), and the clock was started.

When we started the order process, part of our work is already done for us:

To guarantee peak performance of the CloudPlatform management server, SoftLayer selected the hardware for us: A single processor quad core Xeon 5620 server with 6GB RAM, GigE, and two 2.0TB SATA II HDDs in RAID1. With the management server selected, our only task was choosing our host server and where we wanted the first zone (host server and management server) to be installed:

For our host server, we opted for a dual processor quad core Xeon 5504 with the default specs, and we decided to spin it up in DAL05. We added (and justified) a block of 16 secondary IP addresses for our first zone, and we submitted the order. The time: 9:38am.

At this point, it would be easy for us to game the system to shave off a few minutes from the provisioning process by manually approving the order we just placed (since we have access to the order queue), but we stayed true to the experiment and let it be approved as it normally would be. We didn't have to wait long:

At 9:42am, our order was approved, and the pressure was on. How long would it take before we were able to log into the CloudStack portal to create a virtual machine? I'd walked over to Phil's desk 12 minutes ago, and we still had to get two physical servers online and configured to work with each other on CloudPlatform. Luckily, the automated provisioning process took on a the brunt of that pressure.

Both server orders were sent to the data center, and the provisioning system selected two pieces of hardware that best matched what we needed. Our exact configurations weren't available, so a SBT in the data center was dispatched to make the appropriate hardware changes to meet our needs, and the automated system kicked into high gear. IP addresses were assigned to the management and host servers, and we were able to monitor each server's progress in the customer portal. The hardware was tested and prepared for OS install, and when it was ready, the base operating systems were loaded — CentOS 6 on the management server and Citrix XenServer 6 on the host server. After CentOS 6 finished provisioning on the management server, CloudStack was installed. Then we got an email:

At 11:24am, less than two hours from when I walked over to Phil's desk, we had two servers online and configured with CloudStack, and we were ready to provision our first virtual machines in our private cloud environment.

We log into CloudStack and added our first instance:

We configured our new instance in a few clicks, and we clicked "Launch VM" at 11:38am. It came online in just over 3 minutes (11:42am):

I got from "walking to Phil's desk" to having a multi-server private cloud infrastructure running a VM in exactly two hours and twelve minutes. For fun, I created a second VM on the host server, and it was provisioned in 31.7 seconds. It's safe to say that the claim that SoftLayer takes "hours" to provision a private cloud has officially been confirmed, but we thought it would be fun to add one more wrinkle to the system: What if we wanted to add another host server in a different data center?

From the "Hardware" tab in the SoftLayer portal, we selected "Add Zone" to from the "Actions" in the "Private Clouds" section, and we chose a host server with four portable IP addresses in WDC01. The zone was created, and the host server went through the same hardware provisioning process that our initial deployment went through, and our new host server was online in < 2 hours. We jumped into CloudStack, and the new zone was created with our host server ready to provision VMs in Washington, D.C.

Given how quick the instances were spinning up in the first zone, we timed a few in the second zone ... The first instance was online in about 4 minutes, and the second was running in 26.8 seconds.

By the time I went out for a late lunch at 1:30pm, we'd spun up a new private cloud infrastructure with geographically dispersed zones that launched new cloud instances in under 30 seconds. Not bad.