from VMware's performance team


“Containers without compromise” – This was one of the key messages at VMworld 2014 USA in San Francisco. It was presented in the opening keynote, and the advantages of running Docker containers inside virtual machines were then discussed in detail in several breakout sessions. These benefits include security and isolation guarantees as well as an existing, rich set of management functionality. But some may say, “These benefits don’t come for free: what about the performance overhead of running containers in a VM?”

A recent report compared the performance of a Docker container to a KVM VM and showed very poor performance in some micro-benchmarks and real-world use cases: up to 60% degradation. These results were somewhat surprising to those of us accustomed to near-native performance of virtual machines, so we set out to do similar experiments with VMware vSphere. Below, we present our findings from running Docker containers in a vSphere VM and in a native configuration. Briefly,

We find that for most of these micro-benchmarks and Redis tests, vSphere delivered near-native performance with generally less than 5% overhead.

Running an application in a Docker container in a vSphere VM has overhead very similar to that of running the container on a native OS (directly on a physical server).

Next, we present the configuration and benchmark details as well as the performance results.

Deployment Scenarios

We compare four different scenarios as illustrated below:

Native: Linux OS running directly on hardware (Ubuntu, CentOS)

vSphere VM: Upcoming release of vSphere with the same guest OS as native

Native-Docker: Docker version 1.2 running on a native OS

VM-Docker: Docker version 1.2 running in guest VM on a vSphere host

In each configuration all the power management features are disabled in the BIOS and Ubuntu OS.

Figure 1: Different test scenarios

Benchmarks/Workloads

For this study, we used the micro-benchmarks listed below and also simulated a real-world use case.

- Micro-benchmarks:

LINPACK: This benchmark solves a dense system of linear equations. For large problem sizes it has a large working set and does mostly floating point operations.

STREAM: This benchmark measures memory bandwidth across various configurations.

FIO: This benchmark is used for I/O benchmarking for block devices and file systems.

Redis: In this experiment, many clients perform continuous requests to the Redis server (key-value datastore).

For all of the tests, we ran multiple iterations and report the average across runs.

Performance Results

LINPACK

LINPACK solves a dense system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system of N equations, converts that time into a performance rate, and tests the results for accuracy. We used an optimized version of the LINPACK benchmark binary based on the Intel Math Kernel Library (MKL).
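As a rough illustration of what the benchmark measures (this sketch uses NumPy rather than the MKL-optimized binary used in our tests, and a much smaller N), the reported rate comes from converting the solve time into floating point operations per second:

```python
import time
import numpy as np

# Illustrative only: solve Ax = b for a small N and convert the solve
# time into a GFLOP/s rate, as LINPACK does (the study used Intel's
# MKL-optimized binary with N = 45,000 and 65,000).
N = 1000
rng = np.random.default_rng(0)
A = rng.standard_normal((N, N))
b = rng.standard_normal(N)

start = time.perf_counter()
x = np.linalg.solve(A, b)                # LU factorization + triangular solves
elapsed = time.perf_counter() - start

flops = (2.0 / 3.0) * N**3 + 2.0 * N**2  # standard LINPACK operation count
gflops = flops / elapsed / 1e9
residual = np.linalg.norm(A @ x - b)     # accuracy check, as in LINPACK
print(f"{gflops:.2f} GFLOP/s, residual {residual:.2e}")
```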

We disabled HT for this run, as recommended by the benchmark guidelines, to get the best peak performance. For the 45K problem size, the benchmark consumed about 16GB of memory. All memory was backed by transparent large pages. For VM results, large pages were used both in the guest (transparent large pages) and at the hypervisor level (the default for the vSphere hypervisor). There was 1-2% run-to-run variation for the 45K problem size. For the 65K size, 33.8GB of memory was consumed and there was less than 1% variation.

As shown in Figure 2, there is almost negligible virtualization overhead in the 45K problem size. For a bigger problem size, there is some inherent hardware virtualization overhead due to nested page table walk. This results in the 5% drop in performance observed in the VM case. There is no additional overhead of running the application in a Docker container in a VM compared to running the application directly in the VM.

STREAM

We used a NUMA-aware STREAM benchmark, which is the classical STREAM benchmark extended to take advantage of NUMA systems. This benchmark measures the memory bandwidth across four different operations: Copy, Scale, Add, and Triad.

We used an array size of 2 billion, which used about 45GB of memory. We ran the benchmark with 64 threads both in the native and virtual cases. As shown in Figure 3, the VM added about 2-3% overhead across all four operations. The small 1-2% overhead of using a Docker container on a native platform is probably in the noise margin.
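A minimal single-threaded sketch of the four STREAM kernels follows (the actual benchmark is a multi-threaded, NUMA-aware C program with a 2-billion-element array; the array size and resulting bandwidths here are purely illustrative):

```python
import time
import numpy as np

# Illustrative sketch of the four STREAM kernels over float64 arrays.
n = 10_000_000
a = np.zeros(n); b = np.full(n, 2.0); c = np.full(n, 0.5)
scalar = 3.0

def bandwidth(kernel, bytes_moved):
    start = time.perf_counter()
    kernel()
    return bytes_moved / (time.perf_counter() - start) / 1e9  # GB/s

# Bytes moved per kernel: 8 bytes/element, counting one read or write per array.
copy  = bandwidth(lambda: np.copyto(a, b),               2 * 8 * n)
scale = bandwidth(lambda: np.multiply(c, scalar, out=a), 2 * 8 * n)
add   = bandwidth(lambda: np.add(b, c, out=a),           3 * 8 * n)
# NumPy allocates a temporary for scalar * c, so Triad's figure is approximate.
triad = bandwidth(lambda: np.add(b, scalar * c, out=a),  3 * 8 * n)
print(f"Copy {copy:.1f}  Scale {scale:.1f}  Add {add:.1f}  Triad {triad:.1f} GB/s")
```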

FIO

We used the Flexible I/O (FIO) tool version 2.1.3 to compare the storage performance for the native and virtual configurations, with Docker containers running in both. We created a 10GB file on a 400GB local SSD and used direct I/O for all our tests so that there were no effects of buffer caching inside the OS. We used a 4k I/O size and tested three different I/O profiles: random 100% read, random 100% write, and a mixed case with random 70% read and 30% write. For the 100% random read and write tests, we selected 8 threads and an I/O depth of 16, whereas for the mixed test, we selected an I/O depth of 32 and 8 threads. We used taskset to set the CPU affinity of the FIO threads in all configurations. All the details of the experimental setup are given below:

The figure above shows the normalized maximum IOPS achieved for different configurations and different I/O profiles. For random read in a VM, we see that there is about 2% reduction in maximum achievable IOPS when compared to the native case. However, for the random write and mixed tests, we observed almost the same performance (within the noise margin) compared to the native configuration.
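The shape of such a random-read test can be sketched as follows. This is an assumption-laden stand-in for fio: no O_DIRECT, a single thread, and a tiny file, so the page cache serves most reads and the resulting IOPS are not comparable to the figures above.

```python
import os, random, time, tempfile

# Minimal sketch of a 4k random-read IOPS measurement (the study used
# fio 2.1.3 with direct I/O, 8 threads, and an I/O depth of 16).
IO_SIZE = 4096
FILE_SIZE = 64 * 1024 * 1024          # small stand-in for the 10GB test file
OPS = 5000

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(FILE_SIZE))
    path = f.name

fd = os.open(path, os.O_RDONLY)       # note: no O_DIRECT here, so the
random.seed(42)                       # page cache will serve most reads
start = time.perf_counter()
for _ in range(OPS):
    offset = random.randrange(FILE_SIZE // IO_SIZE) * IO_SIZE
    data = os.pread(fd, IO_SIZE, offset)
elapsed = time.perf_counter() - start
os.close(fd)
os.unlink(path)
print(f"{OPS / elapsed:,.0f} IOPS (4k random read)")
```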

Netperf

Netperf is used to measure throughput and latency of networking operations. All the details of the experimental setup are given below:

The server machine for Native is configured to have only 2 CPUs online for a fair comparison with a 2-vCPU VM. The client machine is also configured to have 2 CPUs online to reduce variability. We tested four configurations: directly on the physical hardware (Native), in a Docker container (Native-Docker), in a virtual machine (VM), and in a Docker container inside a VM (VM-Docker). For the two Docker deployment scenarios, we also studied the effect of using host networking as opposed to the Docker bridge mode (the default operating mode), resulting in two additional configurations (Native-Docker-HostNet and VM-Docker-HostNet) and making a total of six configurations.

We used TCP_STREAM and TCP_RR tests to measure the throughput and round-trip network latency between the server machine and the client machine using a direct 10Gbps Ethernet link between two NICs. We used standard network tuning like TCP window scaling and setting socket buffer sizes for the throughput tests.
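A TCP_RR-style request/response loop can be approximated with plain sockets. This sketch runs over loopback rather than a dedicated 10GbE link, so the absolute numbers differ greatly from the reported results:

```python
import socket, threading, time

# Rough TCP_RR-style round-trip measurement: a 1-byte request is echoed
# back, and latency is the mean time per request/response pair.
def echo_server(srv):
    conn, _ = srv.accept()
    with conn:
        while (msg := conn.recv(1)):
            conn.sendall(msg)

srv = socket.create_server(("127.0.0.1", 0))
port = srv.getsockname()[1]
threading.Thread(target=echo_server, args=(srv,), daemon=True).start()

cli = socket.create_connection(("127.0.0.1", port))
cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # latency-test tuning

rounds = 2000
start = time.perf_counter()
for _ in range(rounds):
    cli.sendall(b"x")          # 1-byte request
    cli.recv(1)                # 1-byte response
elapsed = time.perf_counter() - start
cli.close()
print(f"mean round-trip latency: {elapsed / rounds * 1e6:.1f} us")
```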

Figure 5: Netperf receive performance for different test scenarios

Figure 6: Netperf transmit performance for different test scenarios

Figure 5 and Figure 6 show the unidirectional throughput over a single TCP connection with a standard 1500-byte MTU for both the transmit and receive TCP_STREAM cases. (We used multiple streams in the VM-Docker* transmit case to reduce run-to-run variability due to Docker bridge overhead and get predictable results.) Throughput numbers for all configurations are identical and equal to the maximum possible 9.40Gbps on a 10GbE NIC.

For the latency tests, we used the latency sensitivity feature introduced in vSphere 5.5 and applied the best practices for tuning latency in a VM as mentioned in this white paper. As shown in Figure 7, latency in a VM with the VMXNET3 device is only 15 microseconds more than in the native case because of the hypervisor networking stack. If users wish to reduce latency even further for extremely latency-sensitive workloads, pass-through mode or SR-IOV can be configured to allow the guest VM to bypass the hypervisor network stack. This configuration can achieve round-trip latency similar to native, as shown in Figure 8. The Native-Docker and VM-Docker configurations add about 9-10 microseconds of overhead due to the Docker bridge NAT function. A Docker container (running natively or in a VM) configured to use host networking achieves latencies similar to those observed when the workload is not run in a container (native or VM).

Redis

We also wanted to take a look at how Docker in a virtualized environment performs with real world applications. We chose Redis because: (1) it is a very popular application in the Docker space (based on the number of pulls of the Redis image from the official Docker registry); and (2) it is very demanding on several subsystems at once (CPU, memory, network), which makes it very effective as a whole system benchmark.

Our test-bed comprised two hosts connected by a 10GbE network. One of the hosts ran the Redis server in different configurations as mentioned in the netperf section. The other host ran the standard Redis benchmark program, redis-benchmark, in a VM.

The details about the hardware and software used in the experiments are the following:

Since Redis is a single-threaded application, we decided to pin it to one of the CPUs and pin the network interrupts to an adjacent CPU in order to maximize cache locality and avoid cross-NUMA node memory access. The workload we used consists of 1000 clients with a pipeline of 1 outstanding request, setting a 1-byte value with a randomly generated key in a space of 100 billion keys. This workload is highly stressful to the system resources because: (1) every operation results in a memory allocation; (2) the payload size is as small as it gets, resulting in a very large number of small network packets; (3) as a consequence of (2), the frequency of operations is extremely high, resulting in complete saturation of the CPU running Redis and a high load on the CPU handling the network interrupts.
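For concreteness, one request of this workload looks roughly like the following (the `key:` prefix is an assumption for illustration; redis-benchmark speaks Redis's RESP wire protocol, sketched here):

```python
import random

# Sketch of one benchmark request: a SET of a 1-byte value under a random
# key drawn from a huge key space, serialized in the RESP wire protocol.
KEY_SPACE = 100_000_000_000   # 100 billion keys, as in the test

def resp_command(*parts: bytes) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = b"*%d\r\n" % len(parts)
    for p in parts:
        out += b"$%d\r\n%s\r\n" % (len(p), p)
    return out

random.seed(7)
key = b"key:%012d" % random.randrange(KEY_SPACE)   # hypothetical key format
payload = resp_command(b"SET", key, b"x")          # 1-byte value
print(payload)
```

Each such request is only a few dozen bytes on the wire, which is why the workload generates a flood of small packets.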

We ran five experiments for each of the above-mentioned configurations, and we measured the average throughput (operations per second) achieved during each run. The results of these experiments are summarized in the following chart.

Figure 9: Redis performance for different test scenarios

The results are reported as a ratio with respect to native of the mean throughput over the 5 runs (error bars show the range of variability over those runs).
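The normalization can be sketched as follows (the throughput numbers are hypothetical; only the method mirrors the description above):

```python
# Mean throughput over 5 runs, normalized to native, with the min-max
# range over runs serving as the error bars.
runs = {
    "Native": [101_000, 99_500, 100_200, 100_800, 99_900],  # hypothetical
    "VM":     [97_100, 96_400, 96_900, 97_300, 96_800],     # hypothetical
}

native_mean = sum(runs["Native"]) / len(runs["Native"])
ratios = {}
for config, values in runs.items():
    ratios[config] = sum(values) / len(values) / native_mean
    lo, hi = min(values) / native_mean, max(values) / native_mean
    print(f"{config}: {ratios[config]:.3f} (range {lo:.3f}-{hi:.3f})")
```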

Redis running in a VM has slightly lower performance than on a native OS because of the network virtualization overhead introduced by the hypervisor. When Redis is run in a Docker container on native, the throughput is significantly lower than native because of the overhead introduced by the Docker bridge NAT function. In the VM-Docker case, the performance drop compared to the Native-Docker case is almost exactly the same small amount as in the VM-Native comparison, again because of the network virtualization overhead. However, when Docker runs using host networking instead of its own internal bridge, near-native performance is observed for both the Docker on native hardware and Docker in VM cases, reaching 98% and 96% of the maximum throughput respectively.

Based on the above results, we can conclude that virtualization introduces only a 2% to 4% performance penalty. This makes it possible to run applications like Redis in a Docker container inside a VM and retain all the virtualization advantages (security and performance isolation, management infrastructure, and more) while paying only a small price in terms of performance.

Summary

In this blog, we showed that in addition to the well-known security, isolation, and manageability advantages of virtualization, running an application in a Docker container in a vSphere VM adds very little performance overhead compared to running the application in a Docker container on a native OS. Furthermore, we found that a container in a VM delivers near native performance for Redis and most of the micro-benchmark tests we ran.

In this post, we focused on the performance of running a single instance of an application in a container, VM, or native OS. We are currently exploring scale-out applications and the performance implications of deploying them on various combinations of containers, VMs, and native operating systems. The results will be covered in the next installment of this series. Stay tuned!

Power and cooling are a substantial portion of datacenter costs. Ideally, we could minimize these costs by optimizing the datacenter’s energy consumption without impacting performance. The Host Power Management feature, which has been enabled by default since ESXi 5.0, allows hosts to reduce power consumption and boost energy efficiency by putting processors into a low-power state when not fully utilized.

Power management can be controlled by either the BIOS or the operating system. In the BIOS, manufacturers provide several types of Host Power Management policies. Although they vary by vendor, most include “Performance,” which does not use any power saving techniques, “Balanced,” which claims to increase energy efficiency with minimal or no impact to performance, and “OS Controlled,” which passes power management control to the operating system. The “Balanced” policy is variably known as “Performance per Watt,” “Dynamic,” and other labels; consult your vendor for details. If “OS Controlled” is enabled in the BIOS, ESXi will manage power using one of the policies “High performance,” “Balanced,” “Low power,” or “Custom.” We chose to study Balanced because it is the default setting.

But can the Balanced setting, whether controlled by the BIOS or ESXi, reduce performance relative to the Performance setting? We have received reports from customers who have had performance problems while using the BIOS-controlled Balanced setting. Without knowing the effect of Balanced on performance and energy efficiency, when performance is at a premium users might select the Performance policy to play it safe. To answer this question we tested the impact of power management policies on performance and energy efficiency using VMmark 2.5.

VMmark 2.5 is a multi-host virtualization benchmark that uses varied application workloads as well as common datacenter operations to model the demands of the datacenter. VMs running diverse application workloads are grouped into units of load called tiles. For more details, see the VMmark 2.5 overview.

We tested three policies: the BIOS-controlled Performance setting, which uses no power management techniques, the ESXi-controlled Balanced setting (with the BIOS set to OS-Controlled mode), and the BIOS-controlled Balanced setting. The ESXi Balanced and BIOS-controlled Balanced settings cut power by reducing processor frequency and voltage among other power saving techniques.

We found that the ESXi Balanced setting did an excellent job of preserving performance, with no measurable performance impact at all levels of load. Not only was performance on par with expectations, but it did so while producing consistent improvements in energy efficiency, even while idle. By comparison, the BIOS Balanced setting aggressively saved power but created higher latencies and reduced performance. The following results detail our findings.

Testing Methodology
All tests were conducted on a four-node cluster running VMware vSphere 5.1. We compared performance and energy efficiency of VMmark between three power management policies: Performance, the ESXi-controlled Balanced setting, and the BIOS-controlled Balanced setting, also known as “Performance per Watt (Dell Active Power Controller).”

Results
To determine the maximum VMmark load supported for each power management setting, we increased the number of VMmark tiles until the cluster reached saturation, which is defined as the largest number of tiles that still meet Quality of Service (QoS) requirements. All data points are the mean of three tests in each configuration and VMmark scores are normalized to the BIOS Balanced one-tile score.

The VMmark scores were equivalent between the Performance setting and the ESXi Balanced setting with less than a 1% difference at all load levels. However, running on the BIOS Balanced setting reduced the VMmark scores an average of 15%. On the BIOS Balanced setting, the environment was no longer able to support nine tiles and, even at low loads, on average, 31% of runs failed QoS requirements; only passing runs are pictured above.

We also compared the improvements in energy efficiency of the two Balanced settings against the Performance setting. The Performance per Kilowatt metric, which is new to VMmark 2.5, models energy efficiency as VMmark score per kilowatt of power consumed. More efficient results will have a higher Performance per Kilowatt.
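In code, the metric is simply score per kilowatt (the scores and wattages below are hypothetical):

```python
# Sketch of VMmark 2.5's Performance per Kilowatt metric as described
# above: VMmark score divided by kilowatts of power consumed.
def perf_per_kilowatt(vmmark_score: float, watts: float) -> float:
    return vmmark_score / (watts / 1000.0)

# Equal scores at lower power yield higher energy efficiency:
perf     = perf_per_kilowatt(5.0, 1200.0)   # Performance policy (hypothetical)
balanced = perf_per_kilowatt(5.0, 1160.0)   # same score, ~3% less power
print(f"Performance: {perf:.2f}, ESXi Balanced: {balanced:.2f} score/kW")
```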

Two trends are visible in this figure. As expected, the Performance setting showed the lowest energy efficiency. At every load level, ESXi Balanced was about 3% more energy efficient than the Performance setting while delivering an equivalent score. The BIOS Balanced setting had the greatest energy efficiency, a 20% average improvement over Performance.

Second, an increase in load correlates with greater energy efficiency. As the CPUs become busier, throughput increases at a faster rate than the required power. This can be understood by noting that an idle server still consumes power, but with no work to show for it. A highly utilized server is typically the most energy efficient per request completed, which is confirmed in our results. Higher energy efficiency creates cost savings in host energy consumption and in cooling costs.

The bursty nature of most environments leads them to sometimes idle, so we also measured each host’s idle power consumption. The Performance setting showed an average of 128 watts per host, while ESXi Balanced and BIOS Balanced consumed 85 watts per host. Although the Performance and ESXi Balanced settings performed very similarly under load, hosts using ESXi Balanced and BIOS Balanced power management consumed 33% less power while idle.
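The idle-power savings figure follows directly from the measured per-host wattages:

```python
# Idle power: Performance setting vs. the two Balanced settings,
# using the per-host measurements reported above.
performance_idle_w = 128.0
balanced_idle_w = 85.0

savings = 1.0 - balanced_idle_w / performance_idle_w
print(f"idle power savings: {savings:.1%}")
```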

VMmark 2.5 scores are based on application and infrastructure workload throughput, while application latency reflects Quality of Service. For the Mail Server, Olio, and DVD Store 2 workloads, latency is defined as the application’s response time. We wanted to see how power management policies affected application latency as opposed to the VMmark score. All latencies are normalized to the lowest results.

Whereas the Performance and ESXi Balanced latencies tracked closely, BIOS Balanced latencies were significantly higher at all load levels. Furthermore, latencies were unpredictable even at low load levels, and for this reason, 31% of runs between one and eight tiles failed; these runs are omitted from the figure above. For example, half of the BIOS Balanced runs did not pass QoS requirements at four tiles. These higher latencies were the result of aggressive power saving by the BIOS Balanced policy.

Our tests showed that ESXi’s Balanced power management policy didn’t affect throughput or latency compared to the Performance policy, but did improve energy efficiency by 3%. While the BIOS-controlled Balanced policy improved power efficiency by an average of 20% over Performance, it was so aggressive in cutting power that it often caused VMmark to fail QoS requirements.

Overall, the BIOS-controlled Balanced policy produced substantial efficiency gains but with unpredictable performance, failed runs, and reduced performance at all load levels. This policy may still be suitable for some workloads that can tolerate this unpredictability, but it should be used with caution. On the other hand, the ESXi Balanced policy produced modest efficiency gains while doing an excellent job of protecting performance across all load levels. These findings make us confident that the ESXi Balanced policy is a good choice for most types of virtualized applications.

A system’s performance is often limited by the access time of its hard disk drive (HDD). Solid-state drives (SSDs), also known as Enterprise Flash Drives (EFDs), tout a superior performance profile to HDDs. In our previous comparison of EFD and HDD technologies using VMmark 2.1, we showed that EFD reads were on average four times faster than HDD reads, while EFD and HDD write speeds were comparable. However, EFDs are more costly per gigabyte.

Many vendors have attempted to address this issue using tiered storage technologies. Here, we tested the performance benefits of EMC’s FAST Cache storage array feature, which merges the strengths of both technologies. FAST Cache is an EFD-based read/write storage cache that supplements the array’s DRAM cache by giving frequently accessed data priority on the high performing EFDs. We used VMmark 2, a multi-host virtualization benchmark, to quantify the performance benefits of FAST Cache. For more details, see the overview, release notes for VMmark 2.1, and release notes for 2.1.1. VMmark 2 is an ideal tool to test FAST Cache performance for virtualized datacenters in that its varied workloads and bursty I/O patterns model the demands of the datacenter. We found that FAST Cache produced remarkable improvements in datacenter capacity and storage access latencies. With the addition of FAST Cache, the system could support twice as much load while still meeting QoS requirements.

FAST Cache
FAST Cache is a feature of EMC’s storage systems that tracks frequently accessed data on disk, promotes the data into an array-wide EFD cache to take advantage of Flash I/O access speeds, then writes it back to disk when the data is superseded in importance. FAST Cache optimizes the use of EFD storage. In most workloads only a small percentage of data will be frequently accessed. This is referred to as the ‘working set.’ An EFD-based cache allows the data in the working set to take advantage of the performance characteristics of EFDs while the rest of the data stays on lower-cost HDDs. Relevant data is rapidly promoted into the cache in increments of 64 KB pages, and a least-recently-used algorithm is used to decide which data to write back to disk.
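The page-granular, least-recently-used behavior described above can be sketched with a toy model (this illustrates LRU eviction over 64 KB pages only; EMC's actual promotion logic has its own criteria for deciding when a page is "frequently accessed"):

```python
from collections import OrderedDict

PAGE = 64 * 1024   # FAST Cache promotes data in 64 KB pages

class LRUPageCache:
    """Toy model of an EFD read/write cache with LRU eviction."""

    def __init__(self, capacity_pages: int):
        self.capacity = capacity_pages
        self.pages = OrderedDict()          # page number -> dirty flag

    def access(self, byte_offset: int, write: bool = False) -> bool:
        page = byte_offset // PAGE
        hit = page in self.pages
        if hit:
            self.pages.move_to_end(page)    # mark most recently used
        else:
            if len(self.pages) >= self.capacity:
                victim, dirty = self.pages.popitem(last=False)
                # a dirty victim would be written back to HDD here
            self.pages[page] = False
        if write:
            self.pages[page] = True         # served by cache, flushed later
        return hit

cache = LRUPageCache(capacity_pages=4)
hits = [cache.access(off) for off in (0, 65536, 0, 131072, 0)]
print(hits)   # repeated access to page 0 hits after the first miss
```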

The benefit achieved with FAST Cache depends on the workload’s I/O profile. As with most caches, FAST Cache will show the most benefit for I/O with a high locality of reference, such as database indices and reference tables. FAST Cache will be least beneficial to workloads with sequential I/O patterns like log files or large I/O size access because these may not access the same 64 KB block multiple times and the FAST Cache would never become populated.

Methodology
We used VMmark 2 to investigate several different factors relating to FAST Cache. We wanted to measure the performance benefit afforded by adding FAST Cache into a VMmark 2 environment and we wanted to observe how the performance benefit of FAST Cache would scale as we changed the size of the cache. We tested with FAST Cache disabled and with two different FAST Cache sizes which were made from two EFDs and eight EFDs in RAID 1, creating a cache of 92 GB and 366 GB usable space, respectively. FAST Cache was configured according to best practices to ensure FAST Cache performance was not limited by array bus bandwidth. After the FAST Cache was created, it was warmed up by repeating VMmark 2 runs until scores showed less than 3% variability between runs.

We also wanted to examine whether FAST Cache could reduce the hardware requirements of our tests. As processors and other system hardware components have increased in capacity and speed, storage has come under ever-greater pressure to deliver correspondingly higher performance. RAID groups of HDDs have been one answer to these increasing performance demands, as RAID arrays provide performance and reliability benefits over individual disks. In typical RAID configurations, performance increases nearly linearly as disks are added to the RAID group. However, adding disks in order to increase storage access speed can result in underutilization of HDD space, which becomes far greater than required. FAST Cache should allow us to reduce the number of HDDs we require for RAID performance benefits, also reducing the cluster’s total power, cooling, and space requirements, which results in lower cost. FAST Cache services the bulk of the workloads’ I/O operations at high speed, so it is acceptable to service the remainder of operations at lower speeds and use only as many HDDs as needed for storage capacity rather than performance.

To test whether an environment with FAST Cache and a reduced number of disks could perform as well as an environment without FAST Cache but with a larger number of disks, we tested performance with two different disk configurations. Workloads were tested on a set of 20 HDDs and then on a set of 11 HDDs, in both cases grouped into three LUNs. Each LUN was in a distinct RAID 0 group. Due to the performance characteristics of RAID 0, we expected the 20 HDD configuration to have better performance than the 11 HDD configuration. The placement of workloads onto LUNs was meant to model a naïve environment with nonoptimal storage setup. Two LUNs held workload tile data, and the third smaller LUN served as the destination for the VM Deploy and Storage vMotion workloads. The first LUN held VMs from the first and third tiles, and the second LUN held VMs from the second and fourth tiles. Running VMmark 2 with more than one tile per LUN was atypical of our best practices for the benchmark. It created a severe bottleneck for the disk, which was meant to simulate the types of storage performance issues we sometimes see in customer environments.

All VMmark 2 tests were conducted on a cluster of four identically configured entry-level Dell PowerEdge R310 servers running ESXi 5.0. All components in the environment besides FAST Cache and the number of HDDs remained unchanged during testing.

Results
To characterize cluster performance at multiple load levels, we increased the number of tiles until the cluster reached saturation, defined as when the run failed to meet Quality of Service (QoS) requirements. Scaling out the number of tiles until saturation allows us to determine the maximum VMmark 2 load the cluster could support and to compare performance at each level of load for each cache and storage configuration. All data points are the mean of three tests in each configuration. Scaling data was generated by normalizing every score to the lowest passing score, which was 1 tile with FAST Cache disabled on 20 HDDs.

With FAST Cache disabled, the 20 HDD LUNs reached saturation at 2 tiles, and the 11 HDD LUNs were unable to support even 1 tile of load. Because all VMs for each tile were placed on the same LUN, a 1-tile run used one LUN, consisting of only four of the 11 HDDs or eight of the 20 HDDs. Four HDDs were insufficient to provide the required QoS for even 1 tile. When FAST Cache was enabled, the 11 HDD and 20 HDD configurations supported 4 tiles. This is a remarkable improvement; with the addition of FAST Cache, the system could support twice as much load while still meeting QoS requirements. Even at lower load levels, the equivalent system with FAST Cache allowed greater throughput and showed resulting increases in the VMmark score of 26% at 1 tile and 31% at 2 tiles. With FAST Cache enabled, the configuration with 11 HDDs performed equivalently to one with 20 HDDs until the system approached saturation.

With FAST Cache enabled, the system supported twice as much load on almost half as many disks. The results show that an environment with a 92 GB FAST Cache was able to greatly outperform an HDD-only environment that contains 82% more disks. At 4 tiles with FAST Cache enabled, the cluster’s CPU utilization was approaching saturation, reaching an average of 84%, but was not yet bottlenecked on storage.

In our tests, performance did not scale up very much as we increased FAST Cache size from 92 GB to 366 GB and the number of HDDs from 11 to 20.

We can see that all configurations scaled very similarly from 1 to 3 tiles with only minor differences appearing, primarily between the 92 GB FAST Cache and 366 GB FAST Cache. Only at the highest load level did performance begin to diverge. Predictably, the largest cache configurations show the best performance at 4 tiles, followed by the smaller cache configurations. To determine whether this performance falloff was directly attributable to the cache size and number of HDDs, we needed to know whether FAST Cache was performing to capacity.

Below are the FAST Cache and DRAM cache hit percentages for read and write operations at the 4 tile load. On average, our VMmark testing had I/O operations of 24% reads and 76% writes.


With the 366 GB FAST Cache, nearly all reads and writes were hitting either the DRAM or FAST Cache. In these cases, the number of backing disks did not affect the score because disks were rarely being accessed. At this cache size, all frequently accessed data fit into the FAST Cache. However, with the 92 GB FAST Cache, the cache hit percentage decreased to 96.5% and 92.1% for the 11 HDD and 20 HDD configurations, respectively. This indicated that the entire working set could no longer fit into the 92 GB FAST Cache. The 11 HDD configuration began to show decreased performance relative to 20 HDDs, because although only 3.5% of total I/O operations were going to disk, the increase in disk latency was large enough to reduce throughput and affect VMmark score. Despite this, a FAST Cache of 92 GB was still sufficient to provide us with VMmark performance that met QoS requirements. The higher read hit percentages in the 11 HDD configuration reflected this reduced throughput. Lower throughput resulted in a smaller working set and an accordingly higher read hit percentage.
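A back-of-the-envelope calculation shows why even a 3.5% miss rate matters: mean access time is dominated by the slow tier. The latencies below are assumed values for illustration, not measurements from this study.

```python
FLASH_US = 200.0       # assumed EFD-cache service time, microseconds
HDD_US = 8000.0        # assumed HDD service time, microseconds

def mean_access_us(hit_rate: float) -> float:
    # Weighted average of cache-hit and disk-miss service times.
    return hit_rate * FLASH_US + (1.0 - hit_rate) * HDD_US

t_full = mean_access_us(1.000)   # whole working set in cache
t_965  = mean_access_us(0.965)   # 96.5% hit rate, as in the 11 HDD case
print(f"{t_full:.0f} us vs {t_965:.0f} us -> {t_965 / t_full:.1f}x slower")
```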

Overall, FAST Cache did an excellent job of identifying the working set. Although only 8% of the 1.09 TB dataset could fit in the 92 GB cache at any one time, at least 92% of I/O requests were hitting the cache.

Scaling FAST Cache gave us a sense of the working set size of the VMmark benchmark. As performance with the 92 GB FAST Cache demonstrated a knee at 3 tiles, this suggests the working set size at 3 tiles is less than 92 GB and the working set size at 4 tiles is slightly greater than 92 GB. Knowing the approximate working set size per tile would allow us to select the minimum FAST Cache size required if we wanted our entire working set to fit into the FAST Cache, even if we scaled the benchmark to an arbitrary number of tiles in a different cluster.

The results below show that I/O operations per second and I/O latency were affected by our environment characteristics.

The variability in read latency is clearly affected by both FAST Cache size and number of backing HDDs. Latency is highest with only 11 HDDs and the smaller FAST Cache, and decreases as we add HDDs. Latency decreases even more with the larger FAST Cache size as nearly all reads hit the cache. Write latency, however, is relatively constant across configurations, which is as expected because in each configuration nearly all writes are being served by either the DRAM cache or FAST Cache.

Summary
It’s clear that we can replace a large number of HDDs with a much smaller number of EFDs and get similar or improved performance results. An array with 11 HDDs and FAST Cache outperformed an array with 20 HDDs without FAST Cache. FAST Cache handles the workloads’ performance requirements so that we need only to supply the HDDs necessary for their storage space, rather than performance capabilities. This allows us to reduce the number of HDDs and their associated power, space, cooling, and cost.

Tiered storage solutions like FAST Cache make excellent use of EFDs, even to the extent that 92% or more of our I/O operations are benefitting from Flash-level latencies while the EFD storage itself holds only 8% of our total data. The increased VMmark scores demonstrate the ability of FAST Cache to pinpoint the most active data remarkably well, and, even in a bursty environment, show incredible improvements in I/O latency and in the load that a cluster can support. Our testing showed FAST Cache provides Flash-level storage access speeds to the data that needs it most, reduces storage bottlenecking and increases supported load, making FAST Cache a highly valuable addition to the datacenter.

Performance benchmarking is often conducted on top-of-the-line hardware, including hosts that typically have a large number of cores, maximum memory, and the fastest disks available. Hardware of this caliber is not always accessible to small or medium-sized businesses with modest IT budgets. As part of our ongoing investigation of different ways to benchmark the cloud using the newly released VMmark 2.0, we set out to determine whether a cluster of less powerful hosts could be a viable alternative for these businesses. We used VMmark 2.0 to see how a four-host cluster with a modest hardware configuration would scale under increasing load.

Workload throughput is often limited by disk performance, so the tests were repeated with two different storage arrays to show the effect that upgrading the storage would offer in terms of performance improvement. We tested two disk arrays that varied in both speed and number of disks, an EMC CX500 and an EMC CX3-20, while holding all other characteristics of the testbed constant.

To review, VMmark 2.0 is a next-generation, multi-host virtualization benchmark that models application performance and the effects of common infrastructure operations such as vMotion, Storage vMotion, and a virtual machine deployment. Each tile contains Microsoft Exchange 2007, DVD Store 2.1, and Olio application workloads which run in a throttled fashion. The Storage vMotion and VM deployment infrastructure operations require the user to specify a LUN as the storage destination. The VMmark 2.0 score is computed as a weighted average of application workload throughput and infrastructure operation throughput. For more details about VMmark 2.0, see the VMmark 2.0 website or Joshua Schnee’s description of the benchmark.
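The score composition described above can be sketched in a few lines. This is an illustration only: the reference throughputs, the weight, and the function name below are invented for the example; the real benchmark defines its own normalization and weighting.

```python
# Illustrative sketch of a VMmark 2.0-style score: a weighted average of
# normalized application throughput and normalized infrastructure-operation
# throughput. Reference values and weight are assumptions for illustration.
def vmmark2_style_score(app_tput, infra_tput,
                        app_ref=100.0, infra_ref=10.0, app_weight=0.8):
    app_norm = app_tput / app_ref        # application workload component
    infra_norm = infra_tput / infra_ref  # infrastructure operations component
    return app_weight * app_norm + (1 - app_weight) * infra_norm

# Infrastructure load stays roughly constant as tiles are added within one
# cluster, so only the application term grows with tile count and the score
# increases sub-linearly.
print(vmmark2_style_score(app_tput=150.0, infra_tput=10.0))
```

This structure is also why scores in the results below do not scale linearly with tiles: the infrastructure term is fixed for a given cluster size.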

Configuration

All tests were conducted on a cluster of four Dell PowerEdge R310 hosts running VMware ESX 4.1 and managed by VMware vCenter Server 4.1. These are typical of today’s entry-level servers; each server contained a single quad-core Intel Xeon X3460 2.80 GHz processor (with hyperthreading enabled) and 32 GB of RAM. The servers also used two 1Gbit NICs for VM traffic and a third 1Gbit NIC for vMotion activity.

To determine the relative impact of different storage solutions on benchmark performance, runs were conducted on two existing storage arrays, an EMC CX500 and an EMC CX3-20. For details on the array configurations, refer to Table 1 below. VMs were stored on identically configured ‘application’ LUNs, while a designated ‘maintenance’ LUN was used for the Storage vMotion and VM deployment operations.

Table 1. Disk Array Configuration

Results

To measure the cluster's performance scaling under increasing load, we started by running one tile, then increased the number of tiles until the run failed to meet Quality of Service (QoS) requirements. As load is increased on the cluster, it is expected that application throughput, CPU utilization, and VMmark 2.0 scores will increase; the VMmark score increases as a function of throughput. By scaling out the number of tiles, we hoped to determine the maximum load our four-host cluster of entry-level servers could support. VMmark 2.0 scores will not scale linearly from one to three tiles because, in this configuration, the infrastructure operations load remained constant; infrastructure load increases primarily as a function of cluster size. Although it shows only a two-host cluster, this figure from Joshua Schnee’s recent blog article demonstrates the relationship between application throughput, infrastructure operations throughput, and the number of tiles more clearly. Secondly, we expected to see improved performance when running on the CX3-20 versus the CX500 because the CX3-20 has a larger number of disks per LUN as well as faster individual drives. Figure 1 below details the scale-out performance on the CX500 and the CX3-20 disk arrays using VMmark 2.0.

Figure 1. VMmark 2.0 Scale Out On a Four-Host Cluster

Both configurations saw improved throughput from one to three tiles but at four tiles they failed to meet at least one QoS requirement. These results show that a user wanting to maintain an average cluster CPU utilization of 50% on their four-host cluster could count on the cluster to support a two-tile load. Note that in this experiment, increased scores across tiles are largely due to increased workload throughput rather than an increased number of infrastructure operations.

As expected, runs using the CX3-20 showed consistently higher normalized scores than those on the CX500. Runs on the CX3-20 outperformed the CX500 by 15%, 14%, and 12% on the one-, two-, and three-tile runs, respectively. The increased performance of the CX3-20 over the CX500 was accompanied by approximately 10% higher CPU utilization, which indicated that the faster CX3-20 disks allowed the CPU to stay busier, increasing total throughput.

The results show that our cluster of entry-level servers with a modest disk array supported approximately 220 DVD Store 2.1 operations per second, 16 send-mail actions, and 235 Olio updates per second. A more robust disk array supported 270 DVD Store 2.1 operations per second, 16 send-mail actions, and 235 Olio updates per second with 20% lower latencies on average and correspondingly higher CPU utilization.

Note that this type of experiment is possible for the first time with VMmark 2.0; VMmark 1.x was limited to benchmarking a single host but the entry-level servers under test in this study would not have been able to support even a single VMmark 2.0 tile on an individual server. By spreading the load of one tile across a cluster of servers, however, it becomes possible to quantify the load that the cluster as a whole is capable of supporting. Benchmarking our cluster with VMmark 2.0 has shown that even modest clusters running vSphere can deliver an enormous amount of computing power to run complex multi-tier workloads.

Future Directions

In this study, we scaled out VMmark 2.0 on a four-host entry-level cluster to measure performance scaling and the maximum supported number of tiles. This put a much higher load on the cluster than would be typical for a small or medium business, so that such businesses can deploy their application workloads with confidence. An alternate experiment would be to run fewer tiles while measuring the performance of other enterprise-level features, such as VMware High Availability. This ability to benchmark the cloud in many different ways is one benefit of having a well-designed multi-host benchmark. Keep watching this blog for more interesting studies in benchmarking the cloud with VMmark 2.0.

Oracle Real Application Clusters (RAC) is used to run critical databases with stringent performance requirements. A series of tests was recently run in the VMware performance lab to determine how an Oracle RAC database performs when running on vSphere. The test results showed that the application performed within 11 to 13 percent of physical when running in a virtualized environment.

Configuration

Two servers were used for both the physical and virtual tests: two Dell PowerEdge R710s, each with two six-core Intel Xeon X5680 processors and 96GB of RAM, connected via Fibre Channel to a NetApp FAS6030 array. The servers were dual-booted between Red Hat Enterprise Linux 5.5 and vSphere ESXi 4.1. Each server was connected via three gigabit Ethernet NICs to a shared switch. One NIC was used for the public network and the other two were used for interconnect and cluster traffic.

The NetApp storage array had a total of 112 10K RPM 274GB Fibre Channel disks. Two 200GB LUNs, backed by a total of 80 disks, were used to create a data volume in Oracle ASM. Each data LUN was backed by a 40 disk RAID DP aggregate on the storage array. A 100GB log LUN was created on another volume that was backed by a 26 disk RAID DP aggregate. An additional small 2GB LUN was created to be used as the voting disk for the RAC cluster.

Each VM was configured with 32GB of RAM, three VMXNET3 virtual NICs, and a PVSCSI adapter for all the LUNs except the OS disk. In order for the VMs to be able to share disks with physical hosts, it was necessary to mount the disks as RDMs and put the virtual SCSI adapter into physical compatibility mode. Additionally, to achieve the best performance for the Oracle RAC interconnect, the VMXNET3 NICs were configured with ethernetX.intrmode = 1 in the vmx file. This option is a workaround for an ESX performance bug that is specific to RHEL 5.5 VMs and to extremely latency-sensitive workloads; it is no longer needed starting with ESX 4.1u1, where the bug is fixed.

A four-node Oracle RAC cluster was created with two virtual nodes and two physical nodes. The virtual nodes were hosted on a third server when the two servers used for testing were booted to the native RHEL environment. RHEL 5.5 x64 and Oracle 11gR2 were installed on all nodes. During tests, the two servers were booted either to native RHEL or to ESX for the physical or virtual tests, respectively, meaning that only the two native nodes or the two virtual nodes were powered on during a given test. The diagrams below show the same test environment when set up for the two-node physical or virtual test.

Physical Test Diagram:

Virtual Test Diagram:

Testing

The servers used in testing have a total of 12 physical cores and 24 logical threads with hyperthreading enabled. The maximum number of vCPUs per VM supported by ESXi 4.1 is eight, which made it necessary to limit the physical server to a smaller number of cores to enable a performance comparison. Using the BIOS settings of the server, hyperthreading was disabled and the number of cores was limited to two and four per socket. This resulted in four- and eight-core physical server configurations that were compared with VM configurations of four and eight vCPUs. Limiting the physical server configurations was done only to enable a direct performance comparison; it is clearly not how a system would normally be configured for performance.

Open source DVD Store 2.1 was used as the workload for the test. DVD Store is an OLTP database workload that simulates customers logging on, browsing, and purchasing DVDs from an online store. It includes database build scripts, load files, and driver programs. For these tests, the database driver was used to directly load the database without a need to have the Web tier installed. Using the new DVD Store 2.1 functionality, two custom-size databases of 50GB each with a 12GB SGA were created as two different instances named DS2 and DS2B. Both instances were running on both nodes of the cluster and were accessed equally on each node.

Results

Running an equal amount of load against each instance on each node was done with both the four-CPU and eight-CPU test cases. The DS2 and DS2B instances spanned all nodes and were actively used on all nodes. An equal number of threads was connected for each instance on each node. The amount of work was scaled up with the number of processors: twice as many DVD Store driver threads were used in the eight-CPU case as in the four-CPU case. For example, a total of 40 threads were running against node one in the four-CPU test, with 20 accessing DS2 and 20 accessing DS2B; another 40 threads were accessing DS2 and DS2B on node two at the same time. CPU utilization of the physical hosts and VMs was above 95% in all tests. Results are reported in terms of Orders Per Minute (OPM) and Average Response Time (RT) in milliseconds.
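The thread layout described above can be sketched as a small calculation. This is our own illustration (the function and its structure are not part of DVD Store): 20 driver threads per instance per node in the four-CPU case, doubled in the eight-CPU case.

```python
# Sketch of the DVD Store driver-thread layout: load is split evenly across
# two RAC nodes and two database instances (DS2, DS2B), and scales with the
# CPU count. The "20 threads per instance per node at 4 CPUs" baseline is
# taken from the test description; the helper itself is illustrative.
def thread_layout(cpus):
    """Map (node, instance) -> driver threads for a given per-node CPU count."""
    per_instance = 20 * (cpus // 4)   # 20 threads at 4 CPUs, 40 at 8 CPUs
    instances = ("DS2", "DS2B")
    return {(node, inst): per_instance
            for node in (1, 2)
            for inst in instances}

layout = thread_layout(4)
print(sum(layout.values()))   # total driver threads in the four-CPU test
```

At four CPUs this gives 80 threads cluster-wide (40 per node), and doubling to eight CPUs doubles the totals, matching the scaling rule used in the tests.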

In both the OPM and RT measurements, the virtual RAC performance was within 11 to 13 percent of the physical RAC performance. In an intensive test running on Oracle RAC, the CPU, disk, and network were heavily utilized, but virtual performance was close to native performance. This result removes a barrier from considering virtualizing one of the more performance-intensive tier-one applications in the datacenter.

This is the third part in an ongoing series on exploring performance issues of virtualizing HPC applications. In the first part, we described the setup and considered pure memory bandwidth using Stream. The second part considered the effect of network latency in a scientific application (NAMD) that ran across several virtual machines. Here we look at two of the tests in the HPC Challenge (HPCC) suite: StarRandomAccess and HPL. While certainly not spanning all possible memory access patterns found in HPC apps, these two tests are very different from each other and should help to give bounds on virtualization overhead related to these patterns.

Virtualization adds indirection to memory page table mappings: in addition to the logical-to-physical mappings maintained by the OS (either native or in a VM), the hypervisor must maintain guest physical-to-machine mappings. A straightforward implementation of both mappings in software would result in enormous overhead. Prior to the introduction of hardware MMU features in Intel (EPT) and AMD (RVI) processors, the performance problem was solved through the use of “shadow” page tables. These collapsed the two mappings to one so that the processor TLB cache could be used efficiently; however, updating shadow page tables is expensive. With EPT and RVI, both mappings are cached in the TLB, eliminating the need for shadow page tables. The trade-off is that a TLB miss can be expensive: the cost is not just double the cost of a miss in a conventional TLB; it is the square of the number of steps in the TLB page walk. This cost can be reduced by using large memory pages (2MB in x86_64), which typically need four steps in the page walk, rather than small pages (4KB), which need five. This overview is highly simplified; see the RVI and EPT performance whitepapers for much more detail about MMU virtualization, as well as results from several benchmarks representing enterprise applications. Here we extend the EPT paper to HPC apps running on a current version of vSphere.
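Under this simplified model, the trade-off can be made concrete with a toy calculation of our own. The step counts are the ones quoted above; real miss costs depend on caching of intermediate page-table entries, so this is a worst-case bound, not a measurement.

```python
# Toy TLB-miss cost model from the simplified description above:
# - shadow page tables keep a miss at the native walk length;
# - with a hardware MMU (EPT/RVI), each step of the guest walk may itself
#   require a host-side walk, so the worst-case cost grows as the square
#   of the walk length.
def native_miss_cost(walk_steps):
    return walk_steps

def nested_miss_cost(walk_steps):
    return walk_steps * walk_steps   # worst case: a full walk per step

for page_size, steps in (("small (4KB)", 5), ("large (2MB)", 4)):
    print(f"{page_size}: native/shadow {native_miss_cost(steps)} steps, "
          f"nested (EPT/RVI) {nested_miss_cost(steps)} steps")
```

This is why large pages matter so much more under EPT/RVI than natively: trimming one step from the walk removes it from a squared term rather than a linear one.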

Although there are certainly exceptions, two memory characteristics are common to HPC applications: a general lack of page table manipulation, and heavy use of memory itself. Memory is allocated once (along with the associated page tables) and used for a long time. This use can either be dominated by sequential accesses (running through an array) or by random accesses. The latter puts more stress on the TLB. Common enterprise apps are often the opposite: much heavier page table activity but lighter memory usage. Thus HPC apps do not benefit much from the elimination of shadow page tables (this alone made many enterprise apps run close to native performance, as shown in the above papers), but may be sensitive to the costs of TLB misses.

These points are illustrated by two tests from the HPCC suite. StarRandomAccess is a relatively simple microbenchmark that continuously accesses random memory addresses. HPL is a standard floating-point linear algebra benchmark that accesses memory more sequentially. For these tests, version 1.4.1 of HPCC was used on RHEL 5.5 x86_64. Hyper-threading was disabled in the BIOS and all work was limited to a single socket (automatically in the virtual cases and forced with numactl for native). In this way, the effects of differences between native and virtual in how HT and NUMA are treated were eliminated. For virtual, a 4-vCPU VM with 22GB of memory was used on a lab version (build 294208) of ESX 4.1. The relevant HPCC parameters are N=40000, NB=100, P=Q=2, and np=4. These values ensure that all the CPU resources and nearly all the available memory of one socket were consumed, thereby minimizing memory cache effects. The hardware is the same as in the first part of this series; in particular, Xeon X5570 processors with EPT are used.
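A quick sanity check on the problem size: the dominant allocation in HPL is the N×N double-precision matrix, so its footprint follows directly from N.

```python
# HPL memory footprint for the parameters above: an N x N matrix of
# 8-byte doubles. (Workspace and buffers add a little on top of this.)
N = 40000
matrix_gb = 8 * N * N / 1e9   # 8 bytes per double-precision element
print(f"HPL matrix for N={N}: {matrix_gb:.1f} GB")
```

At N=40000 the matrix alone is 12.8 GB, which fits within the 22GB VM while consuming the bulk of the memory actually available to the benchmark on one socket.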

Throughput results for StarRandomAccess are shown in Table 1. The metric GUP/s is billions of updates per second, a measure of memory bandwidth. Small/large pages refers to memory allocation in the OS and application. For virtual, ESX always backs guest memory with large pages if possible (as it is here). The default case (EPT enabled, small pages in the guest) achieves only about 85% of native throughput. For an application with essentially no I/O or privileged instructions that require special handling by the hypervisor, this is surprisingly poor at first glance. However, it is a direct result of the hardware architecture needed to avoid shadow page tables. Disabling EPT results in near-native performance because the TLB costs are then essentially the same as for native and the software MMU costs are minimal. TLB costs are still substantial, as seen by the effect of using large pages in native and the guest OS: more than doubling the performance. With large pages the virtualization overhead is reduced to manageable levels, although there is still a 2% benefit from disabling EPT.

Table 1. StarRandomAccess throughput, GUP/s (ratio to native)

              Native     Virtual, EPT on    Virtual, EPT off
Small pages   0.01842    0.01561 (0.848)    0.01811 (0.983)
Large pages   0.03956    0.03805 (0.962)    0.03900 (0.986)

Table 2 shows throughput results for HPL. The metric Gflop/s is billions of floating-point operations per second. Memory is largely accessed sequentially, greatly reducing the stress on the TLB and the effect of large pages. Large pages improve virtual performance by 4%, but improve native performance by less than 2%. Disabling EPT improves virtual performance by only 0.5%. It is not clear why virtual is slightly faster than native in the large pages case; this will be investigated further.

Table 2. HPL throughput, Gflop/s (ratio to native)

              Native    Virtual, EPT on    Virtual, EPT off
Small pages   37.04     36.04 (0.973)      36.22 (0.978)
Large pages   37.74     38.24 (1.013)      38.42 (1.018)

While hardware MMU virtualization with Intel EPT and AMD RVI has been a huge benefit for many applications, these test results support the expectation that the benefit for HPC apps is smaller, and that the technology can even increase overhead in some cases. However, the example shown here where the latter is true is a microbenchmark that focuses on the worst case for this technology. Most HPC apps will not have so many random memory accesses, so the effect of EPT is likely to be small.