Enhancement of the resxtop utility— vSphere 4.0 U2 includes an enhancement of the performance monitoring utility, resxtop. The resxtop utility now provides visibility into the performance of NFS datastores in that it displays the following statistics for NFS datastores: Reads/s, writes/s, MBreads/s, MBwrtn/s, cmds/s, GAVG/s (guest latency).

Additional Guest Operating System Support— ESX/ESXi 4.0 Update 2 adds support for Ubuntu 10.04. For a complete list of supported guest operating systems with this release, see the VMware Compatibility Guide.

Resolved Issues– In addition, this release delivers a number of bug fixes that have been documented in theResolved Issues section.

In September 2009 we predicted that average 8GB DIMM prices (DDR2 and DDR3) would reach $565/stick by year end (with DDR3 being higher than DDR2) and at now we’re seeing the reversal of fortunes for DDR2. At year end, the average price for benchmark DDR2/DDR3 was $591 retail, with promotional pricing pushing that below$550 as predicted. Today, we’re seeing DDR3 begin to overtake DDR2 in the 8GB ECC category, dropping below $510/stick, while DDR2 climbs to $550/stick (promotional, on $625/stick retail.)

In 4GB ECC configurations, DDR2 enjoys only a slight retail advantage (13%) while promotional pricing (likely due to inventory reduction initiatives) are providing a bit better value short term. However, the price gap is only 1/2 the power gap, with DDR3 delivering a greater than 35% reduction in power over its DDR2 equivalent (about $1.25/year/stick at $0.10/kWh). The honeymoon is almost over for DDR2.

SOLORI’s Take: With strong DDR3 demand and short-falls in DDR2 supply (according to DRAMeXchange), the only thing keeping DDR3 prices above DDR2 at this point is demand and inventory. As Q2/2010 introduces a rush of new workstation and server products based on DDR3 systems, the DRAM production ramp will eventually stabilize demand somewhere towards the end of Q3/2010. Meanwhile, technology companies like VMware, Microsoft, Intel and AMD are betting on new infrastructure spending on operating systems, virtualization and hardware refresh to drive-up economic market factors. If the global economic crisis deepens, this anticipated spending spree could be short-lived and its impact shallow.

1. Unlike previous launches, AMD is planning to have “boots on the ground” this time with vendors and supply alignments in place to be able to ship product against anticipated demand. While it is now well known that Magny-Cours has been shipping to certain OEM and institutional customers for some time, our guess is that 2000/8000 series 6-core HE series have been hard to come by for a reason – and that reason has 12-cores not 6;

Obviously the big topic was the new AMD Opteron™ 6000 Series platforms that will be launching very soon. We had plenty of party favors – everyone walked home with a new 12-core AMD Opteron 6100 Series processor, code name “Magny-Cours”.

– Fruehe on AMD’s pending launch

2. Timing is right! With Intel’s Nehalem-EX 8-core and Core i7/Nehalem-EP 6-core being demoed about, there is more pressure than ever for AMD to step-up with a competitive player. Likewise, DDR3 is neck-and-neck with DDR2 in affordability and way ahead with low-power variants that more than compensate for power-hungry CPU profiles. AMD needs to deliver mainstream performance in 24-cores and 96GB DRAM within the power envelope of 12-cores and 64GB to be a player. With 1.35V DDR3 parts paired to better power efficiency in the 6100, this could be a possibility;

We demonstrated a benchmark running on two servers, one based on the Six-Core AMD Opteron processor codenamed “Istanbul,” and one 12-core “Magny-Cours”-based platform. You would have seen that the power consumption for the two is about the same at each utilization level. However, there is one area where there was a big difference – at idle. The “Magny-Cours”-based platform was actually lower!

– AMD’s Fruehe on Opteron 6100’s power consumption

3. Performance in scaled virtualization matters – raw single-threaded performance is secondary. In virtual architectures, clusters of systems must perform as one in an orchestrated ballet of performance and efficiency seeking. For some clusters, dynamic load migration to favour power consumption is a priority – relying on solid power efficiency under high load conditions. For other clusters, workload is spread to maximize performance available to key workloads – relying on solid power efficiency under generally light loads. For many environments, multi-generational hardware will be commonplace and AMD is counting on its wider range of migration compatibility to hold-on to customers that have not yet jumped ship for Intel’s Nehalem-EP/EX.

“We demonstrated Microsoft Hyper-V running on two different servers, one based on a Quad-Core AMD Opteron processor codenamed “Barcelona” (circa 2007) and a brand new “Magny-Cours”-based system. …companies might have problems moving a 2010 VM to a 2007 server without limiting the VM features. (For example, in order to move a virtual machine from an Intel “Nehalem”-based system to a “Harpertown” [or earlier] platform, the customer must not enable nested paging in the “Nehalem” virtual machine, which can reduce the overall performance of the VM.)”

SOLORI’s Take: It would appear that Magny-Cours has more under the MCM hood than a pair of Istanbul processors (as previously charged.) To manage better idle performance and constant power performance in spite of a two-to-one core ratio and similar 45nm process, AMD’s process and feature set must include much better power management as well, however, core speed is not one of them. With the standard “Maranello” 6100 series coming in at 1.9, 2.1 and 2.2 GHz with an HE variant at 1.7GHz and SE version running at 2.3GHz, finding parity in an existing cluster of 2.4, 2.6 and 2.8 GHz six-core servers may be difficult. Still, Maranello/G34 CPUs will be at 85, 115 and 140W TDP.

That said, Fruehe has a point on virtualization platform deployment and processor speed: it is not necessary to trim-out an entire farm with top-bin parts – only a small portion of the cluster needs to operate with top-band performance marks. The rest of the market is looking for predictable performance, scalability and power efficiency per thread. While SMT makes a good run at efficiency per thread, it does so at the expense of predictable performance. Here’s hoping that AMD’s C1E (or whatever their power-sipping special sauce will be called) does nothing to interfere with predictable performance…

As we’ve said before, memory capacity and bandwidth (as a function of system power and core/thread capacity) are key factors in a CPU’s viability in a virtualization stack. With 12 DIMM slots per CPU (3-DPC, 4-channel), AMD inherits an enviable position over Intel’s current line-up of 2P solutions by being able to offer 50% more memory per cluster node without resorting to 8GB DIMMs. That said, it’s up to OEM’s to deliver rack server designs that feature 12 DIMM per CPU and not hold-back with only 8 DIMM variants. In the blade and 1/2-size market, cramming 8 DIMM per board (effectively 1-DPC for 2P Magny-Cours) can be a challenge let alone 24 DIMMs! Perhaps we’ll see single-socket blades with 12 DIMMs (12-cores, 48/96GB DDR3) or 2P blades with only one 12 DIMM memory bank (one-hop, NUMA) in the short term.

SOLORI’s 2nd Take: It makes sense that AMD would showcase their leading OEM partners because their success will be determined on what those OEM’s bring to market. With VDI finally poised to make a big market impact, we’d expect to see the first systems delivered with 2-DPC configurations (8 DIMM per CPU, economically 2.5GB/core) which could meet both VDI and HPC segments equally. However, with Window7 gaining momentum, what’s good for HPC might not cut it for long in the VDI segment where expectations of 4-6 VM’s per core at 1-2GB/VM are mounting.

Besides the launch date, what wasn’t said was who these OEM’s are and how many systems they’ll be delivering at launch. Whoever they are, they need to be (1) financially stronger than AMD, (2) in an aggressive marketing position with respect to today’s key growth market (server and desktop virtualization), and (3) willing to put AMD-based products “above the fold” on their marketing and e-commerce initiatives. AMD needs to “represent” in a big way before a tide of new technologies makes them yesterday’s news. We have high hopes that AMD’s recent “perfect” execution streak will continue.

Fujitsu’s RX300 S5 rack server takes the top spot in VMware’s VMmark for 8-core systems today with a score of 25.16@17 tiles. Loaded with two of Intel’s top-bin 3.33GHz, 130W Nehalem-EP processors (W5590, turbo to 3.6GHz per core) and 96GB of DDR3-1333 R-ECC memory, the RX300 bested the former champ – the HP ProLiant BL490c G6 blade – by only about 2.5%.

With 17 tiles and 102 virtual machines on a single 2U box, the RX300 S5 demonstrates precisely how well vSphere scales on today’s x86 commodity platforms. It also appears to demonstrate both the value and the limits of Intel’s “turbo mode” in its top-bin Nehalem-EP processors – especially in the virtualization use case – we’ll get to that later. In any case, the resulting equation is:

More * (Threads + Memory + I/O) = Dense Virtualization

We could have added “higher execution rates” to that equation, however, virtualization is a scale-out applications where threads, memory pool and I/O capabilities dominate the capacity equation – not clock speed. Adding 50% more clock provides less virtualization gains than adding 50% more cores, and reducing memory and context latency likewise provides better gains that simply upping the clock speed. That’s why a dual quad-core Nehalem 2.6GHz processor will crush a quad dual-core 3.5GHz (ill-fated) Tulsa.

Speaking of Tulsa, unlike Tulsa’s rather anaemic first-generation hyper-threading, Intel’s improved SMT in Nehalem “virtually” adds more core “power” to the Xeon by contributing up to 100% more thread capacity. This is demonstrated by Nehalem-EP’s 2 tiles per core contributions to VMmark where AMD’s Istanbul quad-core provides only 1 tile per core. But exactly what is a VMmark tile and how does core versus thread play into the result?

The Illustrated VMmark "Tile" Load

As you can see, a “VMmark Tile” – or just “tile” for short – is composed of 6 virtual machines, half running Windows, half running SUSE Linux. Likewise, half of the tiles are running in 64-bit mode while the other half runs in 32-bit mode. As a whole, the tile is composed of 10 virtual CPUs, 5GB of RAM and 62GB of storage. Looking at how the parts contribute to the whole, the tile is relatively balanced:

Operating System / Mode

32-bit

64-bit

Memory

vCPU

Disk

Windows Server 2003 R2

67%

33%

45%

50%

58%

SUSE Linux Enterprise Server 10 SP2

33%

67%

55%

50%

42%

32-bit

50%

N/A

30%

40%

58%

64-bit

N/A

50%

70%

60%

42%

If we stop here and accept that today’s best x86 processors from AMD and Intel are capable of providing 1 tile for each thread, we can look at the thread count and calculate the number of tiles and resulting memory requirement. While that sounds like a good “rule of thumb” approach, it ignores specific use case scenarios where synthetic threads (like HT and SMT) do not scale linearly like core threads do where SMT accounts for only about 12% gains over single-threaded core, clock-for-clock. For this reason, processors from AMD and Intel in 2010 will feature more cores – 12 for AMD and 8 for Intel in their Magny-Cours and Nehalem-EX (aka “Beckton”), respectively.

Learning from the Master

If we want to gather some information about a specific field, we consult an expert, right? Judging from the results, Fujitsu’s latest dual-processor entry has definitely earned the title ‘Master of VMmark” in 2P systems – at least for now. So instead of the usual VMmark $/VM analysis (which are well established for recent VMmark entries), let’s look at the solution profile and try to glean some nuggets to take back to our data centers.

It’s Not About Raw Speed

First, we’ve noted that the processor used is not Intel’s standard “rack server” fare, but the more workstation oriented W-series Nehalem at 130W TDP. With “turbo mode” active, this CPU is capable of driving the 3.33GHz core – on a per-core basis – up to 3.6GHz. Since we’re seeing only a 2.5% improvement in overall score versus the ProLiant blade at 2.93GHz, we can extrapolate that the 2.93GHz X5570 Xeon is spending a lot of time at 3.33GHz – its “turbo” speed – while the power-hungry W5590 spends little time at 3.6GHz. How can we say this? Looking at the tile ratio as a function of the clock speed.

We know that the X5570 can run up to 3.33GHz, per core, according to thermal conditions on the chip. With proper cooling, this could mean up to 100% of the time (sorry, Google). Assuming for a moment that this is the case in the HP test environment (and there is sufficient cause to think so) then the ratio of the tile score to tile count and CPU frequency is 0.433 (24.54/17/3.33). If we examine the same ratio for the W5590, assuming the clock speed of 3.33GHz, we get 0.444 – a difference of 2.5%, or the contribution of “turbo” in the W5590. Likewise, if you back-figure the “apparent speed” of the X5570 using the ratio of the clock-locked W5590, you arrive at 3.25GHz for the W5570 (an 11% gain over base clock). In either case, it is clear that “turbo” is a better value at the low-end of the Nehalem spectrum as there isn’t enough thermal headroom for it to work well for the W-series.

VMmark Equals Meager Network Use

Second, we’re not seeing “fancy” networking tricks out of VMmark submissions. In the past, we’ve commented on the use of “consumer grade” switches in VMmark tests. For this reason, we can consider VMmark’s I/O dependency as related almost exclusively to storage. With respect to networking, the Fujitsu team simply interfaced three 1Gbps network adapter ports to the internal switch of the blade enclosure used to run the client-side load suite and ran with the test. Here’s what that looks like:

Note that the network interfaces used for the VMmark trial are not from the on-board i82575EB network controller but from the PCI-Express quad-port adapter using its older cousin – the i82571EB. What is key here is that VMmark is tied to network performance issues, and it is more likely that additional network ports might increase the likelihood of IRQ sharing and reduced performance more so than the “optimization” of network flows.

Keeping Storage “Simple”

Third, Fujitsu’s approach to storage is elegantly simple: several “inexpensive” arrays with intelligent LUN allocation. For this, Fujistu employed eight of its ETERNUS DX80 Disk Storage Systems with 7 additional storage shelves for a total of 172 working disks and 23 LUNs. For simplicity, Fujistu used a pair of 8Gbps FC ports to feed ESX and at least one port per DX80 – all connected through a Brocade 5100 fabric switch. The result looked something like this:

And yes, the ESX server is configured to boot from SAN, using no locally attached storage. Note that the virtual machine configuration files, VM swap and ESX boot/swap are contained in a separate DX80 system. This “non-default” approach allows the working VMDKs of the virtual machines to be isolated – from a storage perspective – from the swap file overhead, about 5GB per tile. Again, this is a benchmark scenario, not an enterprise deployment, so trade-offs are in favour of performance, not CAPEX or OPEX.

Even if the DX80 solution falls into the $1K/TB range, to say that this approach to storage is “economic” requires a deeper look. At 33 rack units for the solution – including the FC switch but not including the blade chassis – this configuration has a hefty datacenter footprint. In contrast to the old-school server/blade approach, 1 rack at 3 servers per U is a huge savings over the 2 racks of blades or 3 racks of 1U rack servers. Had each of those servers of blades had a mirror pair, we’d be talking about 200+ disks spinning in those racks versus the 172 disks in the ETERNUS arrays, so that still represents a savings of 15.7% in storage-related power/space.

When will storage catch up?

Compared to a 98% reduction in network ports, a 30-80% reduction server/storage CAPEX (based on $1K/TB SAN), a 50-75% reduction in overall datacenter footprint, why is a 15% reduction in datacenter storage footprint acceptable? After all, storage – in the Fujitsu VMmark case – now represents 94% of the datacenter footprint. Even if the load were less aggressively spread across five ESX servers (a conservative 20:1 loading), the amount of space taken by storage only falls to 75%.

How can storage catch up to virtualization densities. First, with 2.5″ SAS drives, a bank of 172 disks can be made to occupy only 16U with very strong performance. This drops storage to only 60% of the datacenter footprint – 10U for hypervisor, 16U for storage, 26U total for this example. Moving from 3.5″ drives to 2.5″ drives takes care of the physical scaling issue with acceptable returns, but results in only minimal gains in terms of power savings.

Saving power in storage platforms is not going to be achieved by simply shrinking disk drives – shrinking the NUMBER of disks required per “effective” LUN is what’s necessary to overcome the power demands of modern, high-performance storage. This is where non-traditional technology like FLASH/SSD is being applied to improve performance while utilizing fewer disks and proportionately less power. For example, instead of dedicating disks on a per LUN basis, carving LUNs out of disk pools accelerated by FLASH (a hybrid storage pool) can result in a 30-40% reduction in disk count – when applied properly – and that means 30-40% reduction in datacenter space and power utilization.

Lessons Learned

Here are our “take aways” from the Fujitsu VMmark case:

1) Top-bin performance is at the losing end of diminishing returns. Unless your budget can accommodate this fact, purchasing decisions about virtualization compute platforms need to be aligned with $/VM within an acceptable performance envelope. When shopping CPU, make sure the top-bin’s “little brother” has the same architecture and feature set and go with the unit priced for the mainstream. (Don’t forget to factor memory density into the equation…) Regardless, try to stick within a $190-280/VM equipment budget for your hypervisor hardware and shoot for a 20-to-1 consolidation ratio (that’s at least $3,800-5,600 per server/blade).

2) While networking is not important to VMmark, this is likely not the case for most enterprise applications. Therefore, VMmark is not a good comparison case for your network-heavy applications. Also, adding more network ports increases capacity and redundancy but does so at the risk of IRQ-sharing (ESX, not ESXi) problems, not to mention the additional cost/number of network switching ports. This is where we think 10GE will significantly change the equation in 2010. Remember to add up the total number of in use ports – including out-of-band management – when factoring in switch density. For net new instalments, look for a switch that provides 10GE/SR or 10GE/CX4 options and go with !0GE/SR if power savings are driving your solution.

3) Storage should be simple, easy to manage, cheap (relatively speaking), dense and low-power. To meet these goals, look for storage technologies that utilize FLASH memory, tiered spindle types, smart block caching and other approaches to limit spindle count without sacrificing performance. Remember to factor in at least the cost of DAS when approximating your storage budget – about $150/VM in simple consolidation cases and $750/VM for more mission critical applications (that’s a range of $9,000-45,000 for a 3-server virtualization stack). The economies in managed storage come chiefly from the administration of the storage, but try to identify storage solutions that reduce datacenter footprint including both rack space and power consumption. Here’s where offerings from Sun and NexentaStor are showing real gains.

We’d like to see VMware update VMmark to include system power specifications so we can better gage – from the sidelines – what solution stack(s) perform according to our needs. VMmark served its purpose by giving the community a standard from which different platforms could be compared in terms of the resultant performance. With the world’s eyes on power consumption and the ecological impact of datacenter choices, adding a “power utilization component” to the “server-side” of the VMmark test would not be that significant of a “tweak.” Here’s how we think it can be done:

Require power consumption of the server/VMmark related components be recorded, including:

Power delivered to the test harness platforms, client load machines, etc. can be ignored;

Power measurements should be recorded at the following times:

All equipment off (validation check);

Start-up;

Single tile load;

100% tile capacity;

75% tile capacity;

50% tile capacity;

Power measurements should be recorded using a time-power data-logger with readings recorded as 5-minute averages;

Notations should be made concerning “cache warm-up” intervals, if applicable, where “cache optimized” storage is used.

Why is this important? In the wake of the VCE announcement, solution stacks like VCE need to be measured against each other in an easy to “consume” way. Is VCE the best platform versus a component solution provided by your local VMware integrator? Given that the differentiated VCE components are chiefly UCS, Cisco switching and EMC storage, it will be helpful to have a testing platform that can better differentiate “packaged solutions” instead of uncorrelated vendor “propaganda.”

Let us know what your thoughts are on the subject, either on Twitter or on our blog…

KVM is Red Hat’s new hypervisor that leverages the Linux kernel to accelerate support for hardware and capabilities. It was Red Hat and AMD that first demonstrated live migration between AMD and Intel-based hypervisors using KVM late last year – then somewhat of a “Holy Grail” of hypervisor feats. With nearly a year of improvements and integration into their Red Hat Enterprise Server and Fedora “free and open source” offerings, Red Hat is almost ready to strike-out in a commercially viable way.

Microsoft now officially supports the following Red Hat guest operating systems in Hyper-V:

The goal of the announcement and associated agreements between Red Hat and Microsoft was to enable a fully supported virtualization infrastructure for enterprises with Red Hat and Microsoft assets. As such, Microsoft and Red Hat are committed to supporting their respective products whether the hypervisor environment is all Red Hat, all Hyper-V or totally heterogeneous – mixing Red Hat KVM and Microsoft Hyper-V as necessary.

“With this announcement, Red Hat and Microsoft are ensuring their customers can resolve any issues related to Microsoft Windows on Red Hat Enterprise Virtualization, and Red Hat Enterprise Linux operating on Microsoft Hyper-V, regardless of whether the problem is related to the operating system or the virtualization implementation.”

Many in the industry cite Red Hat’s adoption of KVM as a step backwards [from Xen] requiring the re-development of significant amount of support code. However, Red Hat’s use of libvirt as a common management API has allowed the change to happen much more rapidly that critics assumptions had allowed. At Red Hat Summit 2009, key Red Hat officials were keen to point out just how tasty their “dog food” is:

Tim Burke, Red Hat’s vice president of engineering, said that Red Hat already runs much of its own infrastructure, including mail servers and file servers, on KVM, and is working hard to promote KVM with key original equipment manufacturer partners and vendors.

And Red Hat CTO Brian Stevens pointed out in his Summit keynote that with KVM inside the Linux kernel, Red Hat customers will no longer have to choose which applications to virtualize; virtualization will be everywhere and the tools to manage applications will be the same as those used to manage virtualized guests.

For system integrators and virtual infrastructure practices, Red Hat’s play is creating opportunities for differentiation. With a focus on light-weight, high-performance, I/O-driven virtualization applications and no need to support years-old established processes that are dragging on Xen and VMware, KVM stands to leap-frog the competition in the short term.

SOLORI’s Take: This news is good for all Red Hat and Microsoft customers alike. Indeed, it shows that Microsoft realizes that its licenses are being sold into the enterprise whether or not they run on physical hardware. With 20+:1 consolidation ratios now common, that represents a 5:1 license to hardware sale for Microsoft, regardless of the hypervisor. With KVM’s demonstrated CPU agnostic migration capabilities, this opens the door to an even more diverse virtualization infrastructure than ever before.

On the Red Hat side, it demonstrates how rapidly Red Hat has matured its offering following the shift to KVM earlier this year. While KVM is new to Red Hat, it is not new to Linux or aggressive early adopters since being added to the Linux kernel as of 2.6.20 back in September of 2007. With support already in active projects like ConVirt (VM life cycle management), OpenNebula (cloud administration tools), Ganeti, and Enomaly’s Elastic Computing Platform, the game of catch-up for Red Hat and KVM is very likely to be a short one.

Most importantly for virtualization systems architects is how the vCPU scheduling affects “measured” performance. The telling piece comes from the difference in comparison results where vCPU scheduling is equalized:

AnandTech's Quad Sockets v. Dual Sockets Comparison. Oct 6, 2009.

When comparing the results, De Gelas hits on the I/O factor which chiefly separates VMmark from vAPUS:

The result is that VMmark with its huge number of VMs per server (up to 102 VMs!) places a lot of stress on the I/O systems. The reason for the Intel Xeon X5570’s crushing VMmark results cannot be explained by the processor architecture alone. One possible explanation may be that the VMDq (multiple queues and offloading of the virtual switch to the hardware) implementation of the Intel NICs is better than the Broadcom NICs that are typically found in the AMD based servers.

This is yet another issue that VMware architects struggle with in complex deployments. The latency in “Dunnington” is a huge contributor to its downfall and why the Penryn architecture was a dead-end. Combined with 8 additional threads in the 2P form factor, Nehalem delivers twice the number of hardware execution contexts than Shanghai, resulting in significant efficiencies for Nehalem where small working data sets are involved.

When larger sets are used – as in vAPUS – the Istanbul’s additional cores allows it to close the gap to within the clock speed difference of Nehalem (about 12%). In contrast to VMmark which implies a 3:2 advantage to Nehalem, the vAPUS results suggest a closer performance gap in more aggressive virtualization use cases.

AMD’s John Fruehe outlines AMD’s market approach for the new chipsets in his “AMD at Work” blog today. Based on the same basic logic/silicon, the SR5690, SR5670 and SR5650 all deliver PCI-E 2.0 and HT3.0 but at differing levels of power consumption and PCI Express lanes to their respective platforms. Paired with appropriate “power and speed” Opteron variant, these platforms offer system designers, virtualization architects and HPC vendors greater control over price-performance and power-performance constraints that drive their respective environments.

Popular Posts

In Medio Stat Veritas

SOLORI's Take and Quick Take posts express my personal opinion unless explicitly attributed to other sources. Where possible, supporting facts are presented to properly frame and ground these opinions, however they are presented "AS-IS" without regard to warranty or promise: expressed or implied.

Comments are open to all registered users and may be edited for decorum. Spam is deleted with prejudice.