“moving a virtual machine over a network from one VirtualBox host to another, while the virtual machine is running. This works regardless of the host operating system that is running on the hosts: you can teleport virtual machines between Solaris and Mac hosts, for example.”

Teleportation operates like an in-place replacement of a VM’s facilities, requiring that the “target” host has a virtual machine in VirtualBox with exactly the same hardware settings as the “source” VM. The source and target VMs must also share the same storage, using either the same VirtualBox-accessible iSCSI targets or some other network storage (NFS or SMB/CIFS), and must have no snapshots.

“The hosts must have fairly similar CPUs. While VirtualBox can simulate some CPU features to a degree, this does not always work. Teleporting between Intel and AMD CPUs will probably fail with an error message.”

The recipe for teleportation begins on the target and is given in an example, leveraging VirtualBox’s VBoxManage command syntax:

VBoxManage modifyvm <targetvmname> --teleporter on --teleporterport <port>

On the source, the running virtual machine is modified according to the following:

VBoxManage controlvm <sourcevmname> teleport --host <targethost> --port <port>

For testing, same-host teleportation is allowed (source and target both on loopback). Obviously, a set-up and clean-up script would be needed to copy the settings to a target location, manage the teleport and clean up the former VM configuration made obsolete by the teleportation. In the case of an error, the running VM stays running on the source host, and the target VM fails to initialize.
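As a rough sketch of how the two steps could be scripted, the Python below only builds the two VBoxManage command lines quoted above and does not run them. The VM name "demo-vm", the host "localhost" and port 6000 are hypothetical values chosen for illustration, and the helper function names are our own.

```python
def target_listen_cmd(vm_name, port):
    """Command for the target host: configure the VM to wait for a teleport."""
    return ["VBoxManage", "modifyvm", vm_name,
            "--teleporter", "on", "--teleporterport", str(port)]


def source_teleport_cmd(vm_name, target_host, port):
    """Command for the source host: send the running VM to the target."""
    return ["VBoxManage", "controlvm", vm_name,
            "teleport", "--host", target_host, "--port", str(port)]


# Hypothetical same-host (loopback) test values.
print(" ".join(target_listen_cmd("demo-vm", 6000)))
print(" ".join(source_teleport_cmd("demo-vm", "localhost", 6000)))
```

A real wrapper would also have to copy the VM settings to the target and clean up afterwards, which this sketch deliberately leaves out.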

SOLORI’s Take: This represents the writing on the wall for VMware and vMotion. Perhaps the shift from VMotion to vMotion telegraphs the reduced value VMware already sees in the “now standard” feature. Adding vMotion to vSphere Essentials and Essentials Plus would garner a lot of adoption from the SMB market that is moving quickly to Hyper-V over Citrix and VMware. With VirtualBox’s obvious play in desktop virtualization – where minimalist live migration features would be less of a burden – VMware’s market could quickly become divided in 2010 with some crafty third-party integration along with open VDI. It’s a ways off, but the potential is there…

Given the recent release of VMware View 4.0, we thought it would be handy to showcase the current state of the View “certified” HCL for “hardware” thin clients. As of November 30, 2009, the following hardware thin clients are “officially” on VMware’s HCL:

Devices not on this list may “work” with VMware View 4.0 but may not support all of View 4’s features. VMware addresses certified and compatible as follows:

Certified and Compatible Thin Clients:

Certified – A thin client device listed against a particular VMware View release in the Certified For column has been tested by the thin client manufacturer against that specific VMware View release and includes a minimum set of features supported in that VMware View version.

Compatible – A thin client device certified against a specific VMware View release is compatible with previous and subsequent VMware View releases according to the compatibility guarantees published as part of that specific VMware View release (typically two major releases in both directions). However, a compatible thin client may not include all of the features of the newer VMware View release. Please refer to your VMware View Client documentation to determine which features are included.

Unlisted thin clients may embed VMware’s “software client” along with a more general purpose operating system to deliver View 4 compatibility. Support for this class of device may be restricted to the device vendor only. Likewise, thin clients that are compatible with earlier versions of View may support only a subset of View 4’s features. When in doubt, contact the thin client manufacturer before deploying with View 4.

NEC’s venerable Express5800/A1160 is back at the top of the VMmark chart, this time establishing the brand-new 64-core category with a score of 48.23@32 tiles – surpassing its 48-core 3rd place posting by over 30%. NEC’s new 16-socket, 64-core, 256GB “Dunnington” X7460 Xeon-based score represents a big jump in performance over its predecessor with a per tile ratio of 1.507 – up 6% from the 48-core ratio of 1.419.

At $500/core, NEC’s gambit may represent an expensive form of “core liposuction” but it was a necessary one to meet VMware’s “logical processor per host” limitation of 64. That’s right, currently VMware’s vSphere places a limit on logical processors based on the following formula:

CPU_Sockets × Cores_Per_Socket × Threads_Per_Core ≤ 64

According to VMware, the other 32 cores would have been “ignored” by vSphere had they been enabled. Since “ignored” is a nebulous term (aka “undefined”), NEC did the “scientific” thing by disabling 32 cores and calling the system a 64-core server. The win here: a net 6% improvement in performance per tile over the 6-core configuration – ostensibly from the reduced core loading on the 16MB of L3 cache per socket and reduction in memory bus contention.

Moving forward to 2010, what does this mean for vSphere hardware configurations in the wake of 8-core, 16-thread Intel Nehalem-EX and 12-core, 12-thread AMD Magny-Cours processors? With a 4-socket Magny-Cours system limitation, we won’t be seeing any VMmarks from the boys in green beyond 48 cores. Likewise, the boys in blue will be trapped by a VMware limitation (albeit a somewhat arbitrary and artificial one) into a 4-socket, 64-thread (HT) configuration or an 8-socket, 64-core (HT-disabled) configuration for their Nehalem-EX platform – even if using the six-core variant of EX. Looks like VMware will need to lift the 64-LCPU embargo by Q2/2010 just to keep up.
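To make the limit concrete, here is a minimal Python sketch that applies the sockets × cores × threads formula to the configurations mentioned above. The function and dictionary names are ours; the socket, core and thread counts are the ones quoted in the text.

```python
VSPHERE_MAX_LOGICAL_CPUS = 64  # per-host limit quoted in the article


def logical_cpus(sockets, cores_per_socket, threads_per_core=1):
    """Logical processors a host presents to the hypervisor."""
    return sockets * cores_per_socket * threads_per_core


def fits_limit(sockets, cores_per_socket, threads_per_core=1,
               limit=VSPHERE_MAX_LOGICAL_CPUS):
    """True if the configuration stays within the logical CPU limit."""
    return logical_cpus(sockets, cores_per_socket, threads_per_core) <= limit


configs = {
    "NEC A1160, all 6 cores per socket": (16, 6, 1),
    "NEC A1160, 4 cores per socket enabled": (16, 4, 1),
    "4-socket Magny-Cours": (4, 12, 1),
    "4-socket Nehalem-EX, HT on": (4, 8, 2),
    "8-socket Nehalem-EX, HT off": (8, 8, 1),
    "8-socket Nehalem-EX, HT on": (8, 8, 2),
}

for name, cfg in configs.items():
    status = "ok" if fits_limit(*cfg) else "over limit"
    print(f"{name}: {logical_cpus(*cfg)} logical CPUs ({status})")
```

Only the fully enabled NEC box (96) and the 8-socket Nehalem-EX with HT (128) exceed the limit; the other configurations land at 48 or exactly 64.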

Fujitsu’s RX300 S5 rack server takes the top spot in VMware’s VMmark for 8-core systems today with a score of 25.16@17 tiles. Loaded with two of Intel’s top-bin 3.33GHz, 130W Nehalem-EP processors (W5590, turbo to 3.6GHz per core) and 96GB of DDR3-1333 R-ECC memory, the RX300 bested the former champ – the HP ProLiant BL490c G6 blade – by only about 2.5%.

With 17 tiles and 102 virtual machines on a single 2U box, the RX300 S5 demonstrates precisely how well vSphere scales on today’s x86 commodity platforms. It also appears to demonstrate both the value and the limits of Intel’s “turbo mode” in its top-bin Nehalem-EP processors – especially in the virtualization use case – we’ll get to that later. In any case, the resulting equation is:

More * (Threads + Memory + I/O) = Dense Virtualization

We could have added “higher execution rates” to that equation; however, virtualization is a scale-out application where threads, memory pool and I/O capabilities dominate the capacity equation – not clock speed. Adding 50% more clock provides smaller virtualization gains than adding 50% more cores, and reducing memory and context latency likewise provides better gains than simply upping the clock speed. That’s why a dual quad-core Nehalem 2.6GHz processor will crush a quad dual-core 3.5GHz (ill-fated) Tulsa.

Speaking of Tulsa, unlike Tulsa’s rather anaemic first-generation hyper-threading, Intel’s improved SMT in Nehalem “virtually” adds more core “power” to the Xeon by contributing up to 100% more thread capacity. This is demonstrated by Nehalem-EP’s contribution of 2 tiles per core to VMmark where AMD’s six-core Istanbul provides only 1 tile per core. But exactly what is a VMmark tile and how does core versus thread play into the result?

The Illustrated VMmark "Tile" Load

As you can see, a “VMmark Tile” – or just “tile” for short – is composed of 6 virtual machines, half running Windows, half running SUSE Linux. Likewise, half of the virtual machines run in 64-bit mode while the other half run in 32-bit mode. As a whole, the tile is composed of 10 virtual CPUs, 5GB of RAM and 62GB of storage. Looking at how the parts contribute to the whole, the tile is relatively balanced:

Operating System / Mode | 32-bit | 64-bit | Memory | vCPU | Disk
--- | --- | --- | --- | --- | ---
Windows Server 2003 R2 | 67% | 33% | 45% | 50% | 58%
SUSE Linux Enterprise Server 10 SP2 | 33% | 67% | 55% | 50% | 42%
32-bit | 50% | N/A | 30% | 40% | 58%
64-bit | N/A | 50% | 70% | 60% | 42%

If we stop here and accept that today’s best x86 processors from AMD and Intel are capable of providing 1 tile for each thread, we can look at the thread count and calculate the number of tiles and the resulting memory requirement. While that sounds like a good “rule of thumb” approach, it ignores specific use case scenarios where synthetic threads (like HT and SMT) do not scale linearly like core threads do: in those cases, SMT accounts for only about a 12% gain over a single-threaded core, clock-for-clock. For this reason, processors from AMD and Intel in 2010 will feature more cores – 12 for AMD’s Magny-Cours and 8 for Intel’s Nehalem-EX (aka “Beckton”).
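The rule of thumb can be sketched in a few lines of Python. The per-tile figures (6 VMs, 10 vCPUs, 5GB RAM, 62GB storage) come from the tile description above; the function names are ours, and the memory figure is the RAM assigned to guests only, not a full host sizing.

```python
# Per-tile resources as described in the article.
TILE_VMS = 6
TILE_VCPUS = 10
TILE_RAM_GB = 5
TILE_DISK_GB = 62


def tiles_by_thread_rule(sockets, cores_per_socket, threads_per_core):
    """Rule of thumb from the text: one tile per hardware thread."""
    return sockets * cores_per_socket * threads_per_core


def tile_footprint(tiles):
    """Aggregate guest resources for a given number of tiles."""
    return {
        "vms": tiles * TILE_VMS,
        "vcpus": tiles * TILE_VCPUS,
        "ram_gb": tiles * TILE_RAM_GB,
        "disk_gb": tiles * TILE_DISK_GB,
    }


# Dual-socket, quad-core Nehalem-EP with SMT: 16 threads, so 16 tiles.
estimate = tiles_by_thread_rule(2, 4, 2)
print(estimate, tile_footprint(estimate))

# The Fujitsu result actually reached 17 tiles.
print(17, tile_footprint(17))
```

The estimate of 16 tiles (96 VMs, 80GB of guest RAM) sits just under the 17 tiles and 102 VMs Fujitsu reported, which is about as close as a rule of thumb can be expected to get.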

Learning from the Master

If we want to gather some information about a specific field, we consult an expert, right? Judging from the results, Fujitsu’s latest dual-processor entry has definitely earned the title “Master of VMmark” in 2P systems – at least for now. So instead of the usual VMmark $/VM analysis (which is well established for recent VMmark entries), let’s look at the solution profile and try to glean some nuggets to take back to our data centers.

It’s Not About Raw Speed

First, we’ve noted that the processor used is not Intel’s standard “rack server” fare, but the more workstation-oriented W-series Nehalem at 130W TDP. With “turbo mode” active, this CPU is capable of driving the 3.33GHz core – on a per-core basis – up to 3.6GHz. Since we’re seeing only a 2.5% improvement in overall score versus the ProLiant blade at 2.93GHz, we can infer that the 2.93GHz X5570 Xeon is spending a lot of time at 3.33GHz – its “turbo” speed – while the power-hungry W5590 spends little time at 3.6GHz. How can we say this? By looking at the tile ratio as a function of the clock speed.

We know that the X5570 can run up to 3.33GHz, per core, according to thermal conditions on the chip. With proper cooling, this could mean up to 100% of the time (sorry, Google). Assuming for a moment that this is the case in the HP test environment (and there is sufficient cause to think so) then the ratio of the tile score to tile count and CPU frequency is 0.433 (24.54/17/3.33). If we examine the same ratio for the W5590, assuming the clock speed of 3.33GHz, we get 0.444 – a difference of 2.5%, or the contribution of “turbo” in the W5590. Likewise, if you back-figure the “apparent speed” of the X5570 using the ratio of the clock-locked W5590, you arrive at 3.25GHz for the X5570 (an 11% gain over base clock). In either case, it is clear that “turbo” is a better value at the low-end of the Nehalem spectrum as there isn’t enough thermal headroom for it to work well for the W-series.
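The arithmetic behind those figures can be reproduced with a short Python sketch. The scores, tile counts and clock speeds are the ones quoted in the text; the variable and function names are ours, and the "apparent speed" is only the back-of-envelope estimate described above, not a measured clock.

```python
def score_per_tile_ghz(score, tiles, ghz):
    """VMmark score normalized by tile count and clock speed."""
    return score / tiles / ghz


# HP BL490c G6 (X5570), assumed to sit at its 3.33GHz turbo speed.
hp_ratio = score_per_tile_ghz(24.54, 17, 3.33)
# Fujitsu RX300 S5 (W5590), assumed locked at its 3.33GHz base clock.
fujitsu_ratio = score_per_tile_ghz(25.16, 17, 3.33)

# Relative difference, attributed in the text to turbo on the W5590.
turbo_gain = fujitsu_ratio / hp_ratio - 1

# Apparent X5570 clock if it delivered the W5590's per-GHz ratio.
apparent_ghz = 24.54 / 17 / fujitsu_ratio
gain_over_base = apparent_ghz / 2.93 - 1

print(f"HP ratio:       {hp_ratio:.3f}")
print(f"Fujitsu ratio:  {fujitsu_ratio:.3f}")
print(f"Turbo gain:     {turbo_gain:.1%}")
print(f"Apparent X5570: {apparent_ghz:.2f}GHz ({gain_over_base:.0%} over base)")
```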

VMmark Equals Meager Network Use

Second, we’re not seeing “fancy” networking tricks out of VMmark submissions. In the past, we’ve commented on the use of “consumer grade” switches in VMmark tests. For this reason, we can consider VMmark’s I/O dependency as related almost exclusively to storage. With respect to networking, the Fujitsu team simply interfaced three 1Gbps network adapter ports to the internal switch of the blade enclosure used to run the client-side load suite and ran with the test. Here’s what that looks like:

Note that the network interfaces used for the VMmark trial are not from the on-board i82575EB network controller but from the PCI-Express quad-port adapter using its older cousin – the i82571EB. What is key here is that VMmark is not tied to network performance, and additional network ports are more likely to increase IRQ sharing and reduce performance than to help by “optimizing” network flows.

Keeping Storage “Simple”

Third, Fujitsu’s approach to storage is elegantly simple: several “inexpensive” arrays with intelligent LUN allocation. For this, Fujitsu employed eight of its ETERNUS DX80 Disk Storage Systems with 7 additional storage shelves for a total of 172 working disks and 23 LUNs. For simplicity, Fujitsu used a pair of 8Gbps FC ports to feed ESX and at least one port per DX80 – all connected through a Brocade 5100 fabric switch. The result looked something like this:

And yes, the ESX server is configured to boot from SAN, using no locally attached storage. Note that the virtual machine configuration files, VM swap and ESX boot/swap are contained in a separate DX80 system. This “non-default” approach allows the working VMDKs of the virtual machines to be isolated – from a storage perspective – from the swap file overhead, about 5GB per tile. Again, this is a benchmark scenario, not an enterprise deployment, so trade-offs are in favour of performance, not CAPEX or OPEX.

Even if the DX80 solution falls into the $1K/TB range, to say that this approach to storage is “economic” requires a deeper look. At 33 rack units for the solution – including the FC switch but not including the blade chassis – this configuration has a hefty datacenter footprint. Compared with the old-school server/blade approach, however, 1 rack at 3 servers per U is a huge savings over the 2 racks of blades or 3 racks of 1U rack servers. Had each of those servers or blades had a mirror pair, we’d be talking about 200+ disks spinning in those racks versus the 172 disks in the ETERNUS arrays, so that still represents a savings of 15.7% in storage-related power/space.

When will storage catch up?

Compared to a 98% reduction in network ports, a 30-80% reduction in server/storage CAPEX (based on $1K/TB SAN) and a 50-75% reduction in overall datacenter footprint, why is a 15% reduction in datacenter storage footprint acceptable? After all, storage – in the Fujitsu VMmark case – now represents 94% of the datacenter footprint. Even if the load were less aggressively spread across five ESX servers (a conservative 20:1 loading), the amount of space taken by storage only falls to 75%.

How can storage catch up to virtualization densities? First, with 2.5″ SAS drives, a bank of 172 disks can be made to occupy only 16U with very strong performance. This drops storage to only 60% of the datacenter footprint – 10U for hypervisor, 16U for storage, 26U total for this example. Moving from 3.5″ drives to 2.5″ drives takes care of the physical scaling issue with acceptable returns, but results in only minimal gains in terms of power savings.
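A quick Python sketch of the footprint arithmetic follows. The rack-unit figures (33U of storage, a 2U server, 10U for five servers, 16U for 2.5″ drives) and the 172-disk count come from the text; the 204-disk figure assumes a mirrored pair for each of the 102 consolidated servers, which is our reading of "200+ disks". The exact shares come out slightly above the rounder 75% and 60% quoted in the text.

```python
def storage_share(storage_u, server_u):
    """Fraction of total rack units occupied by storage."""
    return storage_u / (storage_u + server_u)


single_server = storage_share(33, 2)   # one 2U ESX host
five_servers = storage_share(33, 10)   # five 2U hosts at 20:1 loading
small_drives = storage_share(16, 10)   # same hosts, 2.5-inch drive shelves

# Assumption: 102 physical servers, each with a mirrored pair of disks.
mirrored_disks = 102 * 2
disk_saving = (mirrored_disks - 172) / mirrored_disks

print(f"Storage share, 1 server:    {single_server:.0%}")
print(f"Storage share, 5 servers:   {five_servers:.0%}")
print(f"Storage share, 2.5in disks: {small_drives:.0%}")
print(f"Disk count saving:          {disk_saving:.1%}")
```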

Saving power in storage platforms is not going to be achieved by simply shrinking disk drives – shrinking the NUMBER of disks required per “effective” LUN is what’s necessary to overcome the power demands of modern, high-performance storage. This is where non-traditional technology like FLASH/SSD is being applied to improve performance while utilizing fewer disks and proportionately less power. For example, instead of dedicating disks on a per LUN basis, carving LUNs out of disk pools accelerated by FLASH (a hybrid storage pool) can result in a 30-40% reduction in disk count – when applied properly – and that means 30-40% reduction in datacenter space and power utilization.

Lessons Learned

Here are our “take aways” from the Fujitsu VMmark case:

1) Top-bin performance is at the losing end of diminishing returns. Unless your budget can accommodate this fact, purchasing decisions about virtualization compute platforms need to be aligned with $/VM within an acceptable performance envelope. When shopping CPU, make sure the top-bin’s “little brother” has the same architecture and feature set and go with the unit priced for the mainstream. (Don’t forget to factor memory density into the equation…) Regardless, try to stick within a $190-280/VM equipment budget for your hypervisor hardware and shoot for a 20-to-1 consolidation ratio (that’s at least $3,800-5,600 per server/blade).

2) While networking is not important to VMmark, this is likely not the case for most enterprise applications. Therefore, VMmark is not a good comparison case for your network-heavy applications. Also, adding more network ports increases capacity and redundancy but does so at the risk of IRQ-sharing (ESX, not ESXi) problems, not to mention the additional cost/number of network switching ports. This is where we think 10GE will significantly change the equation in 2010. Remember to add up the total number of in-use ports – including out-of-band management – when factoring in switch density. For net new installations, look for a switch that provides 10GE/SR or 10GE/CX4 options and go with 10GE/SR if power savings are driving your solution.

3) Storage should be simple, easy to manage, cheap (relatively speaking), dense and low-power. To meet these goals, look for storage technologies that utilize FLASH memory, tiered spindle types, smart block caching and other approaches to limit spindle count without sacrificing performance. Remember to factor in at least the cost of DAS when approximating your storage budget – about $150/VM in simple consolidation cases and $750/VM for more mission critical applications (that’s a range of $9,000-45,000 for a 3-server virtualization stack). The economies in managed storage come chiefly from the administration of the storage, but try to identify storage solutions that reduce datacenter footprint including both rack space and power consumption. Here’s where offerings from Sun and NexentaStor are showing real gains.
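The budget ranges in items 1 and 3 follow from simple multiplication, sketched here in Python. The per-VM dollar figures, the 20:1 consolidation ratio and the 3-server stack are taken from the text; the function names are ours.

```python
def per_server_budget(cost_per_vm, vms_per_server=20):
    """Hypervisor hardware budget for one server at a given consolidation ratio."""
    return cost_per_vm * vms_per_server


def stack_storage_budget(cost_per_vm, servers=3, vms_per_server=20):
    """Storage budget for a whole virtualization stack."""
    return cost_per_vm * servers * vms_per_server


# Hypervisor hardware: $190-280 per VM at 20:1.
print(per_server_budget(190), per_server_budget(280))

# Storage: $150 per VM (simple consolidation) to $750 per VM (mission critical).
print(stack_storage_budget(150), stack_storage_budget(750))
```

Changing the consolidation ratio or server count shows how quickly the storage line dominates the hardware line in the more demanding cases.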

We’d like to see VMware update VMmark to include system power specifications so we can better gauge – from the sidelines – which solution stack(s) perform according to our needs. VMmark served its purpose by giving the community a standard from which different platforms could be compared in terms of the resultant performance. With the world’s eyes on power consumption and the ecological impact of datacenter choices, adding a “power utilization component” to the “server-side” of the VMmark test would not be that significant of a “tweak.” Here’s how we think it can be done:

Require power consumption of the server/VMmark related components be recorded, including:

Power delivered to the test harness platforms, client load machines, etc. can be ignored;

Power measurements should be recorded at the following times:

    All equipment off (validation check);

    Start-up;

    Single tile load;

    100% tile capacity;

    75% tile capacity;

    50% tile capacity;

Power measurements should be recorded using a time-power data-logger with readings recorded as 5-minute averages;

Notations should be made concerning “cache warm-up” intervals, if applicable, where “cache optimized” storage is used.
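As an illustration of the 5-minute averaging we have in mind, here is a minimal Python sketch. The sample readings are made up for the example, the function name is ours, and the simple mean assumes the logger samples at a reasonably even rate within each window.

```python
from collections import defaultdict


def five_minute_averages(samples, window_s=300):
    """Average (seconds_since_start, watts) readings into fixed windows.

    Returns a dict mapping each window's start time in seconds to the
    mean of the readings that fall inside that window.
    """
    buckets = defaultdict(list)
    for seconds, watts in samples:
        window_start = int(seconds // window_s) * window_s
        buckets[window_start].append(watts)
    return {start: sum(vals) / len(vals)
            for start, vals in sorted(buckets.items())}


# Hypothetical readings: three in the first window, two in the second.
readings = [(0, 400.0), (60, 420.0), (240, 440.0), (300, 500.0), (540, 520.0)]
averages = five_minute_averages(readings)
print(averages)
```

Each of the measurement points in the list above (start-up, single tile, 50%, 75% and 100% capacity) would then be reported as one or more of these window averages.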

Why is this important? In the wake of the VCE announcement, solution stacks like VCE need to be measured against each other in an easy to “consume” way. Is VCE the best platform versus a component solution provided by your local VMware integrator? Given that the differentiated VCE components are chiefly UCS, Cisco switching and EMC storage, it will be helpful to have a testing platform that can better differentiate “packaged solutions” instead of uncorrelated vendor “propaganda.”

Let us know what your thoughts are on the subject, either on Twitter or on our blog…

Yesterday Jeff Bonwick (Sun) announced that deduplication is now officially part of ZFS – Sun’s Zettabyte File System that is at the heart of Sun’s Unified Storage platform and NexentaStor. In his post, Jeff touched on the major issues surrounding deduplication in ZFS:

Deduplication in ZFS is Block-level

ZFS provides block-level deduplication because this is the finest granularity that makes sense for a general-purpose storage system. Block-level dedup also maps naturally to ZFS’s 256-bit block checksums, which provide unique block signatures for all blocks in a storage pool as long as the checksum function is cryptographically strong (e.g. SHA256).

Deduplication in ZFS is Synchronous

ZFS assumes a highly multithreaded operating system (Solaris) and a hardware environment in which CPU cycles (GHz times cores times sockets) are proliferating much faster than I/O. This has been the general trend for the last twenty years, and the underlying physics suggests that it will continue.

Deduplication in ZFS is Per-Dataset

Like all zfs properties, the ‘dedup’ property follows the usual rules for ZFS dataset property inheritance. Thus, even though deduplication has pool-wide scope, you can opt in or opt out on a per-dataset basis. Most storage environments contain a mix of data that is mostly unique and data that is mostly replicated. ZFS deduplication is per-dataset, which means you can selectively enable dedup only where it is likely to help.

Deduplication in ZFS is based on a SHA256 Hash

Chunks of data — files, blocks, or byte ranges — are checksummed using some hash function that uniquely identifies data with very high probability. When using a secure hash like SHA256, the probability of a hash collision is about 2^-256 = 10^-77. For reference, this is 50 orders of magnitude less likely than an undetected, uncorrected ECC memory error on the most reliable hardware you can buy.
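The order of magnitude quoted here is easy to check in Python. This only verifies that 2^-256 is roughly 10^-77; it says nothing about the ECC comparison, which we take from Jeff's post as given.

```python
import math

# Collision probability figure quoted for a 256-bit hash.
p_collision = 2.0 ** -256

# Express it as a power of ten.
exponent = math.log10(p_collision)

print(f"2^-256 is about 10^{exponent:.2f}")
```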

Deduplication in ZFS can be Verified

[If you are paranoid about potential “hash collisions”] ZFS provides a ‘verify’ option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if not.
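To illustrate the idea of hash-keyed block deduplication with an optional verify step, here is a toy Python sketch. It is not how ZFS is implemented: the class, its in-memory dictionaries and the way a verified mismatch is stored under a second slot are all our own simplifications for illustration.

```python
import hashlib


class DedupStore:
    """Toy in-memory block store that deduplicates on SHA-256 digests."""

    def __init__(self, verify=False):
        self.verify = verify
        self.blocks = {}    # (digest, slot) -> block bytes
        self.refcount = {}  # (digest, slot) -> number of references

    def write(self, block):
        """Store a block and return its key, sharing storage with duplicates."""
        digest = hashlib.sha256(block).digest()
        slot = 0
        while True:
            key = (digest, slot)
            stored = self.blocks.get(key)
            if stored is None:
                # First time this digest (or slot) is seen: store the data.
                self.blocks[key] = block
                self.refcount[key] = 1
                return key
            if not self.verify or stored == block:
                # Without verify the hash is trusted; with verify the
                # bytes were compared and really are identical.
                self.refcount[key] += 1
                return key
            # Verified mismatch (a hash collision): try the next slot.
            slot += 1

    def read(self, key):
        return self.blocks[key]


store = DedupStore(verify=True)
first = store.write(b"A" * 4096)
second = store.write(b"A" * 4096)   # duplicate block, no new storage
third = store.write(b"B" * 4096)    # unique block
print(len(store.blocks), store.refcount[first])
```

Two identical 4KB writes consume one stored block with a reference count of two, which is the effect that makes cloned virtual machines such a good fit for dedup.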

Deduplication in ZFS is Scalable

ZFS places no restrictions on your ability to dedup. You can dedup a petabyte if you’re so inclined. The performance of ZFS dedup will follow the obvious trajectory: it will be fastest when the DDTs (dedup tables) fit in memory, a little slower when they spill over into the L2ARC, and much slower when they have to be read from disk – but the point I want to emphasize here is that there are no limits in ZFS dedup. ZFS dedup scales to any capacity on any platform, even a laptop; it just goes faster as you give it more hardware.

What does this mean for ZFS users? That depends on the application, but highly duplicated environments like virtualization stand to gain significant storage-related value from this small addition to ZFS. Considering the various ways virtualization administrators deal with virtual machine cloning, even the basic VMware template approach (not using linked-clones) will now result in significant storage savings. This restores parity between storage and compute in the virtualization stack.

What does it mean for ZFS-based storage vendors? More main memory and processor threads will be necessary to limit the impact on performance. With 6-core and 8-thread CPUs available in the mainstream, this problem is very easily resolved. Just as the L2ARC tables consume main memory, the DDTs will require an increase in main memory for larger datasets. Testing and configuration convergence will likely take 2-3 months once dedupe is mainstream.

When can we expect to see dedupe added to ZFS (i.e. OpenSolaris)? According to Jeff, “in roughly a month.”
