ScaleMP: Use RAM plus vSMP, not flash, to boost server performance

Partners with Big Blue, chases SGI UV2 shared memory systems

There are hypervisors that chop a single server into virtual bits, and other hypervisors that take multiple servers and make them look like one big virtual one. ScaleMP's vSMP hypervisor is the latter kind, and can be used to create a shared memory x86-based system running Linux, something that would normally require special processors and chipsets. And a much higher price tag.

ScaleMP started out peddling vSMP to customers as an alternative to big SMP machines like those from Silicon Graphics, IBM, Hewlett-Packard, and Oracle. But with the hype around big data these days, Shai Fultheim, founder and CEO at ScaleMP, says the company was seeing more need for larger memories than for larger compute capacities. It has therefore, with vSMP 5.1, rolled up a new SKU of the product that is tuned specifically to take a bunch of cheap server nodes and use them as memory expansion boxes for a big fat node. The end result is a significantly less expensive – and asymmetric – shared memory system than what you get from an SMP box based on high-end x86 or RISC processors and special chipsets to handle terabytes of main memory.

The dirty little secret out there in the data centers of the world is that most of the database, middleware, and application code is not designed to scale across lots of cores and threads. "This software is not really written to use all of the processing power in a modern machine," says Fultheim. "The problem is not the CPUs. The problem is the memory."

Meaning, this software can run in ever-embiggening chunks of main memory and get a big performance bump. The trouble is, CPU and memory capacity in modern servers are pretty much locked down. The memory hangs off a processor socket and its controllers are on the processor die, and there are very prescribed memory capacities for machines with one, two, four, or eight sockets. In general, as a machine increases in aggregate CPU performance, main memory capacity also increases, but so does the cost of the processors and the memory sticks in the machine. And even if your job is not compute bound and you don't need all the cores and threads, if you want to make a fat memory system you are pushed into buying a big bad box whether you like it or not.

To serve the needs of analytics and other kinds of big data jobs where memory matters a lot more than compute, ScaleMP has ginned up the vSMP 5.1 aggregation hypervisor into two flavors. The first is called vSMP System Expansion, which is tuned to scale up both processing and memory in a balanced fashion, like a regular SMP server based on a physical chipset does. This is the vSMP that ScaleMP has been peddling for many years. The new flavor is called vSMP Memory Expansion, and it is designed and tuned explicitly for machines that are going to be unbalanced – but in a good way.

There are a lot of ways to play this asymmetric configuration game with vSMP Memory Expansion, but the basic idea is outlined in the scenario in the first chart in this story. Rather than try to figure out how to put flash in a server to accelerate a database or big data workload, Fultheim says keep it simple and build the biggest memory space you can. A Fusion-io flash card has half the I/O operations per second of a chunk of memory for the same dollar, according to Fultheim, so memory is the better option. (SGI, trying to push its Xeon E5-based UV2 shared memory supers with their NUMAlink 6 interconnect, would agree with this approach, as would IBM, HP, and Oracle with their big RISC or Itanium iron and fat SMP chipsets.) You can do a direct connect between up to four nodes using 56Gb/sec FDR InfiniBand host adapters and cables using the DC2 interconnect coded into the vSMP hypervisor since November 2009. Or if you want to scale up to 128 server nodes, you plug the servers into an InfiniBand switch.

In the example above, the workload in question only needs a four-socket Xeon E5-4600 or Xeon E7-4800 in terms of the processing capacity, but the 48 to 64 memory sticks in this box do not offer enough main memory capacity, and moreover, the fat memory sticks needed to build up terabytes of memory space are very expensive. So instead of buying an eight-socket box to get more memory slots, you get the four-socket box and put in the fastest Xeon or Opteron processors you can afford. Then you buy a bunch of skinny server nodes with 24 memory sticks each, and you turn off the cores and leave on the memory controllers and memory in the boxes as well as the InfiniBand ports, and now the FDR links are effectively a backplane for an SMP based on the vSMP hypervisor.

Fat memory, lots of skinny nodes

Depending on what memory capacities you choose for the memory sticks in these memory expansion nodes, you can get somewhere between 3.75TB and 7.5TB of main memory that is all directly addressable by the four-socket machine. All you threw away was around $200 per socket for the unused computing cores in the expansion machines.
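The arithmetic behind that range can be sketched roughly as follows. The count of ten expansion nodes and the 16GB and 32GB stick sizes are assumptions, picked because they reproduce the quoted figures with 24 sticks per node; the function name is purely illustrative.

```python
STICKS_PER_NODE = 24   # from the story: skinny nodes with 24 memory sticks each
EXPANSION_NODES = 10   # assumption: the node count is not stated in the text


def expansion_memory_tb(stick_gb, nodes=EXPANSION_NODES, sticks=STICKS_PER_NODE):
    """Memory contributed by the expansion nodes, in TB (1TB = 1,024GB)."""
    return nodes * sticks * stick_gb / 1024


# Assumed stick sizes; other sizes would give other totals.
for stick_gb in (16, 32):
    print(f"{stick_gb}GB sticks: {expansion_memory_tb(stick_gb):.2f}TB")
```

Under those assumptions the two stick sizes land on 3.75TB and 7.5TB respectively, before counting whatever memory sits in the four-socket node itself.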

This turns out to be a good trade-off, as you can see:

Provided your workload runs well on vSMP, an asymmetric cluster is considerably cheaper than a real fat SMP box

That is a comparison that Fultheim cooked up showing the cost of the processor, memory, and Oracle 11g Enterprise Edition running on machines with two, four, or eight sockets with specific memory capacities. Oracle software is a lot more expensive on four-socket machines than on two-socket boxes. It also costs more to beef up the memory on any given server size because you are moving from 8GB to 16GB to 32GB memory sticks, and generally, the memory prices are not linear as you get fatter sticks because the cost of producing the denser memory chips is much higher than on lower-density chips. So if you want fat memory on a Xeon E7 server with eight sockets, it can get very pricey indeed, like well over $330,000 for a 4TB box.

Using vSMP memory expansion nodes linked to a four-socket machine, you could build the same 4TB system for under $200,000, or build an 8TB machine – that's twice the memory footprint of the physical eight-socket box – for around $270,000. That's about 20 per cent less money for twice as much addressable memory. And this comparison assumes, of course, that you are memory bound, not compute bound, with your database and analytics workload and that this workload runs on Linux and is amenable to the underlying messaging architecture of the vSMP aggregation hypervisor.
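To see how those numbers stack up, here is a small sketch that treats the quoted prices as round figures. "Well over $330,000" is taken as a floor and "under $200,000" as a ceiling, so the real gaps would be somewhat wider; the variable and function names are illustrative only.

```python
def saving_pct(baseline, alternative):
    """Percentage saved by buying `alternative` instead of `baseline`."""
    return 100 * (baseline - alternative) / baseline


def cost_per_tb(cost, memory_tb):
    """Dollars per terabyte of addressable memory."""
    return cost / memory_tb


smp_cost, smp_tb = 330_000, 4      # eight-socket Xeon E7 box (quoted as a floor)
vsmp_4tb_cost = 200_000            # vSMP cluster, 4TB (quoted as a ceiling)
vsmp_8tb_cost = 270_000            # vSMP cluster, 8TB (quoted as "around")

print(f"4TB vSMP vs 4TB SMP: {saving_pct(smp_cost, vsmp_4tb_cost):.1f}% cheaper")
print(f"8TB vSMP vs 4TB SMP: {saving_pct(smp_cost, vsmp_8tb_cost):.1f}% cheaper")
print(f"SMP:      ${cost_per_tb(smp_cost, smp_tb):,.0f} per TB")
print(f"8TB vSMP: ${cost_per_tb(vsmp_8tb_cost, 8):,.0f} per TB")
```

On these round figures the 8TB cluster comes in a little over 18 per cent cheaper than the 4TB SMP box, and at well under half the cost per terabyte.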

vSMP is available in two different editions, vSMP Foundation and vSMP Foundation Advanced Platform, with the Memory Expansion and System Expansion variants available within each. vSMP Foundation scales up to 32 nodes in a single system image and addresses up to 32TB of memory across those nodes. In the Memory Expansion variant, you can only have one fat node with the CPUs turned on and all of the processing on the other nodes is deactivated (with the memory controllers, memory, and I/O controllers obviously remaining on once the vSMP hypervisor is booted into memory and running). The Memory Expansion variant is priced based on the amount of memory addressed in the cluster, and you can get a perpetual license for $10,240 per TB or an annual subscription for $6,144 per TB.

With the System Expansion variant, all of the processors in the 32 nodes can be activated and do computing along with addressing the main memory. A perpetual license costs $400 per socket and an annual subscription costs $240 per socket.

With vSMP Foundation Advanced Platform, the vSMP aggregation hypervisor can scale up to 128 nodes and up to 256TB of addressable memory across the cluster of servers. This Advanced Platform edition also has the ability to support active-active multi-rail InfiniBand links between servers and switches, with up to four host channel adapters yielding up to 224Gb/sec of bandwidth into and out of the node for that virtual SMP memory addressing. The Advanced Platform also allows for the virtual SMP to be partitioned into virtual machines. With the Memory Expansion variant, Advanced Platform once again only allows for one node to do actual computing; it costs the same as vSMP Foundation, at $10,240 per TB for a perpetual license or $6,144 per TB for an annual subscription. With the System Expansion variant, Advanced Platform costs twice as much as vSMP Foundation, at $800 per socket for a perpetual license and $480 per socket for an annual subscription.
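The list prices above boil down to a simple lookup, sketched here. The function names and dictionary layout are made up for illustration, the example cluster sizes are assumptions, and the sketch ignores any discounts or support fees that might apply.

```python
# Memory Expansion: priced per TB of addressed memory, same in both editions.
PER_TB = {"perpetual": 10_240, "annual": 6_144}

# System Expansion: priced per processor socket, by edition.
PER_SOCKET = {
    "foundation": {"perpetual": 400, "annual": 240},
    "advanced": {"perpetual": 800, "annual": 480},
}


def memory_expansion_license(memory_tb, term="perpetual"):
    """List price for the Memory Expansion variant."""
    return memory_tb * PER_TB[term]


def system_expansion_license(sockets, edition="foundation", term="perpetual"):
    """List price for the System Expansion variant."""
    return sockets * PER_SOCKET[edition][term]


# Hypothetical examples: an 8TB memory cluster, and 32 two-socket nodes.
print(memory_expansion_license(8))                     # 81920
print(memory_expansion_license(8, "annual"))           # 49152
print(system_expansion_license(64))                    # 25600
print(system_expansion_license(64, "advanced"))        # 51200
```

Put another way, the Memory Expansion prices work out to $10 per GB for a perpetual license and $6 per GB per year for a subscription.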

Within the next month or so, ScaleMP will roll out a variant of vSMP Foundation called Memory Expansion Free, which will, as the name suggests, be available for download at no cost. Memory Expansion Free will have one compute node and up to a total of eight nodes in a cluster; it is also limited to four sockets of processing in a machine and 1TB of aggregate main memory across the server nodes in the cluster. You can use SUSE Linux Enterprise Server 11, Red Hat Enterprise Linux 5 or 6, or Oracle Linux 5 with the freebie version, just like the full-on version.
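A quick way to check whether a planned proof of concept would fit inside the freebie's caps is sketched below. The limits are the ones quoted above; the function, its parameters, and the example configurations are hypothetical and are not part of any ScaleMP tooling.

```python
# Caps for Memory Expansion Free, as described in the story.
MAX_NODES = 8          # total nodes in the cluster, including the compute node
MAX_SOCKETS = 4        # processing sockets in the single compute node
MAX_MEMORY_GB = 1024   # 1TB of aggregate memory across all nodes


def free_edition_problems(nodes, compute_sockets, memory_gb_per_node):
    """Return a list of reasons a config exceeds the free caps (empty if it fits)."""
    problems = []
    if nodes > MAX_NODES:
        problems.append(f"{nodes} nodes exceeds the cap of {MAX_NODES}")
    if compute_sockets > MAX_SOCKETS:
        problems.append(f"{compute_sockets} sockets exceeds the cap of {MAX_SOCKETS}")
    total_gb = nodes * memory_gb_per_node
    if total_gb > MAX_MEMORY_GB:
        problems.append(f"{total_gb}GB of memory exceeds the cap of {MAX_MEMORY_GB}GB")
    return problems


# Hypothetical configs: eight 128GB nodes fit; eight 192GB nodes do not.
print(free_edition_problems(8, 4, 128))
print(free_edition_problems(8, 4, 192))
```

For simplicity the sketch assumes every node carries the same amount of memory, which need not be true of a real cluster.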

The Memory Expansion Free edition will rely on community support, not 24x7 tech support from ScaleMP, and will have limited expandability. But it should still be useful for small installations and proofs of concept, and you can't beat the price.