ScaleMP takes self in hand, pumps its offering even bigger

With server nodes getting more cores and fatter main memories, you might be thinking that the need for larger symmetric multiprocessing (SMP) servers, whether they are physical, like the ones sold by the major system makers, or virtual, like those created using the vSMP hypervisor and interconnect from ScaleMP, would be diminishing. Not so.

Workloads continue to swell and companies are also virtualizing and consolidating those workloads onto fewer – and embiggened – SMP boxes. Some customers want a big, fat, expensive SMP box like the RISC or Itanium Unix machines sold by Oracle, Fujitsu, IBM, and Hewlett-Packard, but they can't afford them. So they are looking to create a virtual SMP box using the aptly named vSMP Foundation software from ScaleMP, which uses a fast InfiniBand or Ethernet link and sophisticated caching software to lash multiple x86 server nodes into a shared memory system.

Back in September, ScaleMP previewed an upcoming vSMP Foundation 4.0 release, touting the fact that this would be the first version of the vSMP hypervisor to run atop Advanced Micro Devices' Opteron processors – in this case, the prior generation "Magny-Cours" Opteron 6100s, which are used in two-socket and four-socket physical SMP servers. With vSMP Foundation 4.0, the vSMP hypervisor can link together as many as 128 physical server nodes and address up to 64TB of main memory across those nodes, and in the case of the 16-core Opteron 6200s, that means being able to bring as many as 8,192 cores to bear on a single address space.
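The arithmetic behind those ceilings can be checked with a short Python sketch. It assumes every node is a four-socket box (the largest Opteron configuration mentioned here) and that memory is split evenly across nodes; the variable names are illustrative only.

```python
# Back-of-envelope check of the vSMP Foundation 4.0 ceilings quoted above.
# Assumption: every node is a four-socket Opteron 6200 machine.
MAX_NODES = 128
SOCKETS_PER_NODE = 4      # assumed, not stated per node in the article
CORES_PER_SOCKET = 16     # 16-core Opteron 6200
MAX_MEMORY_TB = 64

total_cores = MAX_NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
# Assumption: the 64TB ceiling is spread evenly across all 128 nodes.
memory_per_node_gb = MAX_MEMORY_TB * 1024 // MAX_NODES

print(total_cores)         # 8192
print(memory_per_node_gb)  # 512
```

Under those assumptions, the 8,192-core figure falls out directly, and a fully populated cluster would average 512GB of memory per node.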

Creating such a large machine would require a very fast, low latency network, and thus far no one has stress-tested vSMP under such loading with real-world production applications. The vSMP clustering technology, which works by having the hypervisor predict what data any given node will want and get it to any processor in the cluster before there is a cache miss, performs best with applications that do a lot of message passing between server nodes – financial modeling, supercomputing, data analytics, and similar parallel workloads. You would not necessarily want to run a back-end ERP system and related database on it.

With vSMP 4.0, ScaleMP is making a few changes to support new chip and networking technology and is also tweaking its pricing and packaging. First, vSMP 4.0 has been tuned for the just announced "Interlagos" 16-core Opteron 6200 processors from AMD. (No surprises there.) The Interlagos chips sport 16 cores and up to 384GB per socket, so customers using ScaleMP can cram 33 per cent more cores than the 12-core Magny-Cours chips offered, and three times the main memory, in each Opteron node. Shai Fultheim, the company's founder and chief executive officer, says that vSMP Foundation is already pre-tuned for the forthcoming "Sandy Bridge-EP" Xeon E5 processors from Intel, which are due in early 2012 in two-socket and eventually four-socket machines. vSMP was tuned to support the much different Xeon 7500 processors, used in four-socket and eight-socket servers, with the vSMP Foundation 3.5 release a year ago, and it has had some tweaks to run tippy-top on the 10-core Xeon E7 processors from Intel.

Perhaps more significantly for Intel-based vSMP clusters, ScaleMP has worked closely with Intel and InfiniBand host bus adapter provider Mellanox Technologies to take advantage of the integrated PCI-Express 3.0 controllers on the Xeon E5 chips and the Fourteen Data Rate (FDR) InfiniBand protocol, which Mellanox supports in its ConnectX-3 adapter cards. The effective bandwidth of a Quad Data Rate (QDR) InfiniBand adapter, which has a peak speed of 40Gb/sec, is 26Gb/sec on a PCI-Express 2.0 slot. On a PCI-Express 3.0 slot, FDR InfiniBand peaks at 56Gb/sec and can be driven at around 52Gb/sec. That is twice the effective bandwidth at about the same latency.

"What this means for vSMP is that we can do a lot more predictive caching," says Fultheim, and that means vSMP runs better. How much will depend on your workload.

To make HPC customers happy, the vSMP 4.0 stack has a tweaked Message Passing Interface (MPI) offload engine, one that is tuned to work better with small MPI messages to the extent that workloads run four times faster. The prior MPI offload engine in vSMP 3.5 worked best with large MPI messages where the chatter between nodes is lower and therefore the overhead of using MPI is lower. Fultheim says that depending on the workload, using the MPI offload engine built into the vSMP software imposes somewhere between a 5 and 10 per cent performance penalty, which you might think HPC customers would be loath to give up. But being able to manage hundreds of nodes as a virtual SMP, creating and destroying fat nodes on the fly, makes a cluster more effective at running a larger number of HPC workloads, so some supercomputer centers (like those at manufacturers running the LS-DYNA finite element analysis tool) are making this choice.

The biggest change with vSMP 4.0 is the pricing, which is lower than what ScaleMP was charging for vSMP 3.5. With the repackaging announced concurrent with the vSMP 4.0 release, there are two editions of the veritable clusterer. vSMP Foundation is the base product, which can be used to link together up to 32 server nodes. These server licenses are "node-locked," which means they are pegged to a specific physical server and the perpetual license cannot be transferred to a new node. The base product includes support for all interconnects and node sizes and support for dual-rail, active-passive InfiniBand networks. This costs $400 per server socket.

vSMP Foundation Advanced Platform provides floating software licenses, which means you can repurpose those licenses on new hardware as you retire old hardware and bring in shiny new boxes. Advanced Platform also adds in the ability to create virtual SMP partitions within a cluster on the fly, or to link the creation of said partitions back to HPC job schedulers such as Bright Cluster Manager, Insight CMU, ROCKS, and xCAT. You will need Advanced Platform to do active-active dual-rail InfiniBand between nodes; this version supports as many as four host adapters per node for a maximum of 208Gb/sec of effective bandwidth between nodes. Advanced Platform is also needed if you want to scale up from 32 nodes toward the ceiling of 128 nodes.
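The bandwidth figures quoted in this article hang together, as a quick Python sketch shows. It assumes that effective bandwidth scales linearly with the number of active-active adapters, which is an idealization; the constant names are illustrative only.

```python
# Effective per-adapter bandwidth figures quoted in the article, in Gb/sec.
QDR_EFFECTIVE_PCIE2 = 26   # QDR InfiniBand on a PCI-Express 2.0 slot
FDR_EFFECTIVE_PCIE3 = 52   # FDR InfiniBand on a PCI-Express 3.0 slot
MAX_ADAPTERS_PER_NODE = 4  # Advanced Platform, active-active

# Ratio of FDR to QDR effective bandwidth per adapter.
speedup = FDR_EFFECTIVE_PCIE3 / QDR_EFFECTIVE_PCIE2

# Assumption: adapters aggregate linearly with no protocol overhead.
aggregate = MAX_ADAPTERS_PER_NODE * FDR_EFFECTIVE_PCIE3

print(speedup)    # 2.0
print(aggregate)  # 208
```

Four FDR adapters at roughly 52Gb/sec apiece yield the 208Gb/sec ceiling, and each one delivers twice what a QDR card manages on a PCI-Express 2.0 slot.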

Pricing for vSMP Foundation Advanced Platform is a bit more complex. The base price runs from $800 to $1,600 per socket, with the price going up as the processor type changes.

In addition to those tiered per-socket charges, Advanced Platform carries a memory tax. The first 32GB of memory are built into the per-socket price; after that, each additional gigabyte per socket costs you an extra $10. If you want ScaleMP to ship you media for each node (usually on a USB stick), that costs $25 per node. The first year of maintenance and support is included with the pricing above; after that, you pay 25 per cent of the license price per year, 20 per cent if you prepay year two's maintenance when you first buy the vSMP licenses.
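To see how those charges stack up, here is a rough Python sketch of an Advanced Platform price estimate. It assumes the 32GB allowance applies per socket, that maintenance is charged on the license price including the memory surcharge, and that media is billed separately; the function names and the example cluster are hypothetical, not ScaleMP's own calculator.

```python
INCLUDED_GB_PER_SOCKET = 32  # memory covered by the per-socket price
EXTRA_GB_PRICE = 10          # dollars per additional GB per socket
MEDIA_PRICE_PER_NODE = 25    # dollars, optional USB media


def license_cost(nodes, sockets_per_node, gb_per_socket, price_per_socket):
    """Up-front license cost in dollars, excluding media (hypothetical helper)."""
    if not 800 <= price_per_socket <= 1600:
        raise ValueError("per-socket price is quoted as $800 to $1,600")
    # Only memory above the included allowance is surcharged.
    extra_gb = max(0, gb_per_socket - INCLUDED_GB_PER_SOCKET)
    per_socket = price_per_socket + extra_gb * EXTRA_GB_PRICE
    return nodes * sockets_per_node * per_socket


def yearly_maintenance(license_price, prepaid=False):
    """Maintenance after year one: 25 per cent, or 20 per cent if prepaid."""
    percent = 20 if prepaid else 25
    return license_price * percent / 100


# Hypothetical cluster: 16 two-socket nodes, 64GB per socket, $800 tier.
cost = license_cost(16, 2, 64, 800)
media = 16 * MEDIA_PRICE_PER_NODE
print(cost)                                    # 35840
print(media)                                   # 400
print(yearly_maintenance(cost))                # 8960.0
print(yearly_maintenance(cost, prepaid=True))  # 7168.0
```

In this example the memory surcharge adds $320 to each $800 socket, so fat-memory nodes can shift the bill noticeably even at the lowest processor tier.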

Even though the vSMP software only runs Linux operating systems on its virtual bare metal, it has seen a rising uptake. (You can run Windows inside of a KVM or Xen hypervisor running atop vSMP, which some of ScaleMP's hosting provider customers do.) Fultheim says that the company has over 300 customers and they in aggregate now have over 3,000 nodes clustered. vSMP first started shipping in 2006, and it took two years to get 300 nodes in the field clustered using its wares. By the end of 2010, the installed node count quadrupled to 1,200, and as 2011 is coming to an end, the company is supporting over 3,000 nodes in the field. About half of those customers are running homegrown applications, and about 40 per cent of them are in what we would call traditional HPC sites. Another 15 per cent or so are manufacturing and engineering firms, another 15 per cent are government facilities, and the rest are scattered across financial services, healthcare, and other firms. ScaleMP is located in Cupertino, California, and has its research and development labs in Israel.

It is also perhaps plump enough for someone to acquire it. Hewlett-Packard and IBM would do so to keep virtual SMPs from hurting actual SMP sales, and Dell might do it to own the virtual SMP market, after having already acquired RNA Networks earlier this year. ®