Cray brings top supercomputer tech to businesses for a mere $500,000

Technology powering world's top supercomputers now in an entry-level package.

Cray, the company that built the world's fastest supercomputer, is bringing its next generation of supercomputer technology to regular ol' business customers with systems starting at just $500,000.

The new XC30-AC systems announced today range in price from $500,000 to roughly $3 million, providing speeds of 22 to 176 teraflops. That's just a fraction of the speed of the aforementioned world's fastest supercomputer, the $60 million Titan, which clocks in at 17.59 petaflops. (A teraflop represents a thousand billion floating point operations per second, while a petaflop is a million billion operations per second.)
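For readers who prefer the arithmetic spelled out, the unit relationships and the Titan comparison work out as follows (a Python illustration; the figures are the article's, the code is ours):

```python
# FLOPS unit prefixes: a teraflop is 10^12 operations per second,
# a petaflop is 10^15 (a thousand teraflops).
TERAFLOP = 10**12
PETAFLOP = 10**15

titan = 17.59 * PETAFLOP        # Titan, world's fastest at the time
xc30_ac_top = 176 * TERAFLOP    # largest XC30-AC configuration

# Even the biggest XC30-AC is about 1% of Titan's speed.
print(f"{xc30_ac_top / titan:.1%}")  # → 1.0%
```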

But in fact, the processors and interconnect used in the XC30-AC are a step up from those used to build Titan. The technology Cray is selling to smaller customers today could someday be used to build supercomputers even faster than Titan.

Titan uses a mix of AMD Opteron and Nvidia processors for a total of 560,640 cores, and uses Cray's proprietary Gemini interconnect.

XC30-AC systems ship with Intel Xeon E5-2600 series processors. (Correction: Cray had told us the systems come with the Xeon 5400, but the product documents say they actually come with the Xeon E5-2600.) It's the first Intel-based supercomputer Cray is selling into smaller businesses, what it calls the "technical enterprise" market. (Cray's previous systems for this market used AMD processors.) Perhaps more importantly, the XC30-AC uses Aries, an even faster interconnect than the one found in Titan.

In short, it's "the latest Intel architectures with the latest Cray interconnect" that will be installed in "future #1 and top 10 class systems," Cray VP of Marketing Barry Bolding told Ars. "It's like buying a car model where you're getting exactly same engine you're getting in a top-of-the-line BMW. The only thing that's changing are some of the peripherals that make the system easier to fit into a data center and make it more affordable." Compared to systems like Titan, Cray says XC30-AC has "physically smaller compute cabinets with 16 vertical blades per cabinet."

Oil and gas firms or electronics companies performing complex simulations are among the potential customers for an XC30-AC supercomputer.

XC30-AC is a followup to the XC30 systems which are meant for larger customers and typically cost tens of millions of dollars. The "AC" refers to the smaller systems being air-cooled instead of water-cooled. The power requirements aren't as immense, and using air cooling makes it easier to install in a wider range of data centers.

XC30 systems scale up to 482 cabinets and 185,000 sockets with more than a million processor cores. The XC30-AC goes from one to eight cabinets, with each holding 16 blades of eight sockets each for 128 sockets in each cabinet. With an Intel 8-core Xeon processor in each socket, that adds up to 1,024 sockets and as many as 8,192 processor cores in an eight-cabinet system. A single cabinet with about 30TB of usable storage and 128 sockets would cost about $500,000, while eight-cabinet systems with 140TB of usable storage and 1,024 sockets hit the $3 million range.
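The cabinet math above is easy to verify; here is a quick sketch (our illustration, using only the counts in the article):

```python
# XC30-AC scaling: 16 blades per cabinet, 8 sockets per blade,
# one 8-core Intel Xeon per socket, 1 to 8 cabinets.
BLADES_PER_CABINET = 16
SOCKETS_PER_BLADE = 8
CORES_PER_SOCKET = 8

def sockets(cabinets):
    return cabinets * BLADES_PER_CABINET * SOCKETS_PER_BLADE

def cores(cabinets):
    return sockets(cabinets) * CORES_PER_SOCKET

print(sockets(1), cores(1))  # → 128 1024
print(sockets(8), cores(8))  # → 1024 8192
```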

To begin with, the XC30-AC supports only Intel Xeon processors with Sandy Bridge architecture. Those will be updated to server-class Ivy Bridge chips later on. Nvidia GPUs and Intel Xeon Phi chips will become available as co-processors by the end of the year, Cray said.

While XC30-AC systems will be smaller than traditional supercomputers, the Cray Aries interconnect makes them incredibly fast, Bolding said. He noted that Ethernet interconnects generally aren't fast enough for the world's fastest supercomputers. InfiniBand has really taken off, being used in about half of the top 100 systems and two of the top 10. But the top five systems in the world all use custom, proprietary interconnects such as Cray's or IBM's.

Aries supports injection bandwidth of 10GB/s and 120 million get and put operations per second. "Injection bandwidth" is less than the full system's bandwidth. As a paper on the interconnect's architecture notes, the "global bandwidth of a full network exceeds the injection bandwidth—all of the traffic injected by a node can be routed to other groups."

Latency is another key factor, with Aries providing point-to-point latency of less than a microsecond, Bolding said. Moreover, latency remains strong when a cluster is going full blast. "When a system is very busy and sending messages from one end of the machine to another across a fully loaded network where everything's working at once, Cray's latencies are literally almost as good as they are in point to point. They go up to around two or three microseconds," Bolding said.

The speed allows memory to be shared across processors. "No matter how many nodes you have you can actually treat it as if it's a shared memory machine, every node can talk to every other node, directly into the memory of that compute node," Bolding said. "That's something that is very powerful for certain types of applications and programming models."

Aries also features more sophisticated network congestion algorithms than the previous generation, preventing messages from getting backed up during times of high usage.

As for software, XC30-AC comes with the SUSE-based Cray Linux Environment also used in Titan, allowing customers to run almost any Linux application, Bolding said. While some of Cray's other systems are designed to run any form of Linux a customer wants, the XC30-AC comes with software optimized for the system. This allows it to be ready to go shortly after it comes out of the box, instead of requiring a week of setup.

Who will buy an entry-level supercomputer?

Cray isn't the financial success it once was, with its latest earnings announcement showing a year-over-year drop in quarterly revenue from $112.3 million to $79.5 million. The company also experienced a net loss of $7.6 million. Cray fared better in fiscal 2012, with full-year revenue of $421.1 million and net income of $161.2 million.

High-performance computing revenue is on the rise, with supercomputing products ($500,000 and up) leading the way according to IDC. HPC and supercomputing revenue is growing faster than the server market as a whole.

Cray is hoping to take its share of that revenue by selling both the smallest and largest supercomputer-class systems. While the XC30-AC was just announced today, it's been shipping for a few weeks. Early customers include an unnamed "Fortune 100" commercial electronics firm whose R&D department needs a powerful machine for simulations.

The oil and gas industry has a need for such machines to model oil fields. Biotechnology, engineering, and various manufacturing industries may provide interested customers as well, Cray says.

We've written about the trend of Amazon and other cloud services being used for supercomputing, with one-off jobs costing up to several thousand dollars an hour. Those are generally for customers that have only occasional need for a supercomputer, however. Many businesses would use a supercomputer often enough that owning one is more cost-efficient. Cray is betting a lot of Fortune 500 companies and universities that can't afford giant clusters costing tens of millions of dollars will be interested in systems like the XC30-AC.

"The complexity of problems that mid-range customers, technical enterprise customers face today are becoming so complex that they do need a tightly integrated supercomputer," Bolding said. "They can't always get away with a more conventional Ethernet cluster."

55 Reader Comments

A teraflop represents a thousand billion floating point operations per second, while a petaflop is a million billion operations per second.

Is it just me, or do people prefer reading large numbers as "thousands of billions" instead of using scientific notation? It seems a little obtuse to be describing them as such instead of a simple 10^12.

A teraflop represents a thousand billion floating point operations per second, while a petaflop is a million billion operations per second.

Is it just me, or do people prefer reading large numbers as "thousands of billions" instead of using scientific notation? It seems a little obtuse to be describing them as such instead of a simple 10^12.

Thousands of billions sounds more impressive than scientific notation. It conveys the fact that this will play Crysis.

A teraflop represents a thousand billion floating point operations per second, while a petaflop is a million billion operations per second.

Is it just me, or do people prefer reading large numbers as "thousands of billions" instead of using scientific notation? It seems a little obtuse to be describing them as such instead of a simple 10^12.

I was thinking the same thing... The words are becoming cumbersome to the point of being useless, and scientific notation would be so much easier and more efficient... This is a quasi-scientific forum, so I would think most of the readership would be familiar with this notation, or be able to easily learn and understand it.

A teraflop represents a thousand billion floating point operations per second, while a petaflop is a million billion operations per second.

Is it just me, or do people prefer reading large numbers as "thousands of billions" instead of using scientific notation? It seems a little obtuse to be describing them as such instead of a simple 10^12.

Yes, "thousand billion" is stupid; but "million million" is all right. In the good old days, the BBC used to talk that way to avoid ambiguity. "Billion" could mean either 10^9 (American) or 10^12 (old-style British, similar to other European languages).

Is it just me, or do people prefer reading large numbers as "thousands of billions" instead of using scientific notation? It seems a little obtuse to be describing them as such instead of a simple 10^12.

I was thinking the same thing... The words are becoming cumbersome to the point of being useless, and scientific notation would be so much easier and more efficient... This is a quasi-scientific forum, so I would think most of the readership would be familiar with this notation, or be able to easily learn and understand it.

Actually I don't like scientific notation. "2*10^6 atoms" is much less clear that "2 million". But after you get to about 10^9 things flip. I think it is high time more people learned the Mega, Tera, Peta, ... system. So good on Ars for trying to teach it. But on a geek forum like this, maybe it should just be taken as assumed knowledge.
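For reference, the prefixes the thread keeps circling can be pinned down in a couple of lines (our illustration, not the commenter's):

```python
# SI prefixes as powers of ten.
prefixes = {"mega": 10**6, "giga": 10**9, "tera": 10**12, "peta": 10**15}

# The article's phrasing checks out against the (American) short scale:
assert 10**3 * 10**9 == prefixes["tera"]   # "a thousand billion" = tera
assert 10**6 * 10**9 == prefixes["peta"]   # "a million billion" = peta
```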

This article (and most of the press-release-regurgitating articles around the Web) gets the processor wrong; these systems use Xeon E5-2600 series chips (Sandy Bridge-E), not the rather ancient Xeon 5400.

This isn't actually that much of a premium over the classic home-made cluster of E5-2600 1U dual-socket machines with FDR Infiniband connectivity ($6000 for each node, $650 for the network card, $250 for the switch port); I think you're paying Cray about 10% extra and getting a much more physically compact machine.

This article (and most of the press-release-regurgitating articles around the Web) gets the processor wrong; these systems use Xeon E5-2600 series chips (Sandy Bridge-E), not the rather ancient Xeon 5400.

This isn't actually that much of a premium over the classic home-made cluster of E5-2600 1U dual-socket machines with FDR Infiniband connectivity ($6000 for each node, $650 for the network card, $250 for the switch port); I think you're paying Cray about 10% extra and getting a much more physically compact machine.

But is it a problem that the thing is more closed? Does it make upgrades or upkeep more expensive? Does it somehow make you come back for a million-dollar support contract? Or maybe that sort of thing is going to happen to you anyway.

128 processors per (I assume) 42U cabinet seems like lower density. I don't have a good reference point for systems like this. And I understand that it is air cooled. Do any other readers know how many high end processors you can jam into a 42U cabinet?

It's been sold to SGI, then to Tera .. the only thing the same by now is the name.

Also, doesn't this read kinda like a press release to you guys?

As much as I have huge respect for old-school Cray, and the magnificent plumbing-and-wiring fu that went into their hardware (plus the cool chassis designs), this hardware is as distant from the original Cray as the corporate structure is.

Between AMD and Intel on the CPU side, and now the GPU compute crew, it hasn't been economic for anyone (except a few IBM Power systems here and there) to actually do processors anymore. I think Cray's last in-house processor was some sort of vector accelerator add-on in 2007ish.

What Cray sells you now is interconnect (along with system integration). If you have a relatively loosely coupled problem and can get away with InfiniBand or 10GbE, then blades, or even boring 1U commodity boxes, have exactly the same compute punch. If you need tighter coupling, or a single system image across a whole lot of sockets, though, Cray has the stuff that makes that work.

FYI, the injection rate quoted in the article is incorrect. It should be 10GB/s or 80Gbps, according to the cited paper. For sustained bidirectional traffic, the throughput slows to 7.5GB/s, still significantly faster than today's 56Gbps InfiniBand or 40G Ethernet network cards.

It would be a shame if Cray's fastest network were no better than 10G Ethernet.

The new XC30-AC systems announced today range in price from $500,000 to roughly $3 million, providing speeds of 22 to 176 teraflops. That's just a fraction of the speed of the aforementioned world's fastest supercomputer, the $60 million Titan, which clocks in at 17.59 petaflops. (A teraflop represents a thousand billion floating point operations per second, while a petaflop is a million billion operations per second.)

Yes, it's a fraction of the speed of Titan, but Titan is actually cheaper per FLOP by a lot.

Let's do the math, shall we? The XC30-AC at its cheapest is 22 teraflops for $500k. You get 8x the performance for 6x the money on the $3 million XC30.
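The commenter's point holds up; dividing price by peak teraflops (our sketch, using only figures from the article) gives:

```python
# Price per teraflop, from the article's numbers.
systems = {
    "Titan":              (60_000_000, 17_590),  # ($ price, teraflops)
    "XC30-AC entry":      (500_000, 22),
    "XC30-AC 8-cabinet":  (3_000_000, 176),
}
for name, (price, tflops) in systems.items():
    print(f"{name}: ${price / tflops:,.0f}/TFLOPS")
# Titan comes out to roughly $3,400/TFLOPS, versus about $23,000/TFLOPS
# for the entry XC30-AC and $17,000/TFLOPS for the 8-cabinet system.
```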

128 processors per (I assume) 42U cabinet seems like lower density. I don't have a good reference point for systems like this. And I understand that it is air cooled. Do any other readers know how many high end processors you can jam into a 42U cabinet?

For simplicity I'm going to refer to processor counts per system as 1S, 2S, and 4S for 1, 2, and 4 sockets.

Well, typical 2U systems can do up to 4S. You've got blades where you could put 8 4S blades in 10U, 16 2S blades in 10U, or 32 2S blades in 10U. You've also got some "cloud"-type converged architecture systems where you could put 8 2S "sleds" in 4U.

The highest density that I can see is 64 CPUs per 10U with Dell quarter-height blades; you'd be able to do 256 CPUs in a 42U cabinet. Though powering and cooling might be an issue for most datacenters, as each chassis can be configured with up to six 2700W power supplies. 64kW is just a LOT for a rack.
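The commenter's 64kW figure is easy to reproduce (our sketch, assuming four 10U chassis fill the 42U rack):

```python
# Worst-case rack power for the Dell blade scenario described above:
# four 10U chassis in a 42U rack, six 2700W supplies per chassis.
chassis_per_rack = 42 // 10      # 4 chassis fit
watts_per_chassis = 6 * 2700     # fully populated PSU bays
rack_watts = chassis_per_rack * watts_per_chassis
print(rack_watts)  # → 64800, i.e. ~64kW per rack
```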

128 processors per (I assume) 42U cabinet seems like lower density. I don't have a good reference point for systems like this. And I understand that it is air cooled. Do any other readers know how many high end processors you can jam into a 42U cabinet?

With 1Us you'd have ~80 processors per rack; with blades you could have up to ~160. Fitting 128 and the Aries interconnect seems reasonable.

We had a Cray sales critter in recently to tell us what they had going on with big storage and compute. It was the same stuff we were doing from scratch (Lustre, InfiniBand, MPI) except 4x as expensive.

It's been sold to SGI, then to Tera .. the only thing the same by now is the name.

Also, doesn't this read kinda like a press release to you guys?

I'm sure old Seymour would be devastated and is rolling in his grave over the fact that the company with his name on it still sells the fastest computers in the world.

Cray's genius was always in seeing a whole system for the supercomputer and designing accordingly. Though sort of like Einstein and quantum theory, he was very late in accepting the role of MPP in supercomputing.

I'm enough of a geek that I thought it was really cool that my company was using the old CDC datacentre where the CDC 6600 (direct predecessor of the Cray-1) was designed and deployed.

XC30-AC systems ship with Intel Xeon 5400 Series processors, and it's the first Intel-based supercomputer Cray is selling into smaller businesses, what it calls the "technical enterprise" market.

Judging from the picture, Cray is using E5-2600 series Xeons on socket LGA 2011.

Going by the block diagram, the system could double the number of sockets: it is possible to configure four socket LGA 2011 processors into one logical node. This would also double memory capacity. Physically fitting all of these chips and the necessary cooling into a nice chassis is another matter.

While Gemini was built on top of Hypertransport, Aries is built on top of PCI express. I don't know how that works out, but I'd take Gemini over Aries, since it's one step closer to the CPUs.

Specialized companies like Numascale also have solutions on top of Hypertransport so you can wire racks of amd machines together to act as one. Fascinating tech, as long as your codes do know what NUMA is and how to act in such environment.

XC30-AC systems ship with Intel Xeon 5400 Series processors, and it's the first Intel-based supercomputer Cray is selling into smaller businesses, what it calls the "technical enterprise" market.

Judging from the picture, Cray is using E5-2600 series Xeons on socket LGA 2011.

Going by the block diagram, the system could double the number of sockets: it is possible to configure four socket LGA 2011 processors into one logical node. This would also double memory capacity. Physically fitting all of these chips and the necessary cooling into a nice chassis is another matter.

For four-socket Intel you need to move to the E5-46xx series; the E5-26xx series doesn't work in 4S.

While Gemini was built on top of Hypertransport, Aries is built on top of PCI express. I don't know how that works out, but I'd take Gemini over Aries, since it's one step closer to the CPUs.

Specialized companies like Numascale also have solutions on top of Hypertransport so you can wire racks of amd machines together to act as one. Fascinating tech, as long as your codes do know what NUMA is and how to act in such environment.

I can't speak to the veracity of the Gemini/Aries foundations, but I would note that as of Nehalem Intel processors integrate the PCIe controller into the processor.