Cray's Baker pops out of the oven as company "re-learns" how to make great systems

It’s been nearly a month since Cray took the code name away from Baker, announced its official designation as the XE6, and made it an “official” product (although it is not yet shipping). This is an important launch for a company that doesn’t have much room for error.

Cray has hundreds of millions of dollars tied up in orders for a product that isn’t scheduled to ship until Q4 of this year, when the final silicon for its new interconnect switch will be complete. Many of those orders are deals worth $40 million or more, with substantial penalties for late delivery. Previous iterations of the XT line have mostly been refinements to earlier designs, but Baker is a significant change of technology with a lot of new system software to go with the updated silicon. Delays are obviously a big part of the risk (some bug that prevents the company from shipping anything), but in some ways an even worse scenario is a product that ships and appears to be working until it is assembled and tested at large scale, leaving customers hanging while the company scrambles to debug in the field.

Just before launch I talked with Barry Bolding, Cray’s veep for scalable systems (the high-end stuff, contrasted with the lower-end CX line), to get a feel from him about where the product is headed and what it means for the company’s bottom line.

Expecting twins

The big news with the XE6 is the interconnect. Earlier generations of Cray’s high-end, AMD-based systems were based on the SeaStar interconnect, a custom interconnect that ran on a chip of its own. That interconnect went through two major generations, through the XT6 line. The XE6 introduces a new interconnect called “Gemini” which is actually an early version of the Aries interconnect being fielded as part of Cray’s Cascade DARPA system (according to Bolding, Cray has now decided that interconnect chips are going to be named after constellations, in case you are curious about these things).

Gemini features two network ports (two, twins, “Gemini”), and supports the same 3D torus topology that the SeaStar has. This means that customers with XT5 and XT6 systems will be able to upgrade their cabinets to XE6s just by replacing the communication chip (but because of backplane limitations, XT3 and XT4 customers are out of luck). If you are planning an in-place upgrade, this is the generation for you: Cascade changes the interconnect topology entirely, and at that point your only upgrade options will involve 18-wheelers and a forklift.
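For readers who haven't worked with a 3D torus, here is a minimal Python sketch of what the topology means for a node's neighbors and for hop counts. The function names and the 4x4x4 dimensions are illustrative assumptions, not anything taken from Cray's software or from real cabinet layouts.

```python
def torus_neighbors(coord, dims):
    """Return the six nearest neighbors of a node in a 3D torus.

    coord and dims are (x, y, z) tuples; the dimensions are hypothetical.
    Each axis wraps around, so a node on the "edge" still has six links.
    """
    x, y, z = coord
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z),
        ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z),
        (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz),
        (x, y, (z - 1) % nz),
    ]


def torus_hops(a, b, dims):
    """Minimum hop count between two nodes, going the short way around each ring."""
    return sum(min(abs(p - q), n - abs(p - q)) for p, q, n in zip(a, b, dims))


if __name__ == "__main__":
    dims = (4, 4, 4)  # assumed size, for illustration only
    print(torus_neighbors((0, 0, 0), dims))
    print(torus_hops((0, 0, 0), (3, 0, 0), dims))
```

The wraparound links are what distinguish a torus from a plain 3D mesh: the node at position 0 on an axis is directly connected to the node at the far end of that axis.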

Current XT owners will appreciate Gemini for its added resilience, especially for its warm swap capability. If you lose a node you’ll be able to swap it out without rebooting the whole system to rebuild the network — a Good Thing. Gemini also adds a hardware-supported global memory space back into Cray’s product line. Those readers who developed for the T3E once upon a time will be right at home, with one-sided communications that don’t have to go through the OS. Bolding also says that future generations of the interconnect will include hardware support for collective operations, a feature that is becoming increasingly common in the network layer of commodity-built clusters.

According to Bolding, latency is “less than 2 microseconds,” and messaging rates on Gemini are about 100x those of SeaStar with about 150 million messages per second across the chip. Gemini also supports adaptive routing to move traffic away from congested routes, and uses a high radix router with multiple lanes per route between any two connections. This means that a hardware problem can shut down a lane but still leave the channel open.
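A quick back-of-the-envelope check puts those numbers in perspective. The sketch below uses only the figures Bolding quoted; the derived values are implications of those figures under the assumption that they all hold at the same time, which he did not claim, so treat them as rough bounds rather than measured results.

```python
# Figures quoted by Bolding
GEMINI_MSGS_PER_SEC = 150e6    # "about 150 million messages per second"
SPEEDUP_OVER_SEASTAR = 100     # "about 100x those of SeaStar"
LATENCY_UPPER_BOUND_S = 2e-6   # "less than 2 microseconds"

# Implied SeaStar message rate (derived, not quoted)
implied_seastar_rate = GEMINI_MSGS_PER_SEC / SPEEDUP_OVER_SEASTAR

# Average time between messages at the quoted Gemini rate, in nanoseconds
gap_ns = 1e9 / GEMINI_MSGS_PER_SEC

# Rate times latency gives an upper bound on messages in flight at once,
# assuming the peak rate and the latency bound apply simultaneously
max_in_flight = GEMINI_MSGS_PER_SEC * LATENCY_UPPER_BOUND_S

if __name__ == "__main__":
    print(f"Implied SeaStar rate: {implied_seastar_rate:.2e} msgs/s")
    print(f"Gap between messages: {gap_ns:.2f} ns")
    print(f"Messages in flight (upper bound): {max_in_flight:.0f}")
```

In other words, the quoted rate works out to a message every few nanoseconds, with on the order of a few hundred messages in flight at any moment if the chip is running flat out.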

So how’s it look?

“We already have several multiple-cabinet systems running in-house,” says Bolding when I ask him how confident he is in the new hardware. He goes on to say that while this is a good sign, it is still possible for unexpected things to happen at scale, which is consistent with the cautiously optimistic stance of the rest of Cray’s leadership about the launch.

But Cray’s solid record of innovation and phenomenal fiscal discipline must have built up a lot of goodwill with customers. “What I am most pleased about,” Bolding explains, “is the breadth of customer buy-in. There is a rich diversity in both the range of workloads and the size of systems that customers have ordered.” According to Bolding, Cray has booked many mid-sized (10-70 cabinet) systems already, in addition to the high-profile, high-dollar awards to the likes of the DOE and the DOD. This is important because bugs in these systems sometimes only show up at large scales; shipping smaller systems that work just fine will enable the company to realize some revenue from the new system even if problems do show up in the big boys. “This puts us in a totally different situation financially than we were in in 2008, when ORNL’s 200-cabinet system was the make-or-break deal for us.”

(Re)learning how to build great systems

For Bolding there is more to this diversity than just a little risk management for the 2010 financials. He feels the shift reflects a deeper change in the market’s acceptance of Cray’s technologies. “We’ve gone from 2-3 partners driving our business to having a diverse range of true customers, people who don’t want to help create great technology, but who want to just buy and use it,” he says.

He credits the shift to hard work under the covers on the way Cray gets things done inside the corporate walls. “We designed a good system beginning with the XT3 and XT4, but we listened carefully to those customers, and for the XT5 and XT6 systems we really focused on optimizing and improving both the hardware and software in those systems in response to feedback,” Bolding says. “Cray has added a lot around 6 Sigma and internal processes to make great systems; in many ways, as a company we have re-learned how to build great systems.”

As with the previous scalable systems, customers can expect a mid-sized XE6m system to be announced by Q1 2011. Gemini won’t be available right away on this system; however, Bolding explains that because the m’s top out at 6 cabinets, SeaStar is just fine for those systems. Once Gemini is released for the m series, customers will be able to upgrade their XT6m systems to XE6m.
