TI fuels up KeyStone II ARM for HP Moonshot hyperscale servers

Hewlett-Packard is putting more ARM server processor options into its next generation of "Project Moonshot" hyperscale servers - the latest one coming from Texas Instruments, which has been relatively quiet on the server front but plenty active in the ARM chip market at large.

In a blog post, Tim Wesselman, senior director of ecosystem strategy for the HyperScale Business Unit at HP, said that TI has joined the Moonshot PathFinder partner program and would be figuring out how to put its KeyStone II variants of the ARM RISC processor into Moonshot boxes for "large-scale, concurrent real-time processing of cloud and traditional telecommunications workloads."

Like all of the exciting ARM server processors either out the door or in the works from Calxeda, Marvell, and Applied Micro Circuits, the KeyStone II chips are not just ARM processors tricked out to run server workloads, but they also include integrated networking that is used to lash multiple server nodes together into a fabric.

While Advanced Micro Devices has not announced its plans, it seems likely that a future Opteron-branded ARM processor will include on-chip network adapters and links to the SeaMicro "Freedom" fabric at the very least, if not a distributed switch architecture like Calxeda has cooked up. If not, AMD should just not bother.

The word on the street is that HP will use the TI KeyStone II chips in its second-generation "Gemini" Moonshot servers, which the company previewed last summer. During that preview last June, the only processor that HP talked about was Intel's "Centerton" Atom S1200 server chip, which was announced in December last year, and it never mentioned ARM processors, not even once. (Funny that.)

HP has not provided much in the way of feeds or speeds for the Gemini machines except that they will use the two-core Atom S1200 processor, which has 64-bit processing, supports VT virtualization assist and ECC scrubbing on main memory, and is certified to run server variants of Linux and Windows.

HP's Project Moonshot Gemini enclosure

From this meager picture, it looks like the Gemini chassis is around 10 rack units high and will have two bays into which server "cartridges" will be loaded. That is the full extent of anyone's knowledge of Gemini machines from any public statements, and HP did not say, as some chatter suggests, that the TI KeyStone ARM processors will be used in the future Gemini chassis.

It has not said, either, how the Gemini machines will stack up compared to the "Redstone" Moonshot boxes that HP launched in November 2011 using the 32-bit Calxeda ECX-1000 ARM chips, which include an on-chip distributed Layer 2 switch that is very clever.

Because of Calxeda's long-time work with HP on Redstone and the fact that Calxeda's products are in the HP Discovery Labs now, Calxeda ARM chips will very likely be in the Gemini machines, but neither Calxeda nor HP would confirm this.

It is entirely possible that another Moonshot machine – perhaps Saturn or Apollo, depending on whether HP is going to use the booster or the capsule name – is next and will be based on Open Compute's Group Hug microserver backplane and form factor. That could be where current and future ARM servers, and maybe even future Atom, Xeon, and Opteron servers, will be used.

Those of us outside of the HP Discovery Lab's NDAs – or those given to potential customers for Moonshot boxes – just don't know. (And if you do, please, do tell.)

What El Reg can discuss are the neat features of the KeyStone II system-on-chip (SoC) designs. TI has cooked up plain-vanilla Cortex-A15 processors, which have two or four 32-bit cores with 40-bit memory addressing (known as the Large Physical Address Extension in the ARM world), as well as hybrid ARM processors that mix anywhere from one to four Cortex-A15 cores with from one to eight TMS320C66x digital signal processors in a single piece of silicon.
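To put that 40-bit addressing in perspective, here is a quick back-of-the-envelope sketch of how much physical memory each address width can reach. The function and constant names are made up for illustration; the only inputs are the 32-bit and 40-bit widths mentioned above.

```python
# Hypothetical helper: bytes of physical memory reachable with a given address width.
def addressable_bytes(address_bits: int) -> int:
    return 2 ** address_bits

GIB = 2 ** 30  # one gibibyte

plain_32bit = addressable_bytes(32)  # classic 32-bit physical addressing
lpae_40bit = addressable_bytes(40)   # 40-bit physical addressing on the Cortex-A15

print(plain_32bit // GIB, "GiB reachable with 32-bit addresses")
print(lpae_40bit // GIB, "GiB reachable with 40-bit addresses")
print(lpae_40bit // plain_32bit, "times more physical address space")
```

The extra eight address bits take the ceiling from 4GiB to 1TiB, which is the difference between a phone-class memory map and something a server node can live with. How much memory a given KeyStone II part can actually attach is a separate question that this arithmetic does not answer.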

Block diagram of the KeyStone II system-on-chip from Texas Instruments

The interesting bit is that these ARM-DSP hybrids are using the same TMS320C66x DSP elements – and using the same TeraNet coherency network to lash them into an SoC – that TI was peddling as coprocessors for x86 iron back at the SC11 supercomputing event a little more than a year ago.

The architecture for the DSPs and the ARM chips is exactly the same and is known by the same KeyStone name, too. The difference now is that they can have ARM cores etched on them if you want, or if you don't want any DSPs at all and just ARM cores, then TI is fine with that, too.

TI can dial back the DSP count on the hybrid ARM-DSP chips to go after video, IP camera, traffic system, voice gateway, and medical device applications, and dial it up for heavier workloads like supercomputing, video conferencing, image processing and analytics, medical imaging, and even virtual desktop infrastructure.

Eight of those DSPs can offer around 1 teraflops of floating point performance at single precision and around 384 gigaflops at double-precision, and the next-generation DSPs from TI are expected to do a lot better.

The thing to watch is the performance per watt. The plan was to be able to do 2 teraflops at 200 watts, and toss in a few ARM chips and you have a pretty interesting module for a ceepie-deepie supercomputer.
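The arithmetic behind those figures is simple enough to sketch. The snippet below only divides the approximate numbers quoted above; the variable names are illustrative, and it is an assumption that the 2 teraflops target and the eight-DSP ratings can be compared directly, since the precision of the 2 teraflops figure is not stated.

```python
# Approximate figures quoted in the article (all values are rounded).
DSP_COUNT = 8
SP_GFLOPS_TOTAL = 1000.0  # ~1 teraflops single precision across eight DSPs
DP_GFLOPS_TOTAL = 384.0   # ~384 gigaflops double precision across eight DSPs
PLAN_GFLOPS = 2000.0      # planned 2 teraflops (precision not stated)
PLAN_WATTS = 200.0        # planned power envelope

sp_per_dsp = SP_GFLOPS_TOTAL / DSP_COUNT    # single-precision gigaflops per DSP
dp_per_dsp = DP_GFLOPS_TOTAL / DSP_COUNT    # double-precision gigaflops per DSP
gflops_per_watt = PLAN_GFLOPS / PLAN_WATTS  # target efficiency

print(f"{sp_per_dsp:.0f} gigaflops SP per DSP")
print(f"{dp_per_dsp:.0f} gigaflops DP per DSP")
print(f"{gflops_per_watt:.0f} gigaflops per watt at the planned target")
```

That works out to roughly 125 gigaflops single precision and 48 gigaflops double precision per DSP, and a target of 10 gigaflops per watt.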

The DSPs run at up to 1.2GHz and have 1MB of their own SRAM Level 2 cache, and the two or four Cortex-A15 ARM cores share a 4MB L2 cache with each core having 32KB of L1 instruction and 32KB of L1 data cache. The ARM cores run at up to 1.4GHz and have ECC scrubbing on all of their caches, which is important for server workloads; the DSPs have soft error protection only.

The KeyStone II family of ARM Cortex-A15 processors

What is also important is that the KeyStone II processors have an integrated Ethernet switch right there on the chip. Presumably this switch will be able to link SoCs together in a switched fabric as Calxeda has done with its processors.

But it may not have as much oomph, since it is, according to the specs, only a five-port Gigabit Ethernet switch; one port faces the computing elements and four ports face out of the SoC to the outside world.

Hopefully, it is possible in software to create a flat Layer 2 fabric out of multiple SoCs and their integrated Gigabit Ethernet switches to make minimalist, dense-pack clusters. The network accelerator on the KeyStone II chip runs at 1Gb/sec wire speed and can handle 1.5 million packets per second of throughput, which could also be very useful for lots of cloudy and hyperscale workloads. ®
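For what it is worth, 1.5 million packets per second is about what a Gigabit Ethernet link can carry when every frame is the minimum size. The sketch below works that out from standard Ethernet framing overheads; the packet size behind TI's figure is not given here, so treating it as a minimum-size-frame number is an assumption, and the function name is invented for illustration.

```python
LINK_BITS_PER_SEC = 1_000_000_000  # Gigabit Ethernet line rate
PREAMBLE_SFD_BYTES = 8             # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12          # mandatory idle time between frames

def max_packets_per_sec(frame_bytes: int, link_bps: int = LINK_BITS_PER_SEC) -> float:
    """Theoretical frame rate at line speed for a given Ethernet frame size."""
    wire_bytes = frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES
    return link_bps / (wire_bytes * 8)

# 64 bytes is the minimum Ethernet frame; 1518 is the standard untagged maximum.
print(round(max_packets_per_sec(64)), "packets/sec with 64-byte frames")
print(round(max_packets_per_sec(1518)), "packets/sec with 1518-byte frames")
```

Minimum-size frames come out at just under 1.49 million packets per second, so a 1.5 million figure is consistent with line rate at the worst-case packet size on a single Gigabit port.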