Cavium Thunder Rattles Xeon

SAN JOSE, Calif. — Cavium will try to drive ARM SoCs into mainstream servers, challenging Intel's Xeon x86 with a family of 28 nm devices using up to 48 2.5 GHz custom 64-bit ARM cores.

The networking specialist announced its Thunder family of products at the Computex event in Taiwan. The venue is significant because an increasing share of datacenter servers are built by Taiwan companies such as Foxconn and Quanta.

Thunder chips dissipate 20 to 95 watts for SoCs that contain multiple Ethernet controllers and other system elements, compared to more than 100 W for the Intel Xeon processors alone. In addition, Cavium will enable dual-socket designs that link 96 cores with a coherent processor interface it is already using in its high-end networking chips.

However, Cavium has yet to share performance figures for the chip, which is not yet in first silicon. It expressed confidence it will sample multiple Thunder products before the end of the year and have chips in volume production early next year, in part because as much as 70% of the designs use its existing silicon blocks.

Thunder marks a step beyond the so-called microserver market. To date, most ARM server SoCs have targeted this low-power system concept, which Intel is addressing with a new line of Atom-based server SoCs.

"Cavium is really able to position this chip against the heart of Intel's Xeon product line, and that will be much more interesting to customers because that's where the bulk of the server market is," says Linley Gwennap, principal analyst of The Linley Group, in Mountain View, Calif.

"The Xeon E5 is probably 80% of Intel's server business, so Cavium is hitting the heart of the server market rather than playing around in the fringes as the early ARM chips are doing," Gwennap notes.

Two Thunder processors can share a coherent link, creating a 96-core node sharing up to a TByte of memory.

Currently, AMD is shipping an eight-core ARM server SoC, and Applied Micro is rolling out a similar part. Marvell got some early design wins with its 32-bit Armada products but hasn't discussed any 64-bit plans yet.

"I think Cavium is head-and-shoulders above what other ARM partners are doing," Gwennap told us. "Broadcom [which announced its intentions last year] has a shot at the same performance levels, but it seems to be 6 to 12 months behind, and it hasn't announced products."

Broadcom said in October it will roll out SoCs with quad-threaded, quad-issue, out-of-order cores made in a FinFET process, presumably the 14/16 nm node. That process won't be broadly available from foundries until sometime next year, suggesting Broadcom is following its usual strategy of waiting to enter a market until it matures.

It's still early days for ARM servers. Startup Calxeda closed shop in January after pursuing 32-bit SoCs. Recent reports say Samsung killed an ARM server project, although Samsung sources did not respond to questions. The Korean giant has hired several server processor experts from AMD and may continue to have long-term goals in the area.

"I don't think there will be huge revenues [for ARM servers] this year, and even next year will be an early ramp," says Gwennap. "Maybe it's 2016 before there's a lot of revenue there -- it will take time for the infrastructure and software base to develop."

Taiwan motherboard maker Gigabyte will take part in Cavium's Thunder event in Taiwan. It made reference designs Cavium will ship toward the end of the year. No other ARM or x86 server makers that might be Thunder users are slated to take part in the event.

The debate on ISAs is interesting. I have designed x86 CPUs and other ISAs as well. It is a fact that x86 is inherently more complex than MIPS or ARM or PowerPC to varying degrees. There is certainly the CISC instruction decode penalty, but there are other complex mechanisms that have been built into x86 over generations which still need to be supported by the latest x86 processors. All of these mechanisms take die size and/or complexity. Almost every implementation of an x86 CPU has a built-in microcode engine. This is like a programmable engine within the CPU to handle these complex tasks. Intel has continued to stress floating-point performance, and each generation adds additional instructions, adding transistors to the design.

So why is this relevant? This "overhead" becomes smaller in very high-performance implementations: out-of-order, multi-threaded, large-cache designs. Here the overhead can be amortized over the performance gains of a complex CPU. This is why Intel has competed well in very high-end compute but failed in the low-power, efficient designs that are required for mobile.

In these less complex implementations where the CPU has fewer transistors, this overhead starts to make a difference. This is why the mobile processors from Intel and even the Atom cores have not competed so well.

Well it's obvious you've never looked in detail at the complexity of the x86 ISA. The overheads of x86 affect the whole microarchitecture. With an identical microarchitecture x86 would end up slower (and thus less power efficient). For x86 to achieve the same performance as a RISC, it needs a far more complex microarchitecture, increasing die size and power. You can compare die sizes for various ARM and x86 CPUs here: http://chip-architect.com/news/2013_core_sizes_768.jpg

The claim that x86 has a dense encoding is yet another myth. In fact the complex encoding means that x86 binaries are typically a little larger than ARM binaries, and significantly larger than Thumb-2. x64 is usually 15% larger than x86.

Yes I've read that paper and discussed it in detail on RWT. It is a badly written paper with most of the conclusions not supported by evidence. If you choose to compare wildly different and relatively ancient CPUs, an old compiler and completely ignore the memory system then of course the only possible conclusion is that microarchitecture matters the most! But that's only true if you make wild extrapolations and ignore or handwave at all other aspects. Let's hope this paper was a one-off mistake and doesn't reflect on the quality of papers coming from this university.

Note PPC is certainly not CISC. Neither is ARM or Thumb. PPC vs ARM is less interesting as their ISA features are nearly identical (not that there aren't differences but the differences tend to be insignificant details).

1. It's a myth that ISA overhead is just in decode. There are many aspects of an ISA that affect the overall microarchitecture. Just to mention one example, x86 requires more load/store units due to having fewer registers and load+op instructions. x86 also uses a more complex memory ordering model.

2. Given they designed their own CPU it seems likely Cavium are aiming for better than Cortex-A57 performance, as otherwise they could have just licensed that (the same argument applies to X-Gene). A 3-way in-order is not completely implausible, but to get decent throughput it would need to be at least 2-way and ideally 4-way multithreaded.

4. If all else is equal, an identically performing x86 would use more power than ARM due to its more complex ISA. So the x86 ISA really is LESS efficient. Of course different processes, microarchitectures etc can mitigate this difference.

In any case there is no doubt a dedicated CPU can outperform a generic Xeon despite having a process disadvantage (as you say in point 5). Beating Xeon on single-threaded performance is much harder of course, but that is not something Cavium or X-Gene are attempting (at least with their current line-up). For many tasks, using more, slower cores is actually far more energy efficient.
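The "more, slower cores" point can be illustrated with a rough back-of-the-envelope sketch. It assumes the textbook dynamic-power relation P ~ C * V^2 * f, assumes supply voltage can be scaled down roughly in proportion to clock frequency, ignores static leakage, and assumes the workload scales perfectly across cores. None of these assumptions come from Cavium or Intel data, and the function names are hypothetical.

```python
def relative_dynamic_power(n_cores, freq_scale):
    """Dynamic power relative to one core at full clock.

    Assumes P ~ C * V^2 * f per core and V proportional to f
    (an idealization; real chips have a voltage floor and leakage).
    """
    voltage_scale = freq_scale  # assumption: voltage tracks frequency
    per_core = voltage_scale ** 2 * freq_scale
    return n_cores * per_core


def relative_throughput(n_cores, freq_scale):
    """Throughput relative to one core at full clock.

    Assumes a perfectly parallel workload.
    """
    return n_cores * freq_scale


# One fast core versus two cores at half the clock.
one_fast = (relative_throughput(1, 1.0), relative_dynamic_power(1, 1.0))
two_slow = (relative_throughput(2, 0.5), relative_dynamic_power(2, 0.5))
print("one fast core :", one_fast)
print("two slow cores:", two_slow)
```

Under these idealized assumptions the two half-speed cores match the throughput of the single fast core at about a quarter of the dynamic power. Real workloads with serial sections, and real silicon with leakage and a minimum operating voltage, would narrow that gap.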

@Servernut: Applied definitely got out there early. I have written 3-4 stories about them so far. I am not at Computex so would love to hear the latest. For a while they have been in Cavium's spot: we have been waiting for them to ship and report performance specs. Anyone have an update on that?

@Rick. Cavium's Octeon designs are not out-of-order but in-order. Thunder is likely to be the same. All other server CPUs - Xeons, Opterons, even X-Gene are fully out-of-order machines.

Actually X-Gene from Applied Micro is completely missing from your post. They showed a mini-datacenter running at Computex this week. Any reason you are not covering them? They seem to be shipping already. From the specs they seem to have everything that Thunder is claiming and to be a few years ahead.

Now I'd like to hear some reality about where we are at and need to be at in server software for ARM if this broader initiative, of which Cavium is just one part, is going to get traction. Details, please!