With no expected delays in Intel’s server and workstation “tick-tock” roadmap for the next several years, and with ARM replacing AMD as the chief competitor, what will Intel’s flagship Xeon chip look like?

This is the second part of a two-part series. Read the first part here.

Sometime in the summer of 2015, very likely close to IDF Fall 2015, Intel is expected to release Broadwell-EP, or Xeon E5-2600 v4, its first 14 nm Xeon processor. What might we see in there?

Continuing the strategy seen in Ivy Bridge-EP and Haswell-EP, Intel will not so much speed up the cores as pile more of them onto each die. Expect to see Broadwell-EP chips with up to 18 cores, in multiple die variants. One of them is expected to be an eight-to-10-core die for high-performance desktops and fast-core workstations, for situations where fewer cores running at speeds above 4 GHz (before Turbo, of course) are desirable.

Another is a 12-to-16-core die for enterprise servers and workstations; the workstation SKUs, which end at 12 cores by the way, will support up to 160 W TDP per socket.

Now, it’s possible that “native” Broadwell-EP will end at 16 cores, and that the 18-core variant will, again, come from Broadwell-EX. Do note that all of these will still have 2.5 MB of L3 cache per core, i.e. a humongous 45 MB of L3 in total for the 18-core part. The modification that Intel introduced in the Ivy Bridge-EP Xeons, which allows surplus cores to be shut off while their L3 cache slices stay in use (the E5-2687W v2 has 8 cores but 25 MB of L3), will likely apply here as well.
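The cache arithmetic above can be checked with a short sketch. It assumes only the 2.5 MB-per-core slice size quoted in the text; the function names are made up for illustration.

```python
L3_PER_CORE_MB = 2.5  # per-core L3 slice size quoted in the article


def total_l3_mb(cache_slices: int) -> float:
    """Total L3 when every slice stays active."""
    return cache_slices * L3_PER_CORE_MB


def slices_for_l3(l3_mb: float) -> int:
    """Number of 2.5 MB slices needed to reach a given L3 size."""
    return round(l3_mb / L3_PER_CORE_MB)


print(total_l3_mb(18))      # full 18-core part: 45.0 MB
print(slices_for_l3(25.0))  # a 25 MB part keeps 10 slices active
```

Read this way, the 8-core E5-2687W v2 with 25 MB of L3 keeps the cache of ten cores while running only eight of them.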

After all, having up to 18 cores feeding from a four-channel DDR4-2400 memory path would lead to occasional congestion without a good cache subsystem in between. DDR4-2400 will work with two slots per channel too, meaning that 16-DIMM configurations on dual-socket Broadwell-EP will, in most cases, be able to run the memory at DDR4-2400 speed, which provides for up to a terabyte of RAM at full speed.
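To put rough numbers on the congestion worry, here is a back-of-the-envelope sketch. It assumes a standard 64-bit (8-byte) DDR channel and theoretical peak rates only. The 64 GB DIMM size is an assumption inferred from the "16 DIMMs, one terabyte" figure, not a number stated in the text.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (assumes an 8-byte-wide channel)."""
    return channels * mega_transfers * bus_bytes / 1000


def capacity_gb(dimms: int, gb_per_dimm: int) -> int:
    """Total installed memory in GB."""
    return dimms * gb_per_dimm


per_socket = peak_bandwidth_gbs(channels=4, mega_transfers=2400)
print(f"per socket: {per_socket:.1f} GB/s")
print(f"per core with 18 cores: {per_socket / 18:.2f} GB/s")
# Assumption: 64 GB modules, which is what 16 DIMMs need to reach 1 TB.
print(f"16 DIMMs: {capacity_gb(16, 64)} GB")
```

Only a few GB/s of peak bandwidth per core is left when all 18 cores are busy, which is why the large L3 matters.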

Given DDR4’s latency issues, would it deliver a real performance boost? Intel’s internal benchmarks show gains of around 15-25 percent on SPEC and Stream from moving from DDR3-1866 to DDR4-2400 with all else kept the same, so yes, it seems to be worth it, not forgetting the power savings from 1.2 V memory operation.
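For context, the quoted gains can be set against the theoretical ceiling implied by the transfer rates alone. This is a minimal sketch using the nominal speed grades; the function name is hypothetical.

```python
def rate_uplift_percent(old_mts: float, new_mts: float) -> float:
    """Percentage increase in raw transfer rate between two memory speed grades."""
    return (new_mts / old_mts - 1.0) * 100.0


uplift = rate_uplift_percent(1866, 2400)
print(f"DDR3-1866 -> DDR4-2400 raw rate uplift: {uplift:.1f}%")
```

The raw rate rises by a little under 29 percent, so measured gains of 15-25 percent on bandwidth-heavy benchmarks sit plausibly below that ceiling.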

So, a dual-socket Broadwell-EP Xeon system would process up to 72 threads in parallel via HT, support in excess of 2 TB of DDR4 memory through 3DS LRDIMMs, and still have those famed 80 PCIe v3 lanes for any sort of peripheral you could dream of. Intel will, for its part, offer 40 GbE (yes, forty-gigabit Ethernet) chips as well as ‘Fultonvale’ 2 TB PCIe v3 SSD cards, besides, of course, PCIe v3 attachment of the Knights Landing Xeon Phi if you want to use it as a coprocessor here.
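The system-level totals follow directly from the per-socket figures. The sketch below assumes two hardware threads per core with HT and an even split of PCIe lanes between the sockets; the helper names are illustrative.

```python
def hardware_threads(sockets: int, cores_per_socket: int, threads_per_core: int = 2) -> int:
    """Hardware threads in the system (assumes 2 threads per core with HT)."""
    return sockets * cores_per_socket * threads_per_core


def lanes_per_socket(total_lanes: int, sockets: int) -> int:
    """PCIe lanes per socket, assuming an even split across sockets."""
    return total_lanes // sockets


print(hardware_threads(sockets=2, cores_per_socket=18))  # 72
print(lanes_per_socket(total_lanes=80, sockets=2))       # 40
```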

How about performance? In raw double-precision FP rate, depending on the actual clock and on whether Turbo runs on all cores, some of the 16-core and 18-core SKUs may come close to 1 TFLOPS per socket. That is still less than one-third of their contemporary Xeon Phi counterpart, but very impressive for a general-purpose CPU. In other applications, don’t expect that much improvement unless your application scales well with the extra cores. OLTP database work is expected to see substantial gains here, for instance.
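A rough way to see what "close to 1 TFLOPS" demands is the usual peak formula: cores times clock times FLOPs per cycle. The sketch assumes 16 double-precision FLOPs per core per cycle (two 256-bit FMA units, as on Haswell); that figure is an assumption carried over to Broadwell-EP, not something the text states.

```python
DP_FLOPS_PER_CYCLE = 16  # assumption: 2 FMA units x 4 doubles x 2 ops (multiply + add)


def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int = DP_FLOPS_PER_CYCLE) -> float:
    """Theoretical peak double-precision GFLOPS per socket."""
    return cores * clock_ghz * flops_per_cycle


def clock_needed_ghz(cores: int, target_gflops: float = 1000.0,
                     flops_per_cycle: int = DP_FLOPS_PER_CYCLE) -> float:
    """All-core clock required to reach a target peak rate."""
    return target_gflops / (cores * flops_per_cycle)


for cores in (16, 18):
    print(f"{cores} cores need {clock_needed_ghz(cores):.2f} GHz for 1 TFLOPS")
```

Under that assumption, 1 TFLOPS needs an all-core clock of roughly 3.5 GHz on 18 cores or 3.9 GHz on 16, which is why the outcome hinges on clocks and all-core Turbo.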

However, as a certain Sun Microsystems VP in Singapore said nearly 11 years ago when I looked at poor SPARC performance vs. everyone else: “computing is not about speed”.

There’s more to life than speed

Sun is now a footnote in the history books, but we can say that computing is not only about speed. So, the Broadwell Xeon generation will have a number of enhancements that improve the overall reliability and dependability of the system in virtualised datacentre, big data and HPC operations.

For instance, virtualisation features will see a big boost with the addition of page modification logging, posted interrupts, cache quality-of-service monitoring and enforcement, and memory bandwidth monitoring. On the manageability side, a processor trace capability will enable system and software debugging by tracing instruction execution, while hardware can control the extra performance states in case the OS doesn’t support them yet.

On security, there will be a substantial cryptography speedup of at least a quarter (which will apply to the Broadwell-E desktop CPU as well), supervisor mode access prevention to reduce common attack vectors at the OS level, and new compliance for random number generation.

The very same Grantley platform with the C610 series Wellsburg chipset that we see with Haswell-EP will also be used with Broadwell-EP, and no stepping update will be required. So, with a proper BIOS update, upgradeability seems to be assured. By the way, something new will surface in a limited way on the Grantley platform beginning with Haswell-EP and roll out fully with Broadwell-EP: support for NVDIMMs.

How will this impact Intel and its competition?

Even without a per-core performance boost, Haswell-EP and Broadwell-EP continue Intel’s impressive performance domination. The company seems to be able to put its semiconductor process benefits to the fullest use here, keeping per-socket performance advancing steadily on a yearly cycle again, from E5 v2 to E5 v3 and then E5 v4. We have yet to see whether the 2016 “Skylake” Xeon generation, with its AVX-512 extensions and other new benefits, will keep the same announcement pace.

But besides raw performance, Intel is adding features that only AMD, and even then only partly, had the experience to potentially implement on its enterprise Opterons, and that the up-and-coming server ARM vendors have little chance of knowing how to handle at this point.

Remember the situation 15 years ago? We had Alpha, HP-PA and MIPS as the performance leaders, all of them with both much higher performance and higher-end feature sets than x86, not to mention having been fully 64-bit for a long time by then versus the 32-bit-only x86. Yet they were brutally, and definitely unfairly, beaten out of the Western markets (the Chinese Alpha, Shenwei, does very well, by the way).

Now, look at ARM today. It’s barely getting 64-bit after over two decades of waiting since its 32-bit inception, its per-core performance will compete only with Atoms for the foreseeable future, and let’s not even talk about commercial enterprise applications or hardware reliability, virtualisation and other server features, at both the CPU and ecosystem level. Except for the planned AMD ARM/SeaMicro low-end interconnect integration, there isn’t even a proper system interconnect there. What’s the justification to bother?

Not to forget: ARM is still less efficient than MIPS or Alpha on power and performance, and those two are more entrenched in markets like China, even though ARM is trying to make its entry there via Huawei and some HPC players.

Emerging China

A platform like Shenwei, essentially fully funded by the Chinese military with no support worries or shareholders and directors to please, with 256-core Alpha CPU (not GPU) chips of up to four TFLOPS DP on its near-term roadmap, a well-developed hardware and software ecosystem, and Top 10 systems in place since 2011, would easily bury ARM’s high-end server efforts in China if its backers so wished.

If you look at this story, and at the performance and features that Xeon EP will bring out over the next two years, not forgetting its bigger Xeon EX brethren with even more datacentre special sauce, it becomes clear that any ARM competitor aspiring to take a bite out of the juicy Xeon CPU market, with its four-digit US$ prices, must do a whole lot first, at every level: from CPU to chipset, interconnects, system features, firmware and commercial software support.

Just the last one, mind you, can cost a couple of billion to get support from the top dozen enterprise software firms. Yes, the likes of Samsung, Qualcomm, Nvidia and Apple could afford this (AMD couldn’t, but its experience may be useful to one or all of these players), but would they agree to help each other on this, knowing that they compete quite viciously among themselves, too?