Yeah, I'm really curious about the form factor of a socketed variant. It sounds like it could be an extremely interesting combo with a dual-socket Xeon board for massively parallel workloads, perhaps even eating straight into NVIDIA and AMD's GPGPU lunch.

Probably as much as it did eat from their GPU lunches. AMD already has a part delivering 2.6 DP tflops at 28 nm. And knowing Intel, this chip will probably cost an arm and a leg, and a set of good kidneys.

DRAM doesn't dissipate very much power in the first place, so cooling it should not be a big challenge. Traditionally, most of the power burnt by high-bandwidth memory ICs goes into the large, high-power transceivers needed. 7 GHz GDDR5 is >50% transceiver power, IIRC. Stacked memory on the CPU substrate itself should require much less power... The Crystalwell eDRAM used in Haswell GT3e CPUs burns what, a watt or two at most? The tech used here is probably more advanced.

Very interesting architecture that I'll be interested in seeing the impact of. However, I'm wondering, with these massively parallel chips with up to 72 cores, if the x86 ISA isn't going to be a huge hindrance. It's still a fairly big part of the chip; if you look at, say, a Jaguar floorplan, it's about as big as an FPU, one of the biggest functional units on a CPU core. Multiply that by 72 and you're using a lot of die size and power just for ISA compatibility. I wonder if this is an area where starting from scratch wouldn't benefit Intel tremendously, lest ARM swoop in with its smaller ISA and do something better in less die area.

No, Itanium failed because nobody could make a good compiler for it, including Intel. VLIW sucks for general-purpose computing. It works for a few workloads (graphics, mainly) but on anything relying on memory performance it's worthless.

x86 decoding isn't hard - it's instruction reordering that takes up all that die space. Itanium tried to do all that at compile time, which doesn't work because the compiler can't know which data is cached and which isn't.

VLIW has been around for a while, and you can make some ridiculously fast programs with it... in assembly. Even AMD dabbled with it in their 5000-6000 series GPUs. But yes, compiling generalized problems is a mess, which is why Intel positioned Itanium for infrastructure and application-specific computing in recent years.

The Itanium business is still larger than all of AMD, hah. But yes, it didn't catch on enough to be ready to replace x86, and it was going in the opposite direction as far as I know: instead of reducing ISA complexity, it added some, with even longer instructions. Intel is more limited going down into low-power territory than up into high power; the more you shrink cores to fit low power budgets, the more the x86 ISA starts to look too large.

Having more complicated instructions is beneficial in the long run, especially when you break them into microcode. They can change how the microcode works with every generation, as opposed to ARM or other RISC designs, where they stay with the SAME instruction internally for years and years, which bottlenecks performance. It helps somewhat with power, but as you scale up that is irrelevant.

ARM has an instruction decode block too. It's probably not terribly far off from an areal perspective, at a given level of performance. It may have been the case back in the 80s/90s, but the differences between RISC and CISC are much more nuanced now.

RealWorldTech has an article from the year 2000 showing that while the differences between RISC and CISC had closed considerably, they still mattered. However, that was 14 years ago.

The ISA is basically meaningless today, in the context you're speaking of. The underlying architecture and its implementation is far, far more critical.

Take the Apple A7 vs. Intel's Silvermont. Despite being fairly close in performance, two Silvermont cores are significantly smaller than two Cyclone cores. The area difference worsens when you include the A7's L3 cache, which Silvermont designs do not rely on.

Even ARM to ARM comparisons can vary widely, even between custom implementations. The A7 is a dual core design that outpaces the quad core Krait 400. The area between the CPUs is fairly similar, although the A7 probably loses out by a small margin with the L3 cache included.

Back to RISC vs. CISC: the University of Wisconsin published a paper comparing the A9 against AMD's Bobcat and Intel's Saltwell Atom. Their conclusion: "Our study shows that RISC and CISC ISA traits are irrelevant to power and performance characteristics of modern cores."

True. I too am sick and tired of hearing the mentioning of ARM as this huge, inevitable looming monster on the horizon for Intel and proponents of x86. It isn't. Saying that Krait 400 = Cortex-A15 and Cyclone = A57 is like placing Bulldozer beside K10 and making direct comparisons. The different implementations of ARMv7 and ARMv8 in the ARM world are wholly different entities in terms of CPU and cache design.

I believe you have a rather valid point there: creating a massively parallel, SIMD-enhanced general-purpose CPU to compete with GPUs was certainly a valid exercise, because it could be productive and effective on a far wider range of problems, a far wider existing code base and a far wider population of programmers.

With ARM64 (or MIPS or any other new/clean 64-bit CPU design for that matter), more silicon real estate might be used to create additional cores. How much or how many, and how relevant that would be vs. the die area used for caches, I don't know. Perhaps yields could be improved, because a smaller complex-logic core means perhaps more cores for the same compute power, while less is lost to a defect.

What I could not gauge is how the code using these SIMD AVX-512F instructions would actually be written or coded these days: x86/AVX-512F assembler won't easily convert to similar AVX-512F instructions on ARM64, but high level language code just might--with the right compiler.

Because Intel doesn't just do the CPU but (I believe) provides compilers and libraries around them, they most likely have a lead of a couple of years against any direct ARM competition.

But these days it's become far too easy to add and use FPGA or other special-function IP blocks on ARM SoCs, and all of a sudden Intel might find itself in an arena with far more "knights" than they ever imagined.

They can't quite escape the fact that any fixed workload on a general-purpose architecture can be outperformed by a special-purpose one. More than ever, chips aren't "best" or "better"; their quality or fitness depends on the mix of problems they are used for.

The thing with ARM is that some instructions in earlier iterations (I'd have to check ARMv8) don't need to be decoded at all. That's where the RISC vs. CISC differences mainly come into play today: die size. The area savings from a simpler decoder add up. In the case of Knights Landing, how many cores could be added if the decoder were half its current size? 80 instead of 72?

That's the problem Intel is currently facing. Intel isn't using their high-IPC cores here but the ones typically found in Atom. ARM designs can reach similar levels of IPC, potentially at a smaller die area. Thus even if Silvermont wins on IPC, competing RISC-based SoCs could still have higher throughput due to more cores.

Intel does have an ace up their sleeve by having their own fabs. This enables them to produce larger dies and maintain similar raw die areas for CPUs by having a fab process advantage. However, that process advantage is not going to be long-lived, given the difficulties of future shrinks.

x86/x64 ISA compatibility really isn't the problem -- you could make the argument that these ISAs are large, and kind of ugly -- but their effect on die size is minimal. In fact, only between 1-2 percent of your CPU is actually doing direct computation -- nearly all the rest of the complexity is wiring or latency-hiding mechanisms (OoO execution engines, SMT, buffers, caches). Silvermont, IIRC, is something of a more direct x86/x64 implementation than is typical (others translate all x86/x64 instructions into an internal RISC-like ISA), so it's hard to say, but I think in general these days the ISA has very little impact on die size, performance, or power draw -- regardless of vendor or ISA, there's a fairly linear scale relating process size, die size, power consumption, and performance.

You'll undoubtedly see it, but the question is when. Stacked DRAM won't be debuting in consumer products until late this year at the earliest. AMD will likely be the first to market, with their partnership with Hynix for HBM. There was a chance Intel would put Xeon Phi out first, but with a launch in the second half of 2015, that's pretty much erased.

Still, initial applications will be for higher margin parts, like this Xeon Phi, and AMD's higher-end GPUs. It will be some time before costs come down to be used in the more cost-sensitive APU segment.

Well a slightly different type of stacked DRAM is already in wide use on consumer devices today: My Samsung Galaxy Note uses a nice stack of six dies mounted on top of the CPU.

It's using BGA connections rather than through-silicon vias, so it's not quite the same.

What I don't know (and would love to know) is whether that BGA stacking allows Samsung to maintain CMOS logic between the DRAM chips or whether they need to switch to "TTL like" signals and amplifiers already.

If I don't completely misunderstand this, the ability to maintain CMOS levels and signalling across dice would be one of the critical differentiators for through-silicon vias.

But while stacked DRAM (of both types) may work OK on smartphone SoCs not exceeding 5 watts of power dissipation, I can't help but imagine that with ~100 watts of Knights Landing power consumption below, stacked DRAM on top may add significant cooling challenges.

Perhaps these through-silicon vias will be coupled with massive copper (cooling) towers going through the entire stack of CPU and DRAM dice, but again, that seems to entail quite a few production and packaging challenges!

Say you have 10x10 cooling towers across the chip stack: Just imagine the cost of a broken drill on hole #99.

I guess actually thermally optimized BGA interconnects may be easier to manage overall, but at 100Watts?

I can certainly see how this won't be ready tomorrow and not because it's difficult to do the chips themselves.

Putting that stack together is an entirely new ball game (grid or not ;-) and while a solution might never pay for itself in the Knights Corner ecosystem, that tech in general would fit on anything else Intel does produce.

My bet: They won't actually put the stacked DRAM on top of the CPU but go for a die carrier solution very much like the eDRAM on Iris Pro.

All this talk about the importance of co-processors: anyone else remember how big a deal it was when math co-processors were being integrated way back when? History sure likes to repeat itself (in a good way this time, at least).

Yea, I remember back in the late 80's using chromatography control and analytical software that used a math co-processor. I think the computer's main CPU was something like 50 MHz, and all the programs and data were stored on floppy disks. Installing the board for the co-processor and getting the software to work properly was a real bear.

I don't see these competing that well against GPUs. Binary compatibility isn't usually that much of an issue in HPC. Using something like OpenCL allows you to run across a wide variety of hardware architectures, and the run-time compilation should not be much of a factor since it is very fast compared to the rest of the processing. Also, adding more extensions to the ISA seems unnecessary and possibly counterproductive: it gets you "closer to the metal", but it also locks the implementation in place. Using an intermediate layer allows greater changes to the underlying implementation in the future.