Intel takes wraps off 50-core supercomputing coprocessor plans


Intel's Larrabee GPU will finally go into commercial production next year, but not as a graphics processor. Instead, the part will make its debut in a 50-core incarnation fabbed on Intel's 22nm process and aimed squarely at one of the fastest-growing and most important parts of NVIDIA's business: math coprocessors for high-performance computing (HPC).

When Intel's ambitious hybrid software/hardware GPU effort foundered in late 2009 amid delays, Intel insisted that the silicon side of the project would live on in some form. The next year, the company announced that Larrabee had morphed into the Knights family of HPC coprocessors, which it began shipping in very limited quantities as a research testbed. Intel also began calling the basic architecture of the Knights family the Many Integrated Core (MIC) architecture.

Today's announcement is the official unveiling of Intel's broader plan to commercialize the MIC-based Knights family, starting with the 50-core Knights Corner chip on 22nm. Intel is also announcing partnerships with SGI and other system integrators that plan to build commercial HPC systems around the MIC silicon.

The MIC products will compete directly with NVIDIA's Tesla line, making MIC a threat to NVIDIA's growth prospects in a world where integrated processor graphics (IPGs) like Sandy Bridge and AMD's Llano are eating the discrete GPU market from the bottom up.

The main advantage that Intel touts vs. Tesla is that because MIC is just a bunch of x86 cores, it's easy for users to port their existing toolchains to it. (When using Tesla, researchers must port to NVIDIA's proprietary but well-loved CUDA platform.)
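
To make that porting pitch concrete, here's a sketch of the sort of loop-level code HPC applications are full of. I'm assuming, as Intel's pitch implies, that the MIC toolchain accepts standard C++ and OpenMP; the Tesla route would instead mean rewriting the loop body as a CUDA kernel plus host-side memory management. The function is illustrative, not from Intel's materials:

```cpp
// saxpy.cpp -- a typical HPC inner loop, parallelized with a single
// OpenMP pragma. If Intel's porting claim holds, this recompiles for
// MIC unchanged; targeting Tesla means a CUDA rewrite instead.
#include <vector>

void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    #pragma omp parallel for
    for (long i = 0; i < (long)y.size(); ++i)
        y[i] = a * x[i] + y[i];   // each iteration is independent
}
```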

In a previous discussion of the MIC vs. Tesla issue, I suggested that Intel was massively overselling this ease of porting, since applications must be redesigned anyway. But having learned a bit more in the intervening year, I'm not quite as certain that this is the case. Ease of development and porting do seem to matter for both budget-constrained academic labs and the kinds of high-frequency trading and other finance applications that prize speed of deployment along with fast computation.

Even if it does turn out that x86 gives Intel a big advantage, it's not all smooth sailing for MIC. One thing missing from the press materials that Intel sent was any set of relative performance claims vs. Tesla, and that's probably because the more specialized Tesla would crush it on the kinds of codes for which Tesla is commonly used. It's the age-old general-purpose (slower, easier to use) vs. specialized (faster, harder to use) tradeoff, and Intel is betting that, since Tesla has so far been the only real option, there are plenty of potential users out there in the market for something less specialized.

None of the press materials I saw indicated that Intel is changing the interconnect architecture for Knights Corner, so it looks like the chipmaker will be hanging 50 cores off a single high-performance, yet power-hungry, ring bus. This is surprising, since I would have expected the company to move to a tile architecture (à la the SCC) for a core count this high. A tile-based design is no doubt in MIC's future, but there are certainly issues that remain to be worked out.

It's great for software businesses around the globe that Tesla will get a major competitor; not so much for NVIDIA... Anyhow, I don't think the x86 ISA is a major advantage: you already have the main CPU running all the x86 goodies.

What you offload to a GPU is relatively simple code that needs to be executed a gazillion times. If the C++ (or Fortran) compiler produces good code, the need for assembly coding is minimal. Yes, you could hand-optimize the tightest of the tight, but the new AVX instruction set is unfamiliar to coders anyway; that's not your everyday x86. But still, assembly language is not scary, even if it is NVIDIA's variety ;-)
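
For reference, this is roughly what "not your everyday x86" looks like: a minimal sketch of an 8-wide single-precision add using Intel's AVX intrinsics (the function name and shape are mine, purely illustrative; requires an AVX-capable CPU and, e.g., g++ -mavx):

```cpp
// avx_add.cpp -- wide-vector x86: eight floats per instruction.
#include <immintrin.h>
#include <cstddef>

void add8(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats (unaligned OK)
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar tail
}
```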

On a more serious note: Wasn't the purpose of OpenCL to program once and let the OS allocate the workload to CPU, GPU, and whatever co-processor? How -- using OpenCL -- would programming MIC be easier than Tesla?
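
For what it's worth, OpenCL's portability story works roughly like this at the host-API level: the same kernel source can be handed to whatever device the platform exposes, and the host just picks a device type. A minimal sketch using the standard OpenCL C API, with the assumption (mine, not Intel's) that a MIC board would enumerate as an accelerator-type device; error handling omitted:

```cpp
// opencl_pick.cpp -- pick a device by type; the kernels stay the same.
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);

    // A MIC card would presumably show up as CL_DEVICE_TYPE_ACCELERATOR;
    // a Tesla shows up as CL_DEVICE_TYPE_GPU. Fall back to the default.
    cl_device_id device;
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1,
                       &device, nullptr) != CL_SUCCESS)
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

    char name[256];
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, nullptr);
    std::printf("Running on: %s\n", name);
    return 0;
}
```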


If it's just x86 cores, in theory, OpenCL should just work on it (perhaps with tuning?).


It probably would; the problem is that large-scale parallel programming is not the same as writing single-threaded code, or even multithreading dissimilar tasks. Typically you need to re-think what you are doing from the ground up so you don't smack yourself with a Nerf de-performance bat.
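
A minimal sketch of what that ground-up re-think looks like in practice: a parallel sum where each thread keeps a private accumulator and writes out once, instead of every thread hammering one shared variable and ping-ponging the cache line. The function is hypothetical, written with plain std::thread:

```cpp
// rethink.cpp -- restructure for parallelism: privatize, then combine.
#include <atomic>
#include <numeric>
#include <thread>
#include <vector>

double parallel_sum(const std::vector<double>& data, unsigned nthreads) {
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> pool;
    std::size_t chunk = data.size() / nthreads;   // assumes nthreads <= size
    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = (t + 1 == nthreads) ? data.size() : lo + chunk;
            double local = 0.0;               // private accumulator: no sharing
            for (std::size_t i = lo; i < hi; ++i) local += data[i];
            partial[t] = local;               // one write per thread
        });
    }
    for (auto& th : pool) th.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```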

I don't see why x86 really matters here. It's not like there's some installed base of x86 GPU-style chips out there. CUDA and OpenCL applications would probably be among the main uses for a device like this.

Of course, Intel could package a bunch of these 50-core CPUs on one PCIe card, but it would take at least nine of them to match the number of cores on a Tesla board, so that is probably not going to happen, at least for some time.

Tesla boasts 1030 GFLOPS single precision and 515 GFLOPS double precision on one board. So plug two boards into a system and you have 1-2 teraflops of floating-point performance...
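
For anyone wondering where such numbers come from, peak FLOPS is just cores × SIMD lanes × FLOPs per lane per cycle × clock. Intel hasn't published clocks or vector widths for Knights Corner, so the MIC-side parameters below are placeholder assumptions, not specs:

```cpp
// peak_flops.cpp -- back-of-the-envelope peak-FLOPS arithmetic.
#include <cstdio>

int main() {
    double tesla_sp = 1030.0, tesla_dp = 515.0;   // GFLOPS, per Tesla board

    // Assumed, purely illustrative MIC parameters:
    double cores = 50, lanes = 16, flops_per_lane_cycle = 2, ghz = 1.0;
    double mic_sp = cores * lanes * flops_per_lane_cycle * ghz;  // GFLOPS

    std::printf("Assumed MIC peak: %.0f SP GFLOPS vs Tesla %.0f SP / %.0f DP\n",
                mic_sp, tesla_sp, tesla_dp);
    return 0;
}
```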

I saw that as a phoenix, reborn from the ashes. (Firebird is a more interesting story, but then, I've started liking Russian literature more than Greek mythology.) Nice photo art... though the idea of a 50-core coprocessor running without a heatsink and connected with only 32 pins is sort of amusing!

Yeah, but in this case it might not be a joke. Intel is presenting this as a coprocessor, but I wonder if there's any fixed-function GPU silicon in it. In other words, is this a GPU primarily aimed at computing that Intel is presenting this way because they think it's best for now, or is it really just a pure coprocessor?

It's an interesting question, especially when you consider the future of desktops/laptops: will Intel add Knights Corner cores to their Haswell CPUs and their successors in 2013+? If so, does that mean the integrated GPU will continue to grow as well, contending for silicon resources with the main CPU cores and the KC ones? How will this little three-way be managed?

Or will Intel try to optimize silicon usage by using Knights cores for graphics and throughput computing as well? In that case, is the Knights architecture meant to be modular, so as to be coupled with TMUs and ROPs for graphics on consumer CPUs (or rather, APUs) but with the ability to function without them for HPC variants?

Or will Intel keep Knights on the HPC side, with desktop APUs featuring regular CPU cores and an updated, beefed-up, OpenCL-capable GPU, derived from the current Sandy Bridge GPU? But in that case, it might be difficult to get any decent computing performance out of the integrated GPU, unless it's made more compute-friendly. But then why not just use Knights?


From the linked article:

“The programming model advantage of Intel MIC architecture enabled us to quickly scale our applications running on Intel Xeon processors to the Knights Ferry Software Development Platform,” said Prof. Arndt Bode of the Leibniz Supercomputing Centre. “This workload was originally developed and optimized for Intel Xeon processors but due to the familiarity of the programming model we could optimize the code for the Intel MIC architecture within hours and also achieved over 650 GFLOPS of performance.”

So while they're not beating NVIDIA yet, they ARE in the ballpark (i.e., maybe a factor of 2 behind). Unless, of course, they were talking double-precision GFLOPS, in which case Intel is competitive now.


If you're using OpenCL (in any competent way), you've already done that ground-up re-think.


There aren't many applications that can scale past 650 GFLOPS on Tesla in single precision, and with DP, Tesla can't go over 515 GFLOPS. So if that figure is SP, Intel is competitive; if it's DP, it's better.

"the more specialized Tesla would crush it". So...it WASN"T FAST ENOUGH to beat NVIDIA as a GPU, now it's repackaged as a processor that's NOT FAST ENOUGH to beat NVIDIA as a parallel processor. It's only advantage is is easier to write for?, pity MS just released C++ AMP that makes it super easy to program GPUs using normal C++.


Not all cores are created equal. On the graphics side, Radeons have thousands of cores but tend to be neck and neck with NVIDIA on performance. Not to mention Tesla is a mature technology and this is little beyond a prototype.

"the more specialized Tesla would crush it". So...it WASN"T FAST ENOUGH to beat NVIDIA as a GPU, now it's repackaged as a processor that's NOT FAST ENOUGH to beat NVIDIA as a parallel processor. It's only advantage is is easier to write for?, pity MS just released C++ AMP that makes it super easy to program GPUs using normal C++.

HPCwire has a bit more about the story; they specify the original 32-core 45nm version as having in excess of a teraflop, and this one obviously more. Are NVIDIA and AMD counting FLOPS from fixed-function hardware? They also raise the spectre of integrating this sort of thing with the CPU, an idea I have been mulling over for some time; it seems like a logical step to me.

It seems like Intel is saying, "Okay, so our 50-core processor can't compete with our competitors' GPUs or APUs; let's rebrand it and sell it as a 'supercomputer'-worthy platform for 'super-efficient' cloud virtualization."


So, yeah, can it run Crysis?

I am reasonably sure a 50-core Larrabee solution can run Crysis; the question is really whether it can run Crysis cost-effectively compared to a GPU solution. And if it's comparable in cost/performance, would it allow for real-time ray tracing (RTRT), something current GPU designs don't? I suspect, however, that recent improvements in engines mean RTRT is less interesting to game developers, and therefore this is purely a question of aiding HPC development. Which is no small thing: time is money, and simple-to-program cores are far more useful than complex-to-program ones (even if the language isn't that scary, optimizing the program probably is a major PITA). I think the article covers this fairly well, but there is still the hanging question: will it be used to run Crysis?

The big advantage of the GPU camp here is their ubiquity and the fact that their future is pretty much guaranteed. Not sure if the whole HPC market is big enough to sustain another dedicated product.

I mean, if I want over 50 cores on a chip, then IBM has tried that extensively with its BlueGene CPUs. Not sure how successful that is. NVIDIA has the advantage that their solution is so widely available: I can develop and test CUDA products on a $500 PC (and play Crysis on it in my free time).

To be fair, the BlueGene systems are a PITA to program on. So heavily restricted, it's absurd. You can't even create a new thread, many system calls are restricted, and don't even think about doing any job management on your own.

I could see this being incredibly useful in the HPC world, specifically BECAUSE it's basically a cluster on a chip. This ties in really well with the article a while back on Intel's program providing hardware (that looks remarkably like this) to institutions for free for research purposes. Being able to use a single workstation to debug an MPI code is highly valuable.
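
As a sketch of why that matters: the trivial MPI program below can be launched with 50 ranks on one workstation for debugging and then run on a real cluster unchanged (standard MPI API; exact build/run commands will vary by installation):

```cpp
// mpi_hello.cpp -- the "cluster on a chip" appeal in miniature.
//   mpic++ mpi_hello.cpp -o hello && mpirun -np 50 ./hello
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's rank
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of ranks
    std::printf("rank %d of %d checking in\n", rank, size);
    MPI_Finalize();
    return 0;
}
```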

I'm curious though how partitionable these cores would be. Could they be scheduled for multiple processes by something like SLURM, and would they have the current blocking behavior that multi-core (but not usually multi-socket) architectures have with most MPI implementations? If they can't be and do, respectively, then it kinda sucks for some HPC code, but would still be useful for other HPC code.

And yes, the HPC market is large enough, in dollars, to easily support more dedicated products like these, assuming they stand up to scrutiny. And Intel has a lot more experience in this area than Nvidia, not to mention amazing support options. Not many people I know would pick Nvidia over Intel for a general purpose cluster, either.

On the one hand, I think it's really silly for Intel to be engaging in intense competition against NVIDIA when AMD and ATI are instead united. Do you really want to lose your only video card supplier?

On the other, a chip like this is more flexible than a GPU, so I do think it has its place. Although the overhead of the x86 architecture may be a bit much. So a chip with, say, 64 ARM cores on it, as a complement to a chip with a few powerful x86 cores on the one hand, and a GPU on the other...

RTRT is hell on memory, because the access is all over the place as rays bounce. While current silicon can do it FLOPS-wise, the memory access can't keep up.

Related to that, there wasn't any mention of the memory configurations of these MIC systems. GPUs have 2GB of close-coupled, superfast RAM on the board to mitigate memory bottlenecks; that's a benefit for GPGPU. What is MIC using?

They need to show "real applications" making use of this hardware. Lots of FLOPS numbers get thrown around but actual usage can be far from it.

Can someone explain why x86 is seen as such an advantage? Would it really be common to take x86 assembly code and run it on this processor anyway? My impression is that people (even in HPC) are mostly writing in languages like Fortran, C, C++, OpenCL and Cuda. Why would anyone care whether the compiler is compiling to x86, AMD's assembly language etc.?


But for it to run Crysis at all, it needs DirectX support, which means drivers, and that's not something Intel will do unless the part features fixed-function graphics hardware; otherwise it's just not worth it.