Post Your Comment

46 Comments

I wonder if we'll ever have more numerous smaller cores like these working in conjunction with larger traditional cores. A bit like the PPE and SPEs in the Cell processor, with the more general core offloading what it can to the smaller ones. Reply

The great thing about the Cell was that both the PPE and the SPEs had access to the same memory. Trinity doesn't and while that may be because there isn't an OS that would take advantage of that, hardware is as capable as software is efficient for that exact hardware solution.

There is no need for major parallelism in the consumer space, since nobody is willing to rewrite their programs to run on something faster whilst the general public is already served well enough by a Core i3 or i5. Reply

"The great thing about the Cell was that both the PPE and the SPEs had access to the same memory."

Hmm. This is not a useful statement.

Cell had a ludicrous addressing model that was clearly irrelevant to the real world. It's misleading to say that the cores had access to "the same memory". The way it actually worked was that each core had a local address space (I'm think 12bit wide, but I may be wrong, maybe 14 bits wide) and almost every instruction operated in that local address space. There were a few special purpose instructions that moved data between that local address space and and the global address space. Think of it as like programming with 8086 segments, only you have only one data segment (no ES, no SS), you can't easily swap DS to access another segment, and the segment size is substantially smaller than 64K.

Much as I dislike many things about Intel, more than anyone else they seem to get that hardware that can't be programmed is not especially useful. And so we see them utilizing ideas that are not exactly new (this design, or transactional memory) but shipping them in a form that's a whole lot more useful than what went before.This will get the haters on all sides riled up, but the fact is --- this is very similar to what Apple does in their space.Reply

That's exactly how this supercomputer, and all supercomputers offering accelerated compute, work. Xeon or Opteron CPUs handle complex branching tasks like networking and work distribution while the accelerators handle the parallelizable problem solving work.

Merging them onto a single die is simply a matter of having enough die space to fit everything while making sure that economics of a single chip is better than separate products.Reply

Sort of I suppose, but I think something like this would be easier to use for most compute tasks for the reasons the article states, these are still closer to general processor cores than GPU cores are. Reply

Something like ARM's big.LITTLE in a sense seems like a good idea to me. I'm not sure how feasable it is, but having one or two small Atom-like cores paired to larger and more complex Core processing cores all sharing the same L3 sounds like a decent idea for mobile CPUs to cut idle power use. My guess is the two types of cores would need to share the same instructions, so the differences would be things like OoO vs In-order, execution width, designed for low clock speed vs high clock speed. The Atom SoCs can hit power use around that of ARM SoCs, so if Intel can get that kind of super low power use at low loads and ULV i7 performance out of the same chip when stressed that'd be super killer. Reply

One rumor I had heard upon Larrabee getting cancelled and turned into Knights Ferry was that this technology might be released as a coprocessor that used the same socket as the "main" Xeon.

That you could mix-and-match them in one system. If you wanted maximum conventional performance, you put in 8 conventional Xeons. If you wanted maximum stream performance, you'd put in one "boot" conventional Xeon, and 7 of these. (At the time, there were also rumors that Itanium was going to be same-socket-and-platform, which now looks like it will come true.)Reply

That rumor has a grain of truth to it. A slide deck about Larrabee from Intel indicated a socketable version fitting into a quad socket Xeon motherboard. This was while Intel still had consumer plans for Larrabee which have since radically changed.

I don't even know what generally (and publically accessible) programs are available that you would be able to use to do this sort of HPC testing.

OpenMP code is sort of "easier" to come by. A program that has both an OpenMP and a CUDA version where it's a straight port - I can't even think of one.

The only one that might be a possiblity would be Ansys 13/14 because they do have some limited static structural/mechanical FEA capabilities that can run on the GPU, but I don't know how you'd be able to force it onto the Xeon Phis.

The next version of OpenMP should have accelerator suppport via the OpenACC scheme. I'd bet that most engineering applications will be able to support most accelerators like Phi, Tesla and APUs in a transparent manner simply through the math libraries, not perhaps in the most optimal but at least in a sufficiently worthwhile way.Reply

You can probably run Java on it, but it will not run well. Most Java code is application code - very branchy, something the Phi's memory architecture cannot handle well. The JVM certainly will not vectorize code either, so you have all those vector units being wasted.

This is really much closer to a GPU in terms of the kind of optimizations that must be done for performance, even if the underlying instruction set is x86.Reply

Like the people in charge of F@H would develop and release a new folding core so that it could run on one of these in the off chance some enthusiast has one of these multi thousand dollar cards and a computer system that can run it?

Not going to happen. This isn't a general CPU core that any existing software can run on, nor is it aimed at home users. Reply

In theory it should run the current the Linux version of F@H without modification. That catch is that the current version is going to be horribly suboptimal as it doesn't natively support the 512 bit wide vector format used by the Xeon Phi. This would leave only the x87 FPU for calculations. This would allow the 60 scalar FPU's to be used but limit performance to a mere 60 GFLOP across all the cores. There maybe some weird scheduling oddities with Linux and/or F@H due to the chips ability to expose 240 logical processors to the host OS (the result would be better performance from running multiple instances in parallel instead of one large instance using 240 threads).

An OpenCL version of F@H might be coaxed to working and it that would utilize the 512 bit vector units. Intel would have to have OpenCL drivers available for this to even have a chance of working. This would allow the full ~1 TFLOP performance to be utilized.Reply

Because they needed heavier duty vector units. Each Phi core has 32 512-bit registers, where Core i7 has 16 256-bit registers. They just didn't implement the backward compatibility, probably to reduce complexity. It is certainly possible to do, and we may indeed see AVX, SSE, etc. added in a future revision.Reply

The 512 bit vector instructions change how exceptions and the register masking are handled in comparison to AVX. Outside of that, the vector instructions are similar to how AVX instructions are formatted and the output complies with IEEE floating point standards. So while there is a distinct break in ISA capabilities, it does appear that it is possible to bridge the two together in future designs. Still it is odd that Intel has forked their ISA.Reply

I concur. A major dis-advantage to co-processor computing is the time it takes to move data on and off the card. The PCIe 2.0 bus is already a bottleneck in our workflow involving a Tesla card. This was a very short-sighted omission.Reply

It is dependent upon the bus encoding. PCI-E 1.0/2.0 use an 8/10 encoding scheme to handle traffic while PCI-E 3.0 uses 128/130 encoding. PCI-E only increases the clock speed of the bus by 66% with the rest of the bandwidth increase stemming from the more efficient encoding schema. Xeon Phi seems to have kept the PCI-E 1.0/2.0 encoding but supports the higher clock rate of PCI-E 3.0. This appears to be nonstandard but the LGA 2011 Xeons appear to support this for additional bandwidth.

Any overhead is likely adding full PCI-E 3.0 support in addition to PCI-E 1.0/2.0. Reply

can't deny that openMP code that automatically runs faster on the phi would represent a great solution for those looking for the speed up without the cost and time of modifying code for gpus. There certainly is a market to cater for with these cards. Reply

The numbers you describe as to configuration of the units doesn't add up.

Eight of the compute sleds plus two PSU sleds cannot fit into a 4U unit. Judging by the photo it appears that in this vertical configuration of nodes it goes something like this: {Compute, Compute, PSU, PSU, Compute, Compute } 4U. It appears this leads to a total of 5 chassis per cluster within the rack.

This is then compounded with two distinct clusters per rack with their own Infiniband switch + regular Ethernet switch. This makes for a total of 10 C8000 chassis per rack.

This makes sense when considering a 48U rack. 22U per cluster x 2 clusters per rack = 44U with a few spare slots at the top and middle. I could two at top and one in the middle.Reply

I think the author got some Dell facts mixed up. Looking at Dell.com you can fit 8 compute sleds in, but those compute sleds are half width and don't contain the necessary double wide PCIe slots to accommodate a Xeon Phi card. So in a single 4U unit, you are correct it can only hold 4 compute sleds and two power units as depicted in the Stampede picture.Reply

"Eight of those server sleds find a home inside the C8000 4U Chassis, together with two power sleds." did you write - but on the photo I only see 4x computing sleds plus 2x power. Where are the other 4?!Reply

There are only 4 per row in the chassis because these units in Stampede feature the Xeon Phi card which requires a bigger sled. The author got the potential specs messed up with the way they are actually configured for this supercomputer.Reply

So, it seems these are great at general purpose supercomputing.How do they stack up against the latest FPGAs if they are set up carefully by the people who will be running a specialized problem on them?And would these be able to work effectively with offloading of some key functions that would be able to work 20-100x faster (or power efficient) on a carefully set up FPGA?

Some people in the comments mentioned hetrogenous computing. A step on the way is modular accelerated code. I'm interrested to see if we get more specialized hardware for acceleration in the comming years, not just graphics (with transcoding) and encryption/decryption like is common in CPUs now. Or if we get an FPGA component (integrated or PCIe) that can be reserved and set up by programs to realize huge speedups or power savings.Reply

Why have there not been a lot of reports regarding the PCI express. This was the first source that I was able to find that even mentions the speed of the PCI e bus for the Xeon Phi.

One of the most challenging things for programming on accelerators is handling the PCI express and trying to balance data transfer with computational complexity. Everyone, NVIDIA, Intel, AMD seem to be doing a lot of arm waving regarding this issue, and there are many GPU papers that tend to omit the transfer times in their results. To me I find this dishonest and cheating.

One thing that continues to shock me as well is that people keep complaining about how difficult it is to debug a GPU program and then they reference old out of date references such as http://lenam701.blogspot.be/2012/02/nvidia-cuda-my... which was mentioned above. The things that the author of that blog post complained about have been resolved in the latest versions of CUDA (from 4.2 onward... maybe even in 4.0).

Programmers can now use printf and it is possible to hook a debugger into a GPU application to do more in depth debugging. The main thing that bothers me about GPU programming is you must check to make sure a program has successfully completed or not. Other than that I find it relatively easy to debug a GPU application.Reply

i would really like to see a benchmark of this cpu on LLVMpipehttp://www.mesa3d.org/llvmpipe.htmlThe original Larabee would have had a DirectX translation layer and this project could be seen as an OpenGL version of it.Just loading a distro with Gnome 3 running on LLVMpipe or benchmarking some ioq3 and iodoom3 games would be VERY interesting.Reply