
15 Comments

At my workplace we have a fairly well-developed MPI/OpenMP environment. We've dabbled with a Tesla card, but we would like to avoid rewriting everything in OpenCL. Even then, we don't know how long nVidia will support OpenCL.

Excited to see if/when this will actually be released and, since we are a single-precision application, whether it can hold a candle to the ridiculous speed the K10 cards are exhibiting. Reply

I migrated my Brownian motion SP code from OpenMP to CUDA quite easily and got a 375x speedup over a single Nehalem core using a GTX 480. Though tbh, the code was only about 1000 lines and was easier to port than expected. Reply

Each? Or 4 of them put together? Because if that's per-card, I'm not very impressed, considering:

"For comparison, a quadcore Haswell at 4 GHz will deliver about one fourth of that in 2013."

For 300W, you can put together on the order of 10 Haswell quad-cores! That'd give you about 2.5x the max theoretical performance for the same wattage as the Xeon Phi (and, I'd imagine, for a fraction of the cost as well...). Reply
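The arithmetic behind that 2.5x claim can be sanity-checked quickly. This is a rough sketch assuming Haswell's AVX2 FMA capability of 16 DP FLOPs/cycle/core (two 256-bit FMA units), which was the widely expected figure at the time; the exact number was not confirmed in the article:

```python
# Back-of-envelope check of the "10 quad-core Haswells vs. one Xeon Phi" claim.
# Assumption (not from the article): 16 DP FLOPs/cycle/core via two 256-bit FMA units.
flops_per_cycle = 16
cores = 4
ghz = 4.0

haswell_peak_gflops = flops_per_cycle * cores * ghz
print(haswell_peak_gflops)  # 256 GFLOPS, roughly a quarter of the Phi's 1 TFLOPS

# Ten such chips for a comparable ~300W power budget:
print(10 * haswell_peak_gflops / 1000)  # ~2.56 TFLOPS, i.e. about 2.5x
```

Of course this compares theoretical peaks only; sustained Linpack efficiency and the power drawn under full AVX load would shift the picture.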

Very valid points. However, I don't have any measurements or real benchmarks yet. The 300W is, to my understanding, the upper limit. The last time I tested, Linpack could make a CPU consume 30-35% more than a typical integer application, both running at 100% CPU load. Reply

Do you have more info about the 2 GHz frequency? It seems very high for that kind of chip. Maybe the 1 TFLOPS in double precision can be achieved with an FMA instruction (counted as 2 floating-point operations): 1 GHz * (512/64) * 64 cores * 2 ops per cycle. Reply
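That formula multiplies out exactly to the advertised figure; a quick check under the commenter's assumptions (1 GHz clock, 64 cores, 512-bit vectors holding 8 doubles, FMA counted as 2 ops):

```python
# Peak DP FLOPS estimate for Xeon Phi under the assumptions in the comment above.
ghz = 1.0
cores = 64
dp_lanes = 512 // 64   # 512-bit vector / 64-bit double = 8 lanes
fma_ops = 2            # fused multiply-add counted as 2 floating-point ops

peak_gflops = ghz * cores * dp_lanes * fma_ops
print(peak_gflops)  # 1024 GFLOPS, i.e. ~1 TFLOPS double precision
```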

Actual throughput would be quite a bit less than 2 TFLOPS double precision. If we assume 70% efficiency [completely pulled out of nothing], we would get 1.4 TFLOPS double precision.

Is there confirmation that this is true (aside from the efficiency estimate, since I doubt Intel has released that information yet)? Is there also confirmation that each Xeon Phi SoC only has a TDP of 75 watts? If so, that is astounding.

The threads don't add any peak FLOPS; they're only there to help approach that peak. 4 threads per core means there are 4 complete sets of registers in each core. For example, if the currently executing thread doesn't use all of the core's units, another thread can use them. Two threads can't use the same resource at the same time, but since the resources per core (number of ALUs, FPUs, decode and dispatch units, etc.) stay the same, they are used more efficiently.

So for me it's a 1 GHz (maybe a little more) chip with 64 cores. Each can run a fused multiply-add instruction (like in the future AVX2 instruction set of Haswell). That means 2 operations/cycle on 512 bits (so 8 double-precision floats) = 1 TFLOPS double-precision peak performance (2 * 64 * 8). So maybe the frequency is a little more than 1 GHz to achieve the 1 TFLOPS in double precision on LINPACK like they said. But with this kind of architecture and 4 threads/core, the real performance won't be far from the theoretical peak, unlike GPUs, where it's about 60%. Reply

I just checked on my Ivy Bridge processor, and I can reach the theoretical performance peak with the Intel Linpack Benchmark (http://software.intel.com/en-us/articles/intel-mat...). I get 82 GFLOPS in double precision. The theoretical peak is 8 double-precision FLOPs/cycle per core; at 2.6 GHz (3720QM) on 4 cores that's 83.2 GFLOPS. So I'm now pretty sure it will be the same with Xeon Phi, and the frequency will be 1 GHz.
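The Ivy Bridge figure checks out the same way. A sketch assuming the commenter's numbers (AVX is 4 doubles wide, with separate add and multiply ports giving 8 DP FLOPs/cycle/core on the i7-3720QM):

```python
# Reproducing the commenter's Ivy Bridge (i7-3720QM) theoretical peak.
# Assumption: 256-bit AVX (4 doubles) with one add + one mul per cycle = 8 DP FLOPs/cycle/core.
flops_per_cycle = 8
cores = 4
ghz = 2.6

peak_gflops = flops_per_cycle * cores * ghz
print(peak_gflops)  # ~83.2 GFLOPS, matching the measured 82 GFLOPS Linpack result
```

The measured 82 GFLOPS is ~98% of that peak, which supports the argument that a simpler in-order design like Xeon Phi could also run Linpack near its theoretical maximum.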

I don't think the power consumption will be only 75W per card. If you subtract the power for the RAM, that would mean around 1 Watt per core, which is the power consumption of an ARM core. I think it's more like 3-4 Watts per core. Reply

1 core has several threads, but that is just to keep the flow going. For FLOPS, you should focus on the vector unit, not the pipeline or threads. So each vector unit can do 8 DP FLOPs per cycle, not 16. The core runs at around 2 GHz.