Xeon Phi Lacks Binary Compatibility, Breaks AMD64 Conventions

Three weeks ago Intel launched their Xeon Phi coprocessor card with much fanfare. We took the time to analyze some technical details about the chip that cast some doubt on Intel’s claim of x86 compatibility. It should be noted that Intel attempted to word their press release very carefully, to only make claims about the x86 programming model. Still, we believe it should be pointed out that beyond the basic x86 instruction set and programming model, Xeon Phi is quite a different beast compared to current and past x86 CPUs.

To get a basic understanding of Larrabee, it helps to know that its individual cores are based on the P5 microarchitecture, which debuted in 1993. A slightly evolved version of that microarchitecture is used in Intel’s Atom line of CPUs and now also in the Many Integrated Core (MIC) architecture, which Larrabee ultimately culminated in. Apart from this shared ancestry, the feature sets of Atom and Xeon Phi have little in common beyond the basic x86 instructions and the 64-bit extensions that have been added to the microarchitecture.

On page 657f of the "Knights Corner Instruction Set Reference Manual", which can be downloaded via a link in this forum post aimed at developers, Intel details which registers and instructions are not supported in the Knights Corner architecture. This includes any instructions operating on MMX, XMM and YMM registers, which covers more or less all of the instruction set extensions introduced over the course of the last 17 years: MMX, every iteration of SSE, and AVX.

The same manual also describes what is supported. This includes the basic x86 instruction set as well as the additions for Intel 64, which is Intel’s moniker for AMD64, the well-known 64-bit extension of x86. Knights Corner also supports the x87 FPU instructions, which have been integrated into the CPU since the arrival of the 486. In addition, Intel added a new set of 32 ZMM registers, each 512 bits wide, along with a new vector instruction set operating on those registers. These instructions operate on vectors of 32-bit or 64-bit integer and floating-point values, making the vectors 16 or 8 elements wide, respectively. The gory details of these instructions are explained in the reference manual as well.
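
The vector widths follow directly from the register size. As a quick sanity check, the short Python sketch below divides the register width by the element width. The function and constant names are purely illustrative and do not come from Intel's manual.

```python
# Illustrative arithmetic only; all names are hypothetical,
# not taken from Intel's reference manual.
ZMM_WIDTH_BITS = 512  # width of one ZMM register, as stated in the article


def lanes(element_bits: int, register_bits: int = ZMM_WIDTH_BITS) -> int:
    """Return how many elements of the given width fit into one register."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return register_bits // element_bits


if __name__ == "__main__":
    for bits in (32, 64):
        print(f"{bits}-bit elements per ZMM register: {lanes(bits)}")
```

For comparison, the same calculation for a 128-bit XMM register, `lanes(32, 128)`, gives four 32-bit elements, so a single Knights Corner vector instruction covers four times as many values as an SSE instruction.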

As a consequence of these architectural changes, existing x86 binaries are unlikely to run unmodified. At the very least, software has to be recompiled to run on Larrabee. Depending on how the software one wants to run on Intel’s MIC is implemented, some reengineering effort may be necessary as well. Just to put it into perspective, Intel is throwing out over 15 years of x86 CPU innovations in this case. Handwritten SIMD code is basically worthless on Larrabee. This means that developers of HPC applications that rely on SIMD optimizations have to dedicate effort to rewriting portions of their code.

On the other hand, code written and compiled for Larrabee is not compatible with all the other x86 CPUs out there, since the new vector instructions have to be used to extract any meaningful performance out of Larrabee. At this point it is unknown whether these instructions will be included in future Intel CPUs aimed at servers, desktops, notebooks and so on. The only exception to both of these hard rules is code that uses only basic x86 instructions or their 64-bit equivalents.
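
This two-way incompatibility can be modeled as a simple subset test: a binary runs on a chip only if every instruction set extension it uses is implemented there. The Python sketch below is a simplified model, not a CPU detection routine. The feature names, the `can_run` helper and the example binaries are made up for illustration, the feature sets only reflect what is described in this article, and "KNC_VECTOR" is a placeholder label for the new 512-bit vector instructions.

```python
# Simplified model of binary compatibility; all names are hypothetical.
# The feature sets follow the article's description, not a CPUID dump.
BASE = {"x86", "x86-64", "x87"}

# A current mainstream x86 CPU: base ISA plus the established SIMD extensions.
MAINSTREAM_CPU = BASE | {"MMX", "SSE", "SSE2", "AVX"}

# Knights Corner: base ISA plus its own 512-bit vector instructions,
# but none of MMX, SSE or AVX.
KNIGHTS_CORNER = BASE | {"KNC_VECTOR"}


def can_run(required: set, supported: set) -> bool:
    """A binary can run only if every feature it requires is supported."""
    return required <= supported


def missing(required: set, supported: set) -> list:
    """List the required features the target chip does not implement."""
    return sorted(required - supported)


# Three hypothetical binaries and the features they depend on.
SSE_BINARY = {"x86-64", "SSE2"}        # typical SIMD-optimized HPC code
KNC_BINARY = {"x86-64", "KNC_VECTOR"}  # code tuned for Knights Corner
PLAIN_BINARY = {"x86-64"}              # basic instructions only

if __name__ == "__main__":
    for name, binary in [("SSE", SSE_BINARY), ("KNC", KNC_BINARY),
                         ("plain", PLAIN_BINARY)]:
        for chip_name, chip in [("mainstream CPU", MAINSTREAM_CPU),
                                ("Knights Corner", KNIGHTS_CORNER)]:
            status = "runs" if can_run(binary, chip) else (
                "fails, missing " + ", ".join(missing(binary, chip)))
            print(f"{name} binary on {chip_name}: {status}")
```

In this model only the binary restricted to basic 64-bit instructions runs on both chips, which matches the single exception noted above.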

Intel even acknowledges this issue in a blog aimed at developers, but stresses that this should mostly affect tools developers: "This combination of Linux, 64-bits, and new vector capabilities with an Intel® Pentium® processor-derived core, means that Knights Corner is not completely binary compatible with any previous Intel processor."

To make matters worse, Intel breaks quite a few conventions related to AMD64. The x87 FPU has been deprecated by both AMD and Intel in favor of SSE2. At one point during the development of the 64-bit kernel of Windows XP, Microsoft even wanted to throw it out altogether, but kept it for backwards compatibility reasons (it is still considered deprecated, meaning Microsoft doesn’t recommend using it). AMD and Intel encourage developers to use the SSE (and now also AVX) instruction set extensions wherever possible. AMD mentions that in their software optimization guides for the K8 and up (Software Optimization Guide for AMD64 Processors, page 83). Intel stresses on numerous occasions in chapters 5 and 6 of their "Intel® 64 and IA-32 Architectures Optimization Reference Manual" that SSE and AVX should be preferred over x87 whenever possible. Intel even states on page 13 of the release notes of their compiler: "All Intel® 64 architecture processors support Intel® SSE2." Larrabee invalidates that statement.

Tailoring an application for Larrabee is not really less work than going the GPGPU route with CUDA or OpenCL. In either case, the code needs to be tailored for the specific hardware (even though OpenCL promises to be platform agnostic, in practice you need to optimize for specific architectures to get meaningful performance). Ignoring performance characteristics and availability, these technologies are roughly on an equal footing. If we factor in that CUDA and, to a lesser degree, OpenCL have a head start on the market, things start to look dire for Intel.

The only advantage of Intel’s compatibility claim that remains is the familiar x86 programming model. Any programmer who has experience with Intel’s other SIMD ISA extensions should have no problem learning the new instructions. Getting started with CUDA or OpenCL, on the other hand, is quite a different beast. Intel claims that this low-level programming shouldn’t even be necessary, since any code written in a high-level language such as C or C++ can be fed to their compiler to produce Larrabee-compatible code. In the end it boils down to the quality of the compiler.

While there is already a supercomputer installation using Intel’s MIC up and running, placed #150 on the Top500 list, broad shipments of the accelerator card should only start in the second half of 2012, with no specifics given. This could be anytime between now and December. Other than the statements "more than 50 cores", "capable of supporting over 8GB GDDR5 memory" and "manufactured with 22nm 3D transistors", Intel didn’t give many specifics about Xeon Phi. Last year they outlined their bold roadmap for the product.

It remains to be seen whether Xeon Phi can be successful the way it was designed. As we revealed quite some time ago, the road to this product has been rocky at best. We don’t intend to spell doom over Larrabee with this analysis; we merely want to point out something that should be known in order to better understand how it compares to GPGPU computing.