Will Intel’s Knights Corner chips function as co-processors like GPUs, or will they be stand-alone many-core Linux systems? The two approaches present very different performance profiles.

Convergent Evolution in HPC: Intel MIC and NVIDIA GPU

Five years ago, NVIDIA disrupted the high-performance computing industry with the release of CUDA in February 2007. In combination with the low cost of teraflop/sec (single-precision) GPU hardware, NVIDIA brought supercomputing to the masses along with co-processor acceleration of both C and Fortran applications. With the MIC announcement, Intel has followed suit along this convergent evolutionary path.

While MIC is similarly packaged as a PCIe device, Intel has taken a different architectural approach to massively-parallel computing hardware. The KNC generation of MIC products appears to be HPC-oriented, which means high-end customers can now choose from two types of teraflop/sec-capable PCIe-based co-processors.

Comparing NVIDIA GPUs and Intel MIC

GPU designs utilize many streaming multiprocessors (SMs), where each SM can run up to 32 concurrent SIMD (Single Instruction Multiple Data) threads of execution. The current generation of Fermi GPUs supports 512 concurrent SIMD threads of execution that can be subdivided into 16 separate SIMT (Single Instruction Multiple Thread) tasks. The upcoming NVIDIA Kepler GPUs will support even greater parallelism. For example, the GTX 680 will support 1,536 concurrent SIMD threads.

Teraflop/sec performance is achieved through a per-SM hardware scheduler that can quickly identify those SIMD instructions that are ready-to-run (meaning they have no unresolved dependencies). Ready-to-run instructions are then dispatched to keep multiple integer, floating-point, and special function units busy. A per-GPU hardware scheduler similarly allocates work (via CUDA thread blocks or OpenCL™ work-groups) to ensure high utilization across all the SMs on a GPU.

High flops/watt efficiency is realized through the use of a SIMD execution model inside each SM that requires less supporting logic than non-SIMD architectures. GPU hardware architects have been able to capitalize on these savings by devoting more power and die space to 64-bit addressing, additional ALUs, floating-point units, and Special Function Units for transcendental functions. Some reviewers report that NVIDIA expects Kepler to deliver "about 3x improvement in [double precision] performance per watt …" over Fermi.
Other notable characteristics include:

Data-parallel operations are spread across the SMs of one or more GPU devices.

MPI jobs are accelerated by using one or more GPUs per process and capabilities like GPUDirect, which optimizes data transfer into device memory.

Intel MIC

The Intel MIC architecture in the KNC chip utilizes x86 Pentium-based processing cores that support four threads per core. According to The Register, the next-generation Knights Corner has "64 cores on the die, and depending on yields and the clock speeds that Intel can push on the chip, it will activate somewhere between 50 and 64 of those cores and run them at 1.2GHz to 1.6GHz". At four threads per core, this implies that each KNC chip will provide between 200 and 256 concurrent threads of execution.

Teraflop/sec floating-point performance can be achieved when enough of the SMP threads issue special SSE-like instructions to fully utilize an enhanced vector/SIMD unit that resides on each core. (Note: this requires either the special "-mmic" compiler switch, which tells the Intel compilers to look for cases where these MIC-specific vector instructions can be utilized, or hand-coding with intrinsic operations.)

High flops/watt efficiency is realized by leveraging the simplicity of the original in-order, short-execution-pipeline Pentium design and the power savings of chips created with Intel's 22 nm manufacturing process. MIC also derives high flops/watt from using wide vector units. The logic for the Pentium core is small relative to modern processor cores, which left room for additional logic to support 64-bit addressing, four concurrent threads per core, and a large 512-bit wide vector unit. Per the TACC Stampede announcement, a revision of the KNC per-core vector unit will deliver 50% higher floating-point performance in 2013.
Other notable characteristics include:

Data-parallel tasks appear to be mainly accelerated by the per-core vector units.

Task-parallelism is accelerated by running a task per thread and separate tasks on the device(s) and host processor.

MPI jobs are accelerated by using one or more MIC devices per process or capabilities like MIC-as-a-compute-node discussed later in this article.

Degree of Parallelism

NVIDIA GPU: Fermi supports 512 concurrent SIMT threads of execution; Kepler will triple this number to 1,536 threads.

Intel MIC: Knights Corner is expected to support between 200 and 256 concurrent threads.
