
Programming Intel's Xeon Phi: A Jumpstart Introduction

Reaching one teraflop on Intel's new 60-core coprocessor requires a little know-how

The following command-line arguments are required to set the operating mode:

-no-offload: Ignore any offload directives.

-offload-build: Create offload regions according to the directives in the source code.

-mmic: Build the executable for MIC. Linking against the libiomp5 library is also required in this mode.

In addition, the -mkl command-line option tells the compiler to use MKL, while the -std=c99 option allows use of the restrict keyword and C99 variable-length arrays (VLAs).
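Putting the flags together, the three modes might be built along the following lines. This is only a sketch: the compiler name (icc), source file name (matrix.c), and the -O3/-openmp flags are assumptions, and the helper echoes each command line rather than compiling so the flag combinations can be inspected.

```shell
# Hypothetical build lines for the three modes; icc, matrix.c, and the
# -O3/-openmp flags are assumptions. build() echoes each command line
# instead of invoking the compiler.
CC=icc
FLAGS="-O3 -openmp -mkl -std=c99"

build() {
    echo "$CC $FLAGS $*"
}

build -no-offload matrix.c -o matrix.host     # host-only: offload directives ignored
build -offload-build matrix.c -o matrix.off   # offload regions per source directives
build -mmic matrix.c -o matrix.mic            # native MIC build (links libiomp5)
```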

The following scripts were used to survey the runtime behavior when multiplying square matrices ranging in size from 500x500 to 11,000x11,000 for each of the three scenarios:

OpenMP with MKL running locally on the host.

MKL running on the device in automatic offload mode, with OpenMP on the host.

OpenMP running natively on the device.

Listing Seven is just a conventional shell script that loops from i=500 to 11,000 in increments of 500. The environment variable MKL_MIC_ENABLE is set to zero to ensure that the MKL sgemm() runs on the host.
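That listing is not reproduced here, but a minimal sketch of such a host-side loop follows; the matrix.omp binary name is an assumption, and echo stands in for the real invocation.

```shell
#!/bin/sh
# Sketch of a host-side survey script. MKL_MIC_ENABLE=0 keeps the MKL
# sgemm() call on the host; the matrix.omp binary name is an assumption.
export MKL_MIC_ENABLE=0

i=500
while [ "$i" -le 11000 ]; do
    echo "./matrix.omp $i $i $i"   # replace echo with the real invocation
    i=$((i + 500))
done
```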

In this example, the environment variable MKL_MIC_ENABLE is set to 1. This tells the MKL library to run on the coprocessor. In addition, this script sets the variable MIC_ENV_PREFIX to indicate that all environment variables intended for the Intel Xeon Phi will be preceded by the string "PHI_". Per the previous discussion, a balanced thread affinity will be used.

Listing Eight: Survey script to run MKL on the device (and OpenMP on the host).
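The listing itself is not reproduced here; a sketch along the lines of the preceding description, again assuming a matrix.omp binary and echoing the command rather than running it, might be:

```shell
#!/bin/sh
# Sketch of the automatic-offload survey script. MKL_MIC_ENABLE=1 moves
# the MKL sgemm() work to the coprocessor; variables prefixed with PHI_
# are forwarded to the device per MIC_ENV_PREFIX. The matrix.omp binary
# name is an assumption.
export MKL_MIC_ENABLE=1
export MIC_ENV_PREFIX=PHI
export PHI_KMP_AFFINITY=balanced   # balanced thread affinity on the device

i=500
while [ "$i" -le 11000 ]; do
    echo "./matrix.omp $i $i $i"   # replace echo with the real invocation
    i=$((i + 500))
done
```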

Listing Ten runs on the device and assumes that libiomp5.so and the required Intel Xeon Phi libraries have been copied to /tmp. The LD_LIBRARY_PATH variable is set so these libraries can be found by the Intel Xeon Phi when the binary is loaded. Further, this script assumes the matrix.mic executable has been copied to the directory where the script will run. Per the previous discussion, the thread affinity is defined to be balanced.
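A sketch of such a native script (run on the coprocessor itself) might look like the following, with echo again standing in for the real invocation:

```shell
#!/bin/sh
# Sketch of the native survey script, executed on the coprocessor.
# Libraries such as libiomp5.so are assumed to have been copied to /tmp,
# and matrix.mic to the current directory, as described in the text.
export LD_LIBRARY_PATH=/tmp:$LD_LIBRARY_PATH
export KMP_AFFINITY=balanced   # no PHI_ prefix: we are already on the device

i=500
while [ "$i" -le 11000 ]; do
    echo "./matrix.mic $i $i $i"   # replace echo with the real invocation
    i=$((i + 500))
done
```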

Of special interest in the host offload script is the use of the PHI_USE_2MB_BUFFERS environment variable, which controls the use of large pages. By default, the runtime system allocates memory in 4KB virtual memory pages. These small pages can cause a performance degradation with some algorithms, such as matrix multiplication, due to misses in the translation lookaside buffer (TLB). Setting PHI_USE_2MB_BUFFERS (when MIC_ENV_PREFIX=PHI) tells the runtime to allocate heap variables whose size exceeds the value specified by PHI_USE_2MB_BUFFERS in 2MB pages. According to an Intel MIC memory presentation, consider setting this environment variable when your code uses 16 MB or more of memory for a large data structure with heavy, semi-random element accesses. Native applications should use mmap or hugetlbfs to allocate memory with large pages.
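In a script, enabling large pages for offloaded heap allocations might look like the following two lines; the 16K threshold is an illustrative value, not a recommendation from the text.

```shell
# Hypothetical settings: heap allocations larger than 16 KB made by the
# offload runtime will be backed by 2 MB pages on the device. The 16K
# threshold is illustrative only.
export MIC_ENV_PREFIX=PHI
export PHI_USE_2MB_BUFFERS=16K
```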

Briefly, the use of the larger 2MB pages can benefit many algorithms such as matrix multiplication. In virtual memory architectures like the x86 processors, the TLB provides an on-chip cache to speed virtual address translation. When a page entry is in the TLB, application addresses can be translated to physical RAM addresses with minimal overhead and no additional RAM accesses. While TLB caches are fast, they are also small, and the performance penalty incurred by a TLB miss is significant. The importance of larger pages for floating-point-dominated applications can be understood by considering any array operation, however trivial, that steps through memory in strides greater than the standard page size used by the system. Such scenarios occur frequently when working with two- and higher-dimensional matrices. Because of the stride size, each memory access requires looking up a new page in the TLB. If the array is sufficiently large, every memory access can cause a TLB miss and a corresponding performance drop. Using larger page sizes results in fewer TLB misses, which increases application performance because the processor does not have to wait (or wait as long) for data.

Setting the number of threads on the host and device can be confusing when running in offload mode. The matrix.c source code uses the OpenMP omp_set_num_threads() function to control the number of threads when running in native and host modes. The previous offload script uses the host-side PHI_OMP_NUM_THREADS environment variable to specify the number of threads used by the device in offload mode. Sumedh Naik, from Intel, provided the following code snippet to illustrate how to set the number of threads at runtime on both the host and device in offload mode.
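For reference, the environment-variable route used by the offload scripts looks like this; both thread counts are illustrative values, not recommendations from the text.

```shell
# Illustrative thread-count settings for offload mode: OMP_NUM_THREADS
# controls the host, and (with MIC_ENV_PREFIX=PHI) PHI_OMP_NUM_THREADS
# is forwarded to the device as its OMP_NUM_THREADS.
export MIC_ENV_PREFIX=PHI
export OMP_NUM_THREADS=24        # host threads (illustrative)
export PHI_OMP_NUM_THREADS=240   # device threads (illustrative)
```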

Listing Eleven: Setting the number of threads on both host and device at runtime.

Figures 6 and 7 show how the runtimes scale according to matrix size on a preproduction 1.5 GHz Intel Xeon Phi coprocessor containing 61 cores compared with a 12-core Sandy Bridge X5680 running at 3.33GHz (marked Host OpenMP). The algorithm utilized in the doMult() example code is clearly not as efficient as the one used in the MKL library. For this reason it is unfair to directly compare the performance of the OpenMP code against the MKL library. The value of the MKL performance numbers is that they show the performance capability of the Xeon Phi hardware.

The MKL library results demonstrate that a preproduction Intel Xeon Phi coprocessor can easily deliver over 1 teraflop/s of single-precision performance. In comparison, the host implementation of the MKL sgemm() function appears to saturate the X5680 processors and/or memory subsystem when multiplying matrices larger than 2000x2000. In offload mode, the MKL library takes responsibility for moving the data to and from the device. For this reason the MKL offload results are particularly interesting because they demonstrate that it is possible to achieve more than a teraflop/s of performance even with the overhead required to move data between the host and device.

Figure 6: MKL sgemm() performance results compared with Sandy Bridge.

In Figure 7, the matrix.c OpenMP code delivers the best performance when running natively on the Intel Xeon Phi. Even with the overhead of transferring the A, B, and C matrices on every call to doMult(), the offload performance still exceeded that of the OpenMP code running on the host. This is expected, as matrix multiplication performs O(N) calculations per datum (that is, per floating-point element) transferred: multiplying two NxN matrices requires O(N^3) operations but moves only O(N^2) data. Matrix multiplication is used as a common example to demonstrate coprocessor and accelerator programming for this very reason. Unfortunately, algorithms that are this insensitive to data movement are desirable but not very common. As later tutorials will show, a large amount of coprocessor application design time is spent on minimizing data transfer costs.
