Programming Intel's Xeon Phi: A Jumpstart Introduction

By Rob Farber, December 10, 2012

Reaching one teraflop on Intel's new 60-core coprocessor requires a little know-how

Demonstration: Scalability to 120 Threads Is Recommended

Listing One is a C source code snippet that implements doMult(), a function that multiplies two square matrices A and B and assigns the result into matrix C. This function will be used to quantify the performance impact of the number of threads utilized per core.
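Listing One is not reproduced in this excerpt. The following is a minimal sketch of what such a doMult() might look like, assuming row-major C99 variable-length arrays and a straightforward OpenMP parallel loop; the parameter names and layout are illustrative and not necessarily identical to the article's listing.

```c
/* Sketch of a doMult()-style routine: C = A * B for size x size matrices.
 * The OpenMP pragma parallelizes the outer loop across the available
 * threads; without OpenMP support the pragma is ignored and the loop
 * runs serially. */
void doMult(int size, float (* restrict A)[size],
            float (* restrict B)[size], float (* restrict C)[size])
{
#pragma omp parallel for
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            float sum = 0.0f;
            for (int k = 0; k < size; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    }
}
```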

Readers familiar with OpenMP are comfortable using pragmas to annotate their code so it can be parallelized by an OpenMP-compliant compiler. Listing Two is a complete source code with test harness to demonstrate the average native (Intel Xeon Phi as a Linux SMP computer) runtime performance.

The firstMatrix.c source code can be compiled to run natively on an Intel Xeon Phi coprocessor as an OpenMP application with the Intel C Compiler (icc) command shown in Listing Three. The –mmic argument specifies native compilation for Xeon Phi.
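Listing Three is not reproduced in this excerpt; a representative invocation in its spirit is shown below. The output file name and optimization flag are assumptions, not the article's exact command line.

```shell
# Hypothetical compile line for native Xeon Phi execution:
#   -mmic   : generate a native Intel Xeon Phi (MIC) binary
#   -openmp : enable OpenMP (icc's spelling of the flag at the time)
icc -mmic -openmp -O3 firstMatrix.c -o firstMatrix.mic
```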

The KMP_AFFINITY environment variable specifies the thread-to-core affinity. There are three preset schemes: compact, scatter, and balanced. Intel recommends the user explicitly define the affinity that works best for their application. The reason is that the default runtime thread affinity can change between software releases. For consistent application performance across software releases, do not rely on the default affinity scheme.

• compact pins four threads to a core before filling the next core, using the minimum number of cores.

• scatter distributes threads as evenly as possible across all cores.

• balanced distributes threads evenly across all cores while keeping adjacent threads (sequential thread numbers) pinned to the same core. One caveat: "all cores" here means the total number of cores minus one, because one core is reserved for the operating system during an offload.
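The scheme is selected by setting the environment variable before launching the application. The binary name below is hypothetical, and this assumes a shell session on the coprocessor itself for native execution.

```shell
# Illustrative only: pick an explicit affinity scheme rather than
# relying on the runtime default, as Intel recommends.
export KMP_AFFINITY=balanced     # or: compact, scatter
export OMP_NUM_THREADS=120       # e.g., 2 threads per core on 60 usable cores
./firstMatrix.mic
```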

The runtimes in Figure 4 show that the first performance peak is around 120 threads, or 2x the coprocessor core count. (Remember that one core is reserved for the operating system.) The highest and broadest performance peak is observed around 240 threads, or 4x the core count.

Figure 4: Average GFlop/s as a function of thread count when multiplying 1000x1000 matrices.

Figure 5 illustrates that the variation in runtime performance with 240 threads is clearly larger than the variation with 120 threads. However, averaging over 10 samples shows that the minimum observed runtime does not dramatically affect the overall average runtime (marked with the green tic mark).

Figure 5: Variation in runtime in native mode.

Be aware that operating system jitter due to system daemons and multiple user processes can introduce performance variations. (An excellent starting paper on this topic is "The Case of the Missing Supercomputer Performance.") In particular, the default round-robin scheduling with multiple devices can introduce performance variations.

Offload Programming

The offload pragma in Listing One provides additional annotation so the compiler can correctly move data to and from the external Phi card. Note that multiple OpenMP loops can be contained within the scope of the offload directive. The clauses are interpreted as follows:

• offload: The offload pragma keyword specifies that the following clauses contain information relevant to offloading to the target device.

• target(mic:MIC_DEV): The target clause tells the compiler to generate code for both the host processor and the specified offload device. In this example, the target will be a Xeon Phi device associated with the integer specified by the constant MIC_DEV. Note that:

The offload runtime will schedule offload work within a single application in a round-robin fashion, which can be useful to share the workload amongst multiple devices. It is the responsibility of the programmer to ensure that any persistent data resides on all the devices when round-robin scheduling is used! In general, only use persistent data when the device number is specified or bizarre errors can result. Note that the use of persistent data on the device is required by the three rules of high-performance computing to avoid PCIe bottlenecks.

The offload runtime will utilize the host processor when no coprocessors are present and no device number is specified (for example, target(mic)).

Alternatively, programmers can use _Offload_to to specify a device in their code.

The length(element-count-expr) specifies the number of elements to be transferred. The compiler will perform the conversion to bytes based on the type of the elements. By default, memory will be deallocated on exiting the scope of the directive.

The free_if(condition) modifier can change the default behavior.
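Pulling these clauses together, a minimal sketch of an offload region might look like the following. MIC_DEV, the function name, and the array names are illustrative assumptions; on a compiler without offload support the pragma is ignored and the loop simply runs on the host.

```c
#define MIC_DEV 0  /* hypothetical device number */

/* Sketch of an offload region: A is copied in, C is copied out, and
 * length() gives the element count so the compiler can size the PCIe
 * transfers in bytes.  The default free_if(1) releases the device
 * memory on exiting the pragma's scope; free_if(0) would keep the
 * allocation alive for later offloads. */
void offloadScale(int n, float * restrict A, float * restrict C)
{
#pragma offload target(mic:MIC_DEV) in(A:length(n)) out(C:length(n))
    {
#pragma omp parallel for
        for (int i = 0; i < n; i++)
            C[i] = 2.0f * A[i];
    }
}
```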

More information about the syntax of the offload directive is available from Intel.

Note that the call to doMult() uses the variable size to specify at runtime the number of columns in the 2D matrices. The ability to index contiguous memory through variable-length multi-dimensional arrays (2D, 3D, and so on) was added to the C programming language in the ANSI C99 specification. This feature is important because the offload pragmas transfer only contiguous regions of memory.

Old-school C programmers have been trained to manually calculate the offset of each multi-dimensional array access from the start of a contiguous memory region. This article and the upcoming installments use the newer C99 VLA (variable-length array) feature to make the examples easier to read, potentially enable more compiler optimizations, and achieve high data-transfer performance, because each multidimensional array can be transferred in one operation. For compatibility reasons, it is also important to list the variables used in the multidimensional array declarations first in the calling sequence, because some compilers (such as the Intel compiler) do not forward variable references within an argument list.
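The parameter-ordering point can be seen in a small sketch: the dimension variable must be declared before the array parameters that use it, which is also why it should come first in the calling sequence. The function name is illustrative.

```c
/* C99 VLA sketch: 'size' appears before the array parameter that is
 * declared in terms of it, so the compiler can type-check A[i][j]
 * indexing into one contiguous block of memory. */
void fillIdentity(int size, float A[size][size])
{
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            A[i][j] = (i == j) ? 1.0f : 0.0f;
}
```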

More information about the Xeon Phi offload syntax is available from Intel.

