
Getting to 1 Teraflop on the Intel Phi Coprocessor

By Rob Farber, March 12, 2013

The key to truly high performance with the Phi coprocessor is to express sufficient parallelism and vector capability to fully utilize the device. Here is a timing framework that enables you to measure and optimize performance and push it past 1 teraflop.

Example 2 implements the simple 2x1x2 autoencoder as a single myFunc() that can be used in a least-squares objective function to perform both a PCA and an NLPCA analysis. The type of analysis depends on the definition of the function G(). The C preprocessor macro IN() fetches the example values from memory, while the code in the DO_PRED conditional-compilation block allows this function to be used for prediction and other applications.

Just as with matrix multiplication, this example also makes heavy use of the fused multiply-add instruction and should provide high
floating-point performance. The code for an autoencoder is also useful for benchmarking purposes because all computations occur in vector registers once the in[] vector is read from main memory. By varying the size of the autoencoder, the programmer can adjust the number of flops per byte fetched from memory from low to very high.

Using Persistent Data in Offload Mode

Walking through the myFunc.h code, note that ALLOC, FREE, and REUSE are defined using the C preprocessor as recommended in Ronald Green's "Effective Use of Compiler Features for Offload." This objective function utilizes persistent data when running in offload mode on the Intel Xeon Phi coprocessor; "persistence" means that the data is loaded onto the device and kept there. It's important to follow Intel's notes when using persistent data in offload mode: "Always remember to specify the card in case of multi-card environments. By default, the cards are accessed in a round-robin fashion. Hence the persistent data will not be available on the other cards." For this reason, the preprocessor define MIC_DEV specifies that the example code will run on card 0.

Keeping data resident conforms to the three rules required to achieve high performance on external processors such as GPUs and the Intel Xeon Phi family. Succinctly stated, high offload computational performance with external coprocessors requires that the programmer:

Transfer the data across the PCIe bus to the coprocessor and keep it there

Give the coprocessor enough work to do

Focus on data reuse within the coprocessor(s) to avoid memory bandwidth bottlenecks

Unlike the previous offload examples in this series, the code in this article allocates and transfers the data to be fitted in one routine (here, main(), discussed below), uses the data in objFunc(), and frees it in fini(). As Ronald Green notes, "Memory allocation is controlled by alloc_if and free_if, and data transfer is controlled by in/out/inout/nocopy. The two are independent, but data can only be transferred in and out of allocated memory." Allocation is controlled by the value passed to alloc_if(); similarly, memory on the device is freed according to the flag passed to free_if(). The following C preprocessor defines make the code easier to read:

ALLOC: Allocate the data but do not free it (alloc_if(1) free_if(0))

REUSE: Do not allocate or free the data (alloc_if(0) free_if(0))

FREE: Free the data (alloc_if(0) free_if(1))

The objFunc() method will be called many times, potentially hundreds of thousands to millions of times, during a complex optimization. For this reason, only the optimization parameters contained in the vector x are transferred to the device on each call via the in() clause. (Offload mode assumes the optimization method runs on the host processor, per the mapping shown in Figure 2.) To avoid even the tiny transfer of the variable err, which holds the sum-of-squares error, the out() clause is used so that the value is only transferred off the device; initialization occurs inside the offload pragma, on the device, as noted in Example 4.

The pointer to the large example dataset is merely refreshed by specifying length(0), which avoids an expensive large data transfer per objective function call. The REUSE preprocessor define adds the appropriate alloc_if() and free_if() clauses, so the persistent data in example remains intact across multiple calls. More detailed information about refreshing device pointers and uses of the length parameter in the offload pragma clauses can also be found in "Effective Use of Compiler Features for Offload."

The example data is deallocated on the device with an offload pragma in fini(). The in() clause refreshes the device pointer, while the FREE preprocessor define specifies the correct alloc_if() and free_if() flags to deallocate the persistent memory on the device (see Example 5).
