
Getting to 1 Teraflop on the Intel Phi Coprocessor

By Rob Farber, March 12, 2013

The key to truly high performance with the Phi coprocessor is to express sufficient parallelism and vector capability to fully utilize the device. Here is a timing framework that enables you to measure and optimize performance and push it past 1 teraflop.

Definition of the PCA/NLPCA Vector Function

For flexibility, myFunc.h uses the C preprocessor to include the code for the objective function from the file fcn.h. Note that both linear and nonlinear G() functions are defined via C preprocessor macros in the myFunc.h source code. Example 6 is the full source listing for the 2x1x2 autoencoder that includes all the preprocessor defines.
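A minimal sketch of how such a preprocessor switch might look. The macro names G and USE_LINEAR follow the article; the surrounding code is illustrative, not the article's actual myFunc.h:

```c
#include <math.h>

/* Illustrative sketch only: select a linear or nonlinear G() at compile
 * time, as myFunc.h does via preprocessor macros. Compile with
 * -DUSE_LINEAR for the linear (PCA) case. */
#ifdef USE_LINEAR
#define G(x) (x)                        /* linear PCA: identity activation */
#else
#define G(x) ((x) / (1.0f + fabsf(x))) /* NLPCA: x/(1+|x|) sigmoid */
#endif

/* Wrapper so the macro can be exercised as a function. */
float applyG(float x) { return G(x); }
```

Because G() is a macro rather than a function call, the compiler can inline and vectorize it freely in either configuration.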

As noted previously, this code makes heavy use of the fused multiply-add instruction, so it should deliver high floating-point performance. Loops were avoided because they interfered with vectorization. (Loop-unrolling pragmas did not appear to help.) Further, the autoencoder code is well suited to benchmarking because all computations occur in vector registers once the in[] vector is read from main memory.

Example 7 is a Python code generator that provides a convenient way to generate arbitrary nInput x nH1 x nH2 x nH3 x nInput autoencoders. By varying the size of the autoencoder, the reader can adjust the number of flops per byte fetched from memory from low to very high, as well as investigate how large the program can grow before instruction-fetch overhead dominates the runtime. By default, this Python code generates a 2x10x1x10x2 autoencoder.

The Timing Framework

Example 8 is the complete source code for the timing code that calls the objective function. This framework creates a user-specified-size dataset filled with random numbers. In offload mode, the ALLOC preprocessor define is used along with an in() clause to move the data to the device. The code then fills the parameter vector with random numbers and calls the objective function a user-specified number of times to produce the reported timing statistics.

A simple sanity check is provided, which verifies that the objective function returns the same result during the timing run. An overall timing check is also performed to see if the granularity of the measured wall clock time has affected the timing results.

Building the Timing Framework

Save the source files myFunc.h and fcn.h in a subdirectory (for example, the previous source listings should be saved in the directory simpleExample). Keep the source code for timing.c and genFunc.py in the top-level directory.

Example 9 is a shell script that builds the OpenMP native and offload versions of the timing framework in the simpleExample directory:

Simply by removing the preprocessor definition -DUSE_LINEAR, this same code will compile to an NLPCA run using the x/(1+|x|) sigmoid, or S-shaped curve. This activation function is handy for timing purposes because we know exactly how many floating-point operations it requires; but as noted in my article, "A Better Activation Function for Artificial Neural Networks," it might actually take more optimization steps to reach a solution when solving real problems. It is not clear how many floating-point operations are required to calculate a tanhf() or expf(). By default, the code in myFunc.h assumes seven floating-point operations per call. The article "Test-driving Intel Xeon Phi Coprocessors with a Basic N-body Simulation" notes that the actual number of floating-point operations performed by the expf() call varies.
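This per-call flop accounting is what turns a measured runtime into a flop rate. A minimal sketch of the bookkeeping, with an illustrative function name and signature (not the article's code):

```c
/* Illustrative sketch: convert an assumed per-call operation count
 * (e.g., seven flops per sigmoid evaluation, as myFunc.h assumes) and a
 * measured wall-clock time into a GFlop/s rate. */
double gflops(double flopsPerCall, long nCalls, double elapsedSec)
{
   return (flopsPerCall * (double)nCalls) / (elapsedSec * 1.0e9);
}
```

Note that because the per-sigmoid count is an assumption, the reported flop rate for the NLPCA configuration is only as accurate as that assumption.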

