
Getting to 1 Teraflop on the Intel Phi Coprocessor

By Rob Farber, March 12, 2013

The key to truly high performance with the Phi coprocessor is to express sufficient parallelism and vector capability to fully utilize the device. Here is a timing framework that enables you to measure and optimize performance and push it past 1 teraflop.

Generating an Autoencoder

The genFunc.py code can be used to generate a more complicated autoencoder that provides more floating-point operations per byte fetched from main memory and will get closer to peak performance on the Phi coprocessor. Copy the simpleExample directory to auto2x10x1x10x2 and generate the new fcn.h source code shown in Example 11.
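The workflow might look like the following sketch. The genFunc.py arguments describing the 2-10-1-10-2 network topology are an assumption inferred from the directory name, not the script's documented interface:

```shell
# Copy the working directory and regenerate fcn.h
# (genFunc.py arguments are hypothetical, inferred from the directory name)
cp -r simpleExample auto2x10x1x10x2
cd auto2x10x1x10x2
python genFunc.py 2 10 1 10 2 > fcn.h
```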

The PCA version can be built by simply changing the command-line argument to the BUILD_LINEAR_TIMING script in Example 12. The nonlinear version can be built with the BUILD_NONLINEAR_TIMING script.

sh BUILD_LINEAR_TIMING auto2x10x1x10x2

Example 12: Using the BUILD_LINEAR_TIMING shell script to build an autoencoder.

Runtime Results

To facilitate testing, the PCA_TIMING and NLPCA_TIMING shell scripts always create binaries of the same name. It is important to pay attention to which binary is being used and to the output that reports the function being evaluated. The binaries created are:

timing.omp: an OpenMP executable that runs on the host processor

timing.mic: a native mode Intel Xeon Phi coprocessor executable

timing.off: an offload mode executable that will run on both the host and Intel Xeon Phi coprocessor

Running a linear PCA function using the timing code in auto2x10x1x10x2 in offload mode generates the output in Example 13.

The output tells us that we are running a linear PCA function. In this example, the data transfer to the Phi coprocessor achieved 257 MB/second. This timing information should be reliable, as the average runtime is consistent between the overall and perCall measurement methods. The Phi coprocessor utilized 240 threads.

The key floating-point metrics show that average performance is 912 GF/second. The fastest offload runtime achieved nearly a TF/second. However, there is almost a 3x difference between the fastest and slowest runtimes, even though the code performed 10 warm-up runs.

The timing executable can also run natively on the Phi coprocessor after copying the executable and libiomp5.so to the device and setting LD_LIBRARY_PATH correctly. (Alternatively, use the micnativeloadex utility, or keep a window open on the Intel Xeon Phi coprocessor; see Example 14.)
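The native-mode steps might look like the following sketch. The hostnames and paths are assumptions: mic0 is the default hostname of the first coprocessor, and the libiomp5.so location depends on your Intel compiler installation.

```shell
# Copy the native binary and the OpenMP runtime to the coprocessor
# (the libiomp5.so path is illustrative; adjust to your installation)
scp timing.mic mic0:~/
scp /opt/intel/lib/mic/libiomp5.so mic0:~/
# Run on the device with the runtime on LD_LIBRARY_PATH
ssh mic0 'export LD_LIBRARY_PATH=$HOME; ~/timing.mic'
```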

Even though the number of threads is the same, the same code running in native mode was on average 1.173x faster than offload mode in Example 15. Offload mode can be nearly as efficient as native mode when the time spent performing the computation is large relative to the latency of the data transfers on the PCIe bus. The runtime difference will increase as the problem size decreases.

In addition, the operating system jitter discussed in the first article appears to be the cause of much of this variation. (An excellent starting paper on this topic is "The Case of the Missing Supercomputer Performance.") In offload mode, performance is further decreased by small latencies incurred as the device driver moves the parameters onto the Phi coprocessor and the single floating-point error estimate off the device.

Running the OpenMP version on a 12-core 3.3 GHz Intel X5680 Westmere chipset shows the linear code runs on average 8.5x slower than the offload code and 10x slower than the native mode; see Example 16.

Examining fcn.h in auto2x10x1x10x2 shows that the processor core is performing a large number of dot products that utilize the fused multiply-add instruction. Changing to a nonlinear function illustrates the impact of adding a division and an absolute value to the calculation via the Elliott activation function, G(x) = x/(1+|x|), as can be seen in the following table, where the timing programs were built with the NLPCA_TIMING script:

The key point is that the Intel Xeon Phi coprocessor in native mode runs the nonlinear problem with performance comparable to offload mode. This indicates that the runtime is dominated by computation rather than by latency-limited operations such as the summation spinlock and PCIe data transfers. Note that there is still significant variation between minimum and maximum performance. Performance profiling with Intel's profiler, VTune, will help identify the reasons for these performance changes.

Conclusion

Peak performance is a useful marketing metric that condenses the complexity of any machine from a cellphone to a leadership-class supercomputer into a single number that people can easily grasp and categorize. While peak performance has its place, sophisticated performance competitions such as the TOP500 and GRAPH500 attempt to more realistically evaluate system performance for specific problem domains.

The key to entering the high-performance arena with the Phi product family is to express sufficient parallelism and vector capability to fully utilize the device. Optimized libraries such as MKL can achieve very high performance. Matrix multiplication is a useful computational tool that also makes a great benchmark because it can show how close a device can get to its theoretical peak.

The massively parallel mapping utilized in this article has proven to be an excellent framework for solving real-world problems, as a teaching tool, and as a performance evaluation tool. The autoencoder objective functions used in this tutorial solve real-world PCA and NLPCA problems, yet they can also be modified to stress either the memory subsystem or floating-point capability of a device. It is also possible to define an autoencoder architecture that is not limited by memory bandwidth or computation, but rather by the synchronization required to perform a reduction on a parallel computer. The heavy use of the fused multiply-add instruction means that it is possible to fully utilize the floating-point capability of some devices, and achieve high-performance across a wide range of devices. The near-linear scaling of this mapping means that you can run it with high-performance on a single device or on a supercomputer over a wide range of problem domains.

I encourage you to explore the Intel Xeon Phi coprocessor performance envelope through the use of the provided Python code generator and by writing your own functions. My next article will demonstrate that these objective functions can indeed solve real optimization problems with high performance.

Notes

The current source code needs to be compiled with the older 13.0.0 Intel compiler. While the code can be compiled with the more recent compilers (13.0.1 and 13.1.0), care must be taken that the loop in the objective function vectorizes.

Rather than generating myFunc(), it is more convenient to write a single function that loops over the connections between neurons in different layers. Using loops in myFunc() appears to prevent vectorization and results in a significant performance drop. Unfortunately, loop unrolling does not appear to help.

The article "Optimization and Performance Tuning for Intel Xeon Phi Coprocessors Part 1: Optimization Essentials" is a useful reference for high-performance Intel Xeon Phi coprocessor programming. It notes that alignment of the vectors with __declspec(align(64)) is important. Utilizing -vec-report=6 when compiling confirms that the values are aligned. The Intel article also notes, "Code will run best when data are accessed in sequential address-order from memory. Frequently, developers will change the data structure to allow this linear access pattern. A common transformation is from an array of structures to a structure of arrays (AoS to SoA)." Users can test the performance effects of AoS versus SoA by changing the IN() macro.

