Numerical and Computational Optimization on the Intel Phi

By Rob Farber, March 19, 2013

How tuning functions for large data sets and profiling the results delivers most of the benefits of the Phi's 60 cores without hand-wringing or late-night hacking.

This is the third installment in a series on understanding and using the new Intel Xeon Phi coprocessors to create applications and adapt legacy software to run at high performance. The previous article in this series, Getting to 1 Teraflop on the Intel Phi Coprocessor, showed how to write an objective function that delivers an average performance exceeding a teraflop in native mode and 900 gigaflops in offload mode on a single Intel Xeon Phi coprocessor.

This article will employ these objective functions to solve numerical nonlinear and linear optimization problems. The Intel VTune profiler will be used to examine the runtime of the optimization code and help gain insight into the Phi coprocessor's performance envelope. In the process, I provide full working source code for a real-world example that achieves performance comparable to the best observed performance with the optimized MKL matrix multiplication method shown in the first article in this series. The intention is to show that the performance potential of Phi hardware is accessible to programmers and to provide a working example that can be modified so you can explore the device performance envelope in both native and offload programming modes.

The key to entering the high performance arena with the Phi product family is to express sufficient parallelism and vector capability to fully utilize the device. After that, other device characteristics (such as memory bandwidth, memory access pattern, the number and types of floating-point calculations per datum, plus synchronization operations) determine how close an application can get to peak performance when running on all the Intel Xeon Phi cores.

It is important to note that even though this article focuses on least squares objective functions for linear and nonlinear principal components analysis, the mapping itself is generic and has applicability to a wide range of numerical problems.

Building the nlopt Library

This article and the rest of the series uses the open-source, freely available nlopt optimization library, although most numerical methods that call a user-specified function to optimize will work. As I note in my book CUDA Application Design and Development, it is important to utilize numerical methods that require the objective function to return only a single floating-point value. Techniques that utilize an error vector limit the scalability of parallel and distributed implementations by imposing data bandwidth and memory capacity hardware restrictions.

Nlopt can be built on Linux by downloading the source code from the nlopt website. Version 2.3 is currently the latest. The nlopt software is well designed and easily built from source because it uses the standard GNU autoconf build system.

A host-processor-based version for offload CPU optimizations can be built using the Intel icc compiler with the following commands:
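The exact commands depend on your environment; a representative autoconf build using the Intel compilers might look like the following (the tarball name and install prefix are illustrative, not prescribed by the article):

```shell
# Unpack the nlopt 2.3 source and build it with the Intel compilers.
# The install prefix is arbitrary; it is named for the host architecture
# to distinguish it from the Phi build described below.
tar -xzf nlopt-2.3.tar.gz
cd nlopt-2.3
./configure CC=icc CXX=icpc --prefix=/opt/nlopt-2.3-x86_64
make -j 8
make install
```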

nlopt can also be cross-compiled to run natively on the Phi coprocessor by adding -mmic to the compiler-flag environment variables. Note that the Phi versions of the libraries must be linked, so the installation directory name is changed to reflect the device architecture and prevent conflicts with the host build.
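A sketch of the cross-compile, under the assumption that the same source tree is reconfigured: -mmic directs icc to generate code for the coprocessor, and a --host override keeps configure from trying to run MIC test binaries on the host (the host triplet and install prefix here are illustrative):

```shell
# Reconfigure the same source tree for the Phi coprocessor.
# -mmic targets the MIC architecture; the separate prefix keeps the
# MIC libraries from clobbering the host installation.
cd nlopt-2.3
make distclean
./configure CC=icc CFLAGS=-mmic CXX=icpc CXXFLAGS=-mmic \
    --host=x86_64-k1om-linux --prefix=/opt/nlopt-2.3-mic
make -j 8
make install
```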

A Least Squares Objective Function with Data I/O

In general, most data is gathered experimentally or culled from digital archives. For ease of use, the source code for myFunc.h is modified to include a function init() that reads data either from a file or from stdin. For the purposes of this article, the linear and nonlinear genData.c programs provide working examples of how to generate the data sets used here. Readers are free to load their own data from observations or other data generators to test and solve their own computational problems.

The format of the data file comprises a header that defines the size of the input vector in the data set, the size of the output vector, and the number of examples (or observations) in the data set. The header is followed by the input and output vector of each observation.

Listing One is the complete revised source code for myFunc.h presented in the previous article, with the addition of an init() function that can read the data from either stdin or a file. The use of stdin means the data can be piped to the training code when running natively on the Phi. Thus, the data does not need to consume precious memory on the Phi RAM file system, nor does it require an NFS mount between the coprocessor and the host, with its setup complexity, performance limitations, and potential to exacerbate operating system jitter.
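As one hypothetical invocation of this stdin path, the data set can be piped from the host to a native binary already copied to the coprocessor (the "mic0" hostname and the binary and data-file names here are assumptions, not from the article):

```shell
# Stream the training data from the host over ssh to the native
# Phi binary, so nothing is staged on the coprocessor's RAM disk.
ssh mic0 /tmp/nloptTest < linear.dat
```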
