
Numerical and Computational Optimization on the Intel Phi

How tuning functions for large data sets and profiling the results captures most of the benefit of the Phi's 60 cores without hand-wringing and late-night hacking.

Fitting a PCA Autoencoder Using Native Mode

The following commands use the train_pca.mic executable to fit the data. Note that the data is piped to the executable to preserve the coprocessor's precious onboard RAM. The variable DEV can be modified to run on any Phi coprocessor in the system; in this example, DEV is set to mic1. The scp command is used to transfer data and results between the Intel Xeon Phi coprocessor and the host. It is assumed that the libiomp5.so shared object file was previously copied to /tmp on the device.
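A minimal sketch of such a native-mode run, reconstructed from the description above, follows. Only the executable name, DEV=mic1, and the /tmp convention come from the text; the data-generator arguments and the train_pca.mic command line are assumptions, not the article's actual script.

```shell
# Hypothetical native-mode run script; command-line conventions are assumed.
DEV=mic1                      # target coprocessor (from the text)

# Copy the native executable to the device; libiomp5.so is assumed to
# already be in /tmp on the device, as noted above.
scp train_pca.mic $DEV:/tmp

# Pipe the training data to the executable so the full data set never
# occupies the coprocessor's limited onboard RAM.
./gen_pca 30000000 0.1 | ssh $DEV \
    "cd /tmp; LD_LIBRARY_PATH=/tmp ./train_pca.mic /dev/stdin param.txt"

# Retrieve the fitted parameters for host-side prediction.
scp $DEV:/tmp/param.txt .
```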

Figure 6 shows the performance of the linear 2x10x1x10x2 autoencoder-based objective function as the data set size varies and according to processing mode. As can be seen, native execution on the Phi coprocessor quickly outstrips both offload mode and the 3.3 GHz Westmere x5680 dual-socket host processor.

Figure 6: Performance of a linear 2x10x1x10x2 PCA autoencoder according to size, machine, and mode.

The performance of offload mode gradually improves as the latency and bandwidth limitations of the PCIe bus are overshadowed by the runtime of the objective function. It is expected that the performance of offload mode will improve with time, especially as this is the only way to utilize multiple devices within a system or as MPI processes in a compute cluster.

A Nonlinear Principal Components Optimization

While PCA fits straight lines, NLPCA can utilize continuous open or closed curves to account for variance in the data. As a result, NLPCA has the ability to represent nonlinear problems in a lower-dimensional space. NLPCA has wide applicability to numerous challenging problems, including image and handwriting analysis, biological modeling, climate science, and chemistry.

Building and Running the NLPCA Analysis

An NLPCA analysis can be performed by changing the definition of the G() function in myFunc.h, which requires nothing more than editing the "-DUSE_LINEAR" flag in the build script. The source code was designed to make this process as easy as copying the pca directory to an nlpca directory and editing the BUILD script.
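A sketch of what the nlpca BUILD script might contain appears below. The executable names match the list that follows; the source-file names and compiler invocations (apart from the absence of -DUSE_LINEAR) are assumptions, not the article's actual script.

```shell
# Hypothetical BUILD script for the nlpca directory. Note that
# -DUSE_LINEAR is absent, so G() in myFunc.h becomes the nonlinear
# activation. Source-file names are assumed.
icc -O3 -openmp nlpcaGen.cc -o gen_nlpca                      # data-set generator (host)
icc -O3 -openmp -mmic nlpcaTrain.cc -o train_nlpca.mic        # native mode
icc -O3 -openmp nlpcaTrain.cc -o train_nlpca.off              # offload mode (via pragmas)
icc -O3 -openmp -no-offload nlpcaTrain.cc -o train_nlpca.omp  # host OpenMP only
icc -O3 nlpcaPred.cc -o pred_nlpca                            # sequential host prediction
```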

The BUILD script creates the following applications:

gen_nlpca: Generates the nlpca data set.

train_nlpca.mic: The native mode training application.

train_nlpca.off: The offload mode training application.

train_nlpca.omp: A training application that will run in parallel on the host processor cores.

pred_nlpca: The sequential prediction program that will run on the host.

The Elliott activation function, x/(1+|x|), used in this article is nice for timing purposes because we know how many floating-point operations it requires. Unfortunately, this activation function, as noted in "A Better Activation Function for Artificial Neural Networks," may require more optimization steps to reach a solution than more-conventional activation functions when solving real problems. Two conventional activation functions, tanh() and the logistic function, can be enabled by simply changing the definition of G() with a preprocessor define. For timing purposes, each call to tanhf() or expf() is assumed to take seven floating-point operations. Note that this is only an estimate because the number of instructions required for each of these functions varies. However, you can use this code to experiment with different activation functions.
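The three candidate activations are easy to sanity-check outside the C source. The throwaway awk sketch below (not part of the article's code) evaluates the Elliott, tanh, and logistic functions at a point, with the timing-model flop counts noted in comments:

```shell
# Evaluate the three activation functions discussed above at a point x.
# The article's G() is defined in C via a preprocessor define; this is
# only an illustrative sketch of the functions themselves.
activations() {
  awk -v x="$1" 'BEGIN {
    ax = (x < 0) ? -x : x
    elliott  = x / (1 + ax)            # roughly 3 flops: abs, add, divide
    e2x      = exp(2 * x)
    tanh_x   = (e2x - 1) / (e2x + 1)   # tanhf() assumed ~7 flops in the timing model
    logistic = 1 / (1 + exp(-x))       # expf() assumed ~7 flops as well
    printf "%.4f %.4f %.4f\n", elliott, tanh_x, logistic
  }'
}

activations 1    # prints: 0.5000 0.7616 0.7311
```

Note how the Elliott function saturates far more slowly than tanh() at the same input, which is one reason it can need more optimization steps to reach a solution.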

Fitting an NLPCA Autoencoder Using Offload Mode

The following bash script is nearly identical to RUN_OFFLOAD for the pca directory, but the modified script will create an NLPCA data set of 30,000,000 observations generated with a variance of 0.1, which the offload mode train_nlpca.off executable will fit. A 1000-point prediction set with zero variance will be used for prediction purposes. This is identical in size and character to the pca runs. The UNIX tail command strips the informative messages at the beginning of the prediction results and saves the remainder in the file plot.txt, to make it easy to graph the final result. The original results are kept in the output.txt file.
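A sketch of such a run script follows, assuming command-line conventions for the executables; the argument order and the number of header lines stripped by tail are guesses, not the article's actual script.

```shell
# Hypothetical offload-mode run script for the nlpca directory.
./gen_nlpca 30000000 0.1 > nlpca_train.txt   # 30M observations, variance 0.1
./gen_nlpca 1000 0.0 > nlpca_pred.txt        # 1000-point, zero-variance prediction set

./train_nlpca.off nlpca_train.txt param.txt      # offload-mode training
./pred_nlpca param.txt nlpca_pred.txt > output.txt

# Strip the informative header (line count is a guess) so the result
# graphs easily; output.txt keeps the complete results.
tail -n +4 output.txt > plot.txt
```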

The output of the NLPCA training run in offload mode on the Intel Xeon Phi coprocessor follows. Note that the objective function was called 109,197 times and delivered an average of 342 gigaflops.

Examining the resulting graph (Figure 7) shows that the optimized autoencoder found a reasonable-looking fit to the data shown in Figure 6.

Figure 7: Offload mode NLPCA line prediction.

VTune Performance Analysis

After building the train_nlpca.off executable with the -g flag, I ran amplxe-gui and performed a Hot Spot analysis limited to one minute. The CPU usage is in the ideal range, and myFunc consumes most of the runtime. (The G() function also shows up as a hot spot.) As with the PCA timeline, most threads start and complete at the same time, and many appear to fully occupy their processing core. The dot products still consume a significant amount of runtime; even a simple G() function requires additional instructions and data movement. VTune highlighted the assembly-language instructions associated with the appropriate line of C source code (in this case, the small number of instructions that perform the Elliott activation function G()). Because this operation does not perform two floating-point operations per clock, it slows overall performance.
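The same one-minute Hot Spot collection can also be driven from VTune's command-line tool rather than amplxe-gui. In this sketch, the result-directory name and the application arguments are arbitrary assumptions:

```shell
# Collect a Hot Spot profile for 60 seconds, then summarize it.
amplxe-cl -collect hotspots -duration 60 -r r001hs -- \
    ./train_nlpca.off nlpca_train.txt param.txt
amplxe-cl -report hotspots -r r001hs
```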

Fitting an NLPCA Autoencoder Using Native Mode

The following commands utilize the train_nlpca.mic executable to fit the data. As with the pca run, the data is piped to the executable to preserve precious onboard RAM resources. The variable DEV can be modified to run on any Phi coprocessor in the system. In this example, DEV is set to mic1. The scp command is again used to transfer data and results between the Phi coprocessor and host.

Figure 8 shows the performance of a 2x10x1x10x2 autoencoder-based objective function using the Elliott activation function as the data set size varies. Surveys using the host Westmere processor and the Intel Xeon Phi coprocessor operating in offload and native modes are shown on this graph.

As can be seen, native mode performance on the Phi coprocessor quickly outstrips both the offload and the 3.3 GHz Westmere x5680 dual-socket host processor.
The performance of offload mode gradually improves as the latency and bandwidth limitations of the PCIe bus are dominated by the runtime of the objective function. It is expected that the performance of offload mode will improve with time, especially since this is the only way to utilize multiple devices within a system or as MPI processes in a compute cluster.

Figure 8: Performance of a 2x10x1x10x2 NLPCA autoencoder according to size, machine, and mode.

Conclusion

This article demonstrates how to combine Phi coprocessor-based objective functions with existing numerical optimization libraries to solve real problems with high performance. The freely available nlopt library was built to run on the Intel Xeon Phi coprocessor in both native and offload modes. The objective functions discussed in "Getting to 1 Teraflop on the Intel Phi Coprocessor" were used to fit example data sets in both native and offload modes, while still delivering performance in the 300-gigaflops-to-teraflop range. A survey across problem sizes was performed to get a sense of how offload mode compares with native execution and against a 24-core 3.3 GHz Westmere processor set. Small problems in particular performed nicely on the Phi coprocessor in native mode due to the elimination of latencies associated with the PCIe bus.

The Intel VTune performance analyzer confirmed that the application was effectively using the Intel Xeon Phi wide-vector instructions. Thread utilization across all the cores was excellent. The use of the VTune analyzer allowed us to examine hotspots and the memory bandwidth behavior of complex functions.

Finally, I encourage you to explore the Phi coprocessor performance envelope through the use of the provided Python code generator and by performing your own optimizations. The software framework in this article is general enough so that Phi coprocessors can be integrated into existing analytic workflows.

Rob Farber is a frequent contributor to Dr. Dobb's on CPU and GPGPU programming topics.
