Benchmarks: Intel Xeon Phi vs. NVIDIA Tesla GPU

Accelerators battle for compute-intensive analytics in Finance

At Xcelerit, people often ask us: “Which is better, Intel Xeon Phi or NVIDIA Kepler?” The general answer has to be “it depends,” as this is heavily application-dependent. But what if we zoom in on real-world problems in computational finance, the kinds of problems that quants in investment banks and across the financial industry deal with every day? Let’s analyse two different financial applications and see how they perform on each platform. To cover different types of algorithms often found in finance, we chose an embarrassingly parallel Monte-Carlo algorithm (with fully independent paths) for the first test application, and a Monte-Carlo algorithm with cross-path dependencies and iterative time-stepping for the second.

Replaced absolute times with speed-ups vs. sequential for better readability

Intel Xeon Phi 5110P

NVIDIA Tesla K20X GPU

Note: We chose to compare just one instrument pricing for each algorithm, as these are at the core of many performance-critical applications used in banks. For example, large batches of many thousands of instruments are priced at once, instruments are valued under many different risk scenarios, or prices are updated in real-time. As each pricing is independent, the performance typically scales linearly. In any case, reducing the execution time of a single pricing as much as possible is paramount.

As can be seen, the features and architecture of the processors are very different, and from the theoretical GFLOPs and memory bandwidth, both the Xeon Phi and Tesla GPU should be similar in performance. However, as we’ll see later, theory and practice often diverge and for a given application the performance depends on many other factors.

For example, to fully utilize both memory and processor, and in the absence of cache, the Sandy-Bridge CPU would need to perform 52 floating point operations for each 8-byte memory operation. Anything less would make the processor wait for data from memory, i.e., the application would be memory-bound. On the Xeon Phi, 25.3 operations per memory operation need to be performed, and the Tesla GPU needs 42.2 operations per memory access. If there are more operations than that, the application becomes compute-bound. However, with the presence of on-chip caches, this picture changes completely as data kept in cache can be accessed much faster than external memory. Performance also becomes less predictable, and it cannot be determined theoretically which processor is best for a given application and whether it is compute- or memory-bound.
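The balance point described above is simply peak compute throughput divided by the number of operands memory can deliver per second. The following is a minimal sketch of that arithmetic; the function names are hypothetical, and the example figures in the usage note (about 1,011 GFLOP/s double precision and 320 GB/s for the Xeon Phi 5110P) are commonly quoted spec-sheet values assumed here, not numbers taken from this article.

```cpp
#include <cassert>

// Floating-point operations a processor must perform per memory operand to
// keep compute units and the memory bus equally busy, ignoring caches.
//   peak_gflops    theoretical peak throughput in GFLOP/s (assumed input)
//   bandwidth_gbs  theoretical memory bandwidth in GB/s (assumed input)
//   operand_bytes  size of one operand, 8 for a double
double flops_per_memory_op(double peak_gflops, double bandwidth_gbs,
                           double operand_bytes = 8.0) {
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0 && operand_bytes > 0.0);
    const double giga_operands_per_second = bandwidth_gbs / operand_bytes;
    return peak_gflops / giga_operands_per_second;
}

// A kernel doing fewer operations per operand than the balance point waits
// on memory; one doing more is limited by arithmetic throughput.
bool is_memory_bound(double kernel_flops_per_operand, double balance) {
    return kernel_flops_per_operand < balance;
}
```

Under the assumed figures, `flops_per_memory_op(1011.0, 320.0)` evaluates to roughly 25.3, in line with the Xeon Phi value quoted above. As the text notes, caches make real behaviour far less predictable than this simple ratio suggests.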

Further, applications are typically composed of sequential parts and parallel parts, and the overall application performance is heavily influenced by the fraction of the sequential part. If only half the application can be parallelised, the maximum achievable speedup by parallelisation is 2x, though it will be less than that in practice (Amdahl’s Law).
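As a small illustration of Amdahl’s Law, the sketch below computes the speedup bound for a given parallel fraction and worker count. The function names are hypothetical, and the model assumes the parallel part scales perfectly, which real hardware rarely achieves.

```cpp
#include <cassert>

// Amdahl's Law: speedup bound when a fraction `parallel_fraction` of the
// runtime is spread perfectly over `workers` processing units and the rest
// stays sequential.
double amdahl_speedup(double parallel_fraction, double workers) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(workers >= 1.0);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers);
}

// Limit of the bound for an unlimited number of workers.
double amdahl_limit(double parallel_fraction) {
    assert(parallel_fraction >= 0.0 && parallel_fraction < 1.0);
    return 1.0 / (1.0 - parallel_fraction);
}
```

With half of the application parallelised, `amdahl_limit(0.5)` returns 2, and `amdahl_speedup(0.5, 16.0)` stays below that at about 1.88, which is why adding more cores helps so little once the sequential part dominates.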

So let’s put these processors to the test for real-world applications.

Test Setup

To allow for a fair comparison of the processors, we used low-level programming tools and techniques to tune the application performance to the maximum on each platform: OpenMP threading, vectorization pragmas and attributes, hand-tuned CUDA kernels, and native libraries (CUBLAS, MKL, etc.), followed by multiple iterations of profiler-assisted performance tuning. Thus there are three different code bases for each application, all hand-tuned. For the Xeon Phi, the applications were executed natively on the co-processor (not in offload mode).

Application 1: Monte-Carlo LIBOR Swaption Portfolio Pricer

Algorithm

Details of this algorithm have been previously described. For convenience, we briefly summarise it here:

A Monte-Carlo simulation is used to price a portfolio of LIBOR swaptions. Thousands of possible future development paths for the LIBOR interest rate are simulated using normally-distributed random numbers. For each of these Monte-Carlo paths, the value of the swaption portfolio is computed by applying a portfolio payoff function. The equations for computing the LIBOR rates and payoff are given in Prof. Mike Giles’ notes. Furthermore, the sensitivity of the portfolio value with respect to changes in the underlying interest rate is computed using Adjoint Algorithmic Differentiation (AD). This sensitivity is a Greek, called λ, and its computation is detailed in the paper Monte Carlo evaluation of sensitivities in computational finance. Both the final portfolio value and the λ value are obtained by computing the mean over all per-path values. The algorithm is illustrated in the following graph:

Performance

The algorithm has been executed on all three platforms, in double precision, for varying numbers of paths. The portfolio consisted of 15 swaptions, simulated for 40 time steps. For reference, we’ve also included a straightforward sequential C++ implementation, running on a single core of the Sandy-Bridge CPU. The results are listed in the table below:

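| Paths | Sequential | Sandy-Bridge CPU [1,2] | Xeon Phi [1,2] | Tesla GPU [2] |
|-------|------------|------------------------|----------------|---------------|
| 128K  | 13,062 ms  | 694 ms                 | 603 ms         | 146 ms        |
| 256K  | 26,106 ms  | 1,399 ms               | 795 ms         | 280 ms        |
| 512K  | 52,223 ms  | 2,771 ms               | 1,200 ms       | 543 ms        |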

1 The Sandy-Bridge and Phi implementations make use of SIMD vector intrinsics.

2 The MRG32K3a random generators from the cuRAND library (GPU) and the MKL library (Sandy-Bridge/Phi) were used.

For this application it can be clearly seen that NVIDIA’s Tesla GPU outperforms both other platforms significantly, being 5.1x faster than the multi-core dual Sandy-Bridge CPU and 2.2x faster than the Xeon Phi (512K paths). The Xeon Phi is 2.3x faster than the Sandy-Bridge. Moreover, compared to the sequential implementation, the optimized Sandy-Bridge is 19x as fast, the Phi is 43.5x as fast, and the Kepler GPU is 96x as fast.

This algorithm is embarrassingly parallel with completely independent Monte-Carlo paths, which suits parallel accelerator processors very well. The full application is parallelisable with no sequential parts and very little synchronization is required. In addition, the algorithm is clearly compute-bound, with a substantial amount of math and relatively little memory access. This is ideal for GPUs and a similar performance can be expected for other Monte-Carlo simulations with similar characteristics.
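The structure that makes this kind of simulation embarrassingly parallel can be sketched as follows: each chunk of paths gets its own random stream, works without touching any other chunk, and only the final sum is combined. This is a simplified stand-in, not the benchmarked code. The payoff is a toy European call rather than the LIBOR swaption portfolio, `std::mt19937_64` replaces the MRG32K3a generator, and all names and parameters are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

// Toy per-path work: discounted payoff of a European call under geometric
// Brownian motion, driven by one standard normal draw z.
// (Assumed example parameters; NOT the swaption payoff from the article.)
double call_payoff(double z) {
    const double s0 = 100.0, strike = 100.0, rate = 0.05, vol = 0.2, maturity = 1.0;
    const double terminal = s0 * std::exp((rate - 0.5 * vol * vol) * maturity
                                          + vol * std::sqrt(maturity) * z);
    return std::exp(-rate * maturity) * std::max(terminal - strike, 0.0);
}

// Mean payoff over `paths` independent paths, split into `chunks` pieces.
// Chunks share no state, so they can run on separate threads; the only
// synchronisation is the final sum reduction.
double mc_mean(int paths, int chunks, std::uint32_t seed) {
    assert(paths > 0 && chunks > 0);
    double total = 0.0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+ : total)
#endif
    for (int c = 0; c < chunks; ++c) {
        // One independent random stream per chunk, derived from seed and chunk id.
        std::seed_seq seq{seed, static_cast<std::uint32_t>(c)};
        std::mt19937_64 rng(seq);
        std::normal_distribution<double> normal(0.0, 1.0);
        const long long begin = static_cast<long long>(paths) * c / chunks;
        const long long end = static_cast<long long>(paths) * (c + 1) / chunks;
        double local = 0.0;
        for (long long p = begin; p < end; ++p) local += call_payoff(normal(rng));
        total += local;
    }
    return total / paths;
}
```

A call such as `mc_mean(200000, 64, 42u)` should land close to the Black-Scholes value of about 10.45 for these parameters. Compiled with OpenMP enabled, the chunk loop runs in parallel; without it, the same code runs sequentially and the pragma is skipped.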

Application 2: Monte-Carlo Pricing of American Options

Algorithm

This application prices a vanilla American put option using a Monte-Carlo simulation. In contrast to European options, which can only be exercised at maturity, American options can be exercised at any time. This poses an additional complexity for Monte-Carlo pricers, as the option’s value for early exercise at each time step needs to be evaluated and compared to the expected value when not exercising it at this step. This is typically solved with a Longstaff-Schwartz algorithm, which involves computing regression coefficients across all paths to go from one time step to the previous one. Thus, the algorithm walks iteratively backwards in time, starting from the final time step, and involves a regression across paths. The Monte-Carlo paths are not independent and all time steps need to be solved iteratively, with parallelisation opportunities only within each step and for the initial asset price generation. The final price is the average of all paths at time step zero. The algorithm is illustrated in the following graph:

Performance

The performance has been measured on all platforms for different numbers of Monte-Carlo paths, using 256 time steps each and 3 regression coefficients. We’re giving speed-ups vs. a sequential reference implementation running on a single core of the Ivy-Bridge processor. The results are:

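| Paths | Ivy-Bridge CPU [1,2] | Xeon Phi [1,2] | Tesla GPU [2,3] |
|-------|----------------------|----------------|-----------------|
| 128K  | 41.8x                | 20.9x          | 52.9x           |
| 256K  | 48.5x                | 31.0x          | 71.8x           |
| 512K  | 45.2x                | 39.7x          | 86.6x           |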

1 The Ivy-Bridge and Phi implementations make use of OpenMP and vectorization pragmas/attributes as much as possible.

2 The MRG32K3a random generators from the cuRAND library (GPU) and the MKL library (Ivy-Bridge/Phi) were used.

3 Many small CUDA kernels need to be executed on the GPU, as parallelisation can only be done within each time step.

The Tesla GPU is roughly 2.2x to 2.5x faster than the Xeon Phi, and between 1.3x and 1.9x faster than the CPU. The gap between CPU and GPU is much smaller than for the LIBOR swaption pricer above, and for 128K paths the two are comparable.

This outcome can be explained by the iterative nature of the algorithm and by the heavy memory operations involved. The CPU is optimised for general-purpose workloads, has larger caches, and can solve iterative problems very well.
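To make the iterative structure concrete, here is a compact single-threaded sketch of a Longstaff-Schwartz pricer. It is not the benchmarked implementation: the option parameters, the quadratic basis {1, x, x²} (chosen here only to give three regression coefficients), the scaling x = S/K, the use of `std::mt19937_64` instead of MRG32K3a, and all function names are assumptions for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;

// Solve the 3x3 system a * x = b by Gaussian elimination with partial pivoting.
Vec3 solve3(Mat3 a, Vec3 b) {
    for (int col = 0; col < 3; ++col) {
        int piv = col;
        for (int r = col + 1; r < 3; ++r)
            if (std::fabs(a[r][col]) > std::fabs(a[piv][col])) piv = r;
        std::swap(a[col], a[piv]);
        std::swap(b[col], b[piv]);
        for (int r = col + 1; r < 3; ++r) {
            const double f = a[r][col] / a[col][col];
            for (int c = col; c < 3; ++c) a[r][c] -= f * a[col][c];
            b[r] -= f * b[col];
        }
    }
    Vec3 x{};
    for (int r = 2; r >= 0; --r) {
        double s = b[r];
        for (int c = r + 1; c < 3; ++c) s -= a[r][c] * x[c];
        x[r] = s / a[r][r];
    }
    return x;
}

// American put via Longstaff-Schwartz (assumed toy parameters).
double american_put_lsm(int paths, int steps, unsigned seed) {
    assert(paths > 0 && steps > 0);
    const double s0 = 100.0, strike = 100.0, rate = 0.05, vol = 0.2, maturity = 1.0;
    const double dt = maturity / steps, disc = std::exp(-rate * dt);
    std::mt19937_64 rng(seed);
    std::normal_distribution<double> normal(0.0, 1.0);
    std::vector<double> s(static_cast<std::size_t>(paths) * (steps + 1));
    auto S = [&](int p, int i) -> double& {
        return s[static_cast<std::size_t>(p) * (steps + 1) + i];
    };
    // Forward pass: asset price generation, independent per path.
    for (int p = 0; p < paths; ++p) {
        S(p, 0) = s0;
        for (int i = 1; i <= steps; ++i)
            S(p, i) = S(p, i - 1) * std::exp((rate - 0.5 * vol * vol) * dt
                                             + vol * std::sqrt(dt) * normal(rng));
    }
    // cash[p]: value of path p's future cash flow, discounted to the current step.
    std::vector<double> cash(paths);
    for (int p = 0; p < paths; ++p) cash[p] = std::max(strike - S(p, steps), 0.0);
    // Backward pass: the time steps must be processed one after another.
    for (int i = steps - 1; i >= 1; --i) {
        Mat3 a{};
        Vec3 b{};
        int in_the_money = 0;
        // Cross-path reduction: accumulate normal equations over in-the-money paths.
        for (int p = 0; p < paths; ++p) {
            cash[p] *= disc;
            if (strike - S(p, i) <= 0.0) continue;
            const double x = S(p, i) / strike;
            const Vec3 basis{1.0, x, x * x};
            for (int j = 0; j < 3; ++j) {
                for (int l = 0; l < 3; ++l) a[j][l] += basis[j] * basis[l];
                b[j] += basis[j] * cash[p];
            }
            ++in_the_money;
        }
        if (in_the_money < 3) continue;  // too few points to fit three coefficients
        const Vec3 beta = solve3(a, b);
        // Exercise decision: parallel across paths again, but only within this step.
        for (int p = 0; p < paths; ++p) {
            const double exercise = strike - S(p, i);
            if (exercise <= 0.0) continue;
            const double x = S(p, i) / strike;
            const double continuation = beta[0] + beta[1] * x + beta[2] * x * x;
            if (exercise > continuation) cash[p] = exercise;
        }
    }
    double sum = 0.0;
    for (int p = 0; p < paths; ++p) sum += cash[p] * disc;  // discount to step zero
    return sum / paths;
}
```

The outer backward loop cannot be parallelised, and each step needs a reduction over all paths before any path can take its exercise decision. On a GPU this translates into many short kernel launches, while a CPU can keep the working set in cache between steps. For these assumed parameters, `american_put_lsm(20000, 50, 7u)` should come out near 6, above the European put value of about 5.57.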

Note: There is an approximate version of a Monte-Carlo pricer for American options that is better suited to parallel architectures and is expected to give better performance on both Xeon Phi and Tesla GPU (Glasserman, 2003: “Monte Carlo Methods in Financial Engineering”). It computes the regression coefficients in an initial step, using a much smaller number of paths, and applies these in a full simulation over all paths later. This full simulation does not require the regression step, i.e., each Monte-Carlo path is fully independent, and it can therefore be fully parallelised across all paths. However, this method is controversial among quantitative analysts and gives slightly different results. Thus, we did not include it in these benchmarks.

Conclusions

We’ve seen that there is one processor that needs to be added to the picture: the commodity multi-core CPU. This is already a part of many server configurations, and for some applications, e.g., Monte-Carlo pricing of American options, it can give performance comparable to or better than an accelerator processor when optimized correctly. Between NVIDIA’s Kepler GPUs and Xeon Phi, the GPU wins for both of our test applications.

However, the results are close and we can expect this picture to change for other applications. Further, the Xeon Phi is brand-new (released 2013), while the NVIDIA Tesla range has been around since 2007, so the GPU is a more mature accelerator platform.

Hand-tuning the code for all three platforms for the highest performance requires significant expertise, time, and a deep knowledge of the target hardware. One way to side-step this effort is to use the Xcelerit SDK. With just minor modifications to existing code, performance equivalent to manually optimized code can be achieved without any hand-tuning. What’s more, a single code base can then run on multi-core, GPU, or any supported hybrid configuration.