Tuesday, January 8, 2019

The combination of the IBM® POWER® processors and the NVIDIA
GPUs provides a platform for heterogeneous high-performance computing
that can run several technical computing workloads efficiently. The
computational capability is built on top of massively parallel and
multithreaded cores within the NVIDIA GPUs and the IBM POWER processors.
You can offload parallel operations within applications, such as data
analysis or high-performance computing workloads, to GPUs or other devices.

The OpenMP API specification provides a set of
directives that instruct the compiler and runtime to offload a block of
code to a device, such as a GPU or an FPGA. In recent generations of the
POWER architecture, the POWER processor can be attached to the NVIDIA GPU
via high-speed NVLink for fast data transfer between CPU and GPU. This
hardware configuration is an essential part of the CORAL project with the
U.S. national labs and brings us closer to exascale computing. The IBM XL
compilers have a long history of supporting the OpenMP API, starting from
the first version of the specification. They continue to support the
OpenMP specification and to exploit the POWER hardware architecture with
GPUs. The XL compiler team works closely with the IBM Research team to
develop the compiler infrastructure for the offloading mechanism. In
addition, the team collaborates with the open source community on the
runtime interface for the GPU device runtime library.
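
As a rough illustration of what offloading a block of code looks like in source form, here is a minimal sketch in C. The function name saxpy and its parameters are hypothetical and chosen only for this example. The target directive marks the block to offload, and the map clauses describe which arrays are copied to and from the device.

```c
/* Hypothetical helper (not from the XL documentation):
   computes y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy(int n, float a, const float *x, float *y)
{
    /* The target directive asks the compiler and runtime to run this
       block on a device. x is copied to the device; y is copied to the
       device and back to the host when the region ends. */
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    {
        /* Inside the offloaded block, share the loop among threads. */
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }
}
```

If no device is available, or if the code is compiled without OpenMP support, the same function simply runs on the host. With the XL compilers, offloading is typically enabled with options such as -qsmp=omp together with -qoffload; check the compiler reference for your release.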

An OpenMP program (C, C++, or Fortran) with device constructs is fed into the High-Level Optimizer and partitioned into CPU and GPU parts. The High-Level Optimizer optimizes the intermediate code, which benefits the code for both the CPU and the GPU. The CPU part is sent to the POWER Low-Level Optimizer for further optimization and code generation. The GPU part is translated to LLVM IR and then fed into the LLVM optimizer in the CUDA Toolkit, which performs optimization specific to NVIDIA GPUs and generates PTX code. Finally, the linker is invoked to link the objects into an executable.

From this outline, one can see that the compiler draws on expertise from both worlds to ensure that applications are optimized accordingly. For the CPU part, the POWER Low-Level Optimizer, which embodies many years of optimization knowledge on the POWER architecture, generates optimized code. For the GPU part, the GPU expertise in the CUDA Toolkit is used to generate optimized code for the NVIDIA device. As a result, the entire application is optimized in a balanced way.
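
To make the CPU/GPU partitioning a little more concrete, the sketch below shows a single C function that contains both parts. The function name region_ran_on_host is hypothetical. Everything outside the target region belongs to the CPU part; the block under the target directive is the part a compiler would hand to the GPU path. The standard OpenMP routine omp_is_initial_device reports where that block actually executed.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns 1 if the target region ran on the host,
   0 if it ran on an attached device. */
int region_ran_on_host(void)
{
    int on_host = 1;   /* CPU part: ordinary host code */

    /* GPU part: this block is what gets compiled for the device.
       on_host is copied back to the host when the region ends. */
    #pragma omp target map(from: on_host)
    {
#ifdef _OPENMP
        on_host = omp_is_initial_device() ? 1 : 0;
#else
        on_host = 1;   /* no OpenMP support: the block runs on the host */
#endif
    }

    return on_host;    /* CPU part again */
}
```

The _OPENMP guard is only there so that the sketch still builds with a compiler that ignores OpenMP pragmas; in that case the function always returns 1.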

The XL C/C++ V13.1.5 and XL Fortran V15.1.5 compilers are among the first compilers to support NVIDIA GPU offloading using the OpenMP 4.5 programming model. This release supports the basic device constructs (the target, target update, and target data directives), which lets users experiment with the offloading mechanism and port code to the GPU. Another important aspect of offloading computation to devices is data mapping; the map clause is also supported in this release.
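
The following sketch shows how the target, target data, and target update directives and the map clause might be combined in C. The function scale_then_shift and its parameters are hypothetical, made up for this example rather than taken from the XL documentation. The idea is to keep a buffer mapped on the device across two offloaded loops and to refresh the host copy in between.

```c
/* Hypothetical helper: scales data[0:n] on the device, copies the
   intermediate values back to the host, then shifts them on the device.
   Returns the intermediate value of data[0] (0.0f if n <= 0). */
float scale_then_shift(int n, float *data, float scale, float shift)
{
    float midpoint_sample = 0.0f;

    /* target data maps the buffer once for the whole enclosed block. */
    #pragma omp target data map(tofrom: data[0:n])
    {
        /* The buffer is already present on the device, so this map
           clause does not trigger another transfer. */
        #pragma omp target map(tofrom: data[0:n])
        {
            #pragma omp parallel for
            for (int i = 0; i < n; ++i) {
                data[i] *= scale;
            }
        }

        /* target update copies the current device values to the host. */
        #pragma omp target update from(data[0:n])
        if (n > 0) {
            midpoint_sample = data[0];
        }

        #pragma omp target map(tofrom: data[0:n])
        {
            #pragma omp parallel for
            for (int i = 0; i < n; ++i) {
                data[i] += shift;
            }
        }
    }   /* the final values are copied back to the host here */

    return midpoint_sample;
}
```

Mapping the buffer once with target data avoids transferring it for every target region, which is usually the main reason to reach for it when porting code.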