We are experimenting a trial implementation of the OpenMP accelerator extension in OpenMP 4.0 specification. It is not meant to be the official or complete implementation due to the fast changing nature of the specification and our resource limitations.

The implementation is based on the existing OpenMP implementation in ROSE.

At the time the work being done, CUDA and/or nvcc does not have linker for device code (linking __global__ and _device_ functions from different files). We have to put some common device functions into this file. So they can be included in the same file in which the generated CUDA kernels are generated (__global__ functions can only call __device__ functions from the same file).

We provide two loop schedulers in the runtime so each CUDA thread may get more than 1 iterations to handle large iteration space.

a static even scheduler: this scheduler evenly divides up the iteration space into roughly equal sized chunks. It then assigns each chunk to a CUDA thread. This scheduling policy may stress the memory too much since each thread will touch a large range of data brought in by the iteration chunk

a round-robin scheduler: Each thread grabs one iteration (or more iterations) at a time. It is identical to OpenMP's schedule(static, chunk) policy.

Testing use of the schedulers (using corresponding CPU versions) are given at: