Is it that transferring half of the array still takes about the same time as transferring the entire array?

If both threads transfer half the array at about the same time, then yes, it typically takes about the same amount of time as one thread transferring the whole array, since both transfers share the same bus bandwidth. So your compute time should be roughly halved, but the overall data transfer time will stay about the same.

If you can interleave data movement and compute, then you might be able to maximize the data bandwidth. Though this is tough to do in an OpenMP context, given that there's typically tighter synchronization between threads. Eventually you'll also be able to use the OpenACC async clauses, which might help with interleaving, but unfortunately we don't quite have async working well enough within OpenMP yet (hence the PGI_ACC_SYNCHRONOUS variable). Async works fine in serial and MPI contexts, though.
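For reference, here's a rough sketch of the kind of pipelining the async clauses enable once they're usable, assuming a simple chunked loop (the array name, sizes, and chunk count are made up for illustration):

/* Sketch: process an array in chunks, with each chunk on its own async
 * queue so the transfer of one chunk can overlap the compute on another.
 * N, NCHUNKS, and the scale operation are illustrative placeholders. */
#include <openacc.h>

#define N       (1 << 20)
#define NCHUNKS 4
#define CHUNK   (N / NCHUNKS)

void scale(float *a)
{
    #pragma acc data create(a[0:N])
    {
        for (int c = 0; c < NCHUNKS; ++c) {
            int start = c * CHUNK;

            /* Update and compute on the same queue stay ordered;
             * different chunks (different queues) may overlap. */
            #pragma acc update device(a[start:CHUNK]) async(c)

            #pragma acc parallel loop present(a[0:N]) async(c)
            for (int i = start; i < start + CHUNK; ++i)
                a[i] *= 2.0f;

            #pragma acc update host(a[start:CHUNK]) async(c)
        }
        #pragma acc wait
    }
}

The idea is simply that the transfer of one chunk can proceed while another chunk is being computed, rather than paying for all the data movement up front.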

Could you clarify the usage of acc_set_device_num(devicenum,devicetype):

For the device number, are the GPUs numbered 0, 1, 2, ... or 1, 2, 3, ...? I thought it was the former, but according to this link (http://www.catagle.com/26-23/pgi_accel_prog_model_1_2.htm), passing in 0 gives default behavior, not the first GPU. Is the CUDA Device Number, as displayed by pgaccelinfo, the number I need to pass as my argument to get that device?

What does a devicetype of 0 or 1 do? (I didn't understand the documentation linked above).

For us, the default behavior is to use the lowest numbered device on which the binary will run. Typically this would be device zero, though it could be something higher. The device information, including the numbering, can be found by running the "pgaccelinfo" utility.

For the devicetype, you should use the enumerated names such as ACC_DEVICE_NVIDIA, since the numbering may not be consistent between compilers. You can see the PGI list by viewing the header file "include/accel.h" (located in your PGI installation directory).
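As a small hedged example, a call using the enumerated name might look like the following; the spelling acc_device_nvidia is what the OpenACC/PGI headers use, but check include/accel.h (or openacc.h) in your installation for the exact names available:

/* Sketch: pick a specific NVIDIA GPU before any accelerator regions run. */
#include <openacc.h>
#include <stdio.h>

int main(void)
{
    /* Query how many NVIDIA devices the runtime can see. */
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    printf("found %d NVIDIA device(s)\n", ngpus);

    /* Select the second GPU (device 1 in pgaccelinfo's numbering),
     * falling back to the default if only one device is present. */
    if (ngpus > 1)
        acc_set_device_num(1, acc_device_nvidia);

    /* Accelerator regions launched after this point target the selected device. */
    return 0;
}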

For example, if I'm running on nodes with 8 cores per GPU and every core runs an MPI process, they're all fighting over that one GPU.

It's a problem. Unless you have a K20, NVIDIA doesn't support multiple host processes (MPI) using the same GPU. It may work; it's just not supported. Even if it does work, you've serialized the GPU portion of the code. This situation works only if the MPI processes use the GPU infrequently and not at the same time.
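If the node does have more than one GPU, a common pattern is to spread the MPI ranks across the devices at startup. A rough sketch, assuming rank % ngpus is a good enough stand-in for a node-local rank (production codes usually compute the local rank properly, e.g. with a node-local communicator):

/* Sketch: assign each MPI rank to a GPU so ranks aren't all fighting
 * over device zero.  Assumes ranks are placed on nodes in blocks, so
 * rank % ngpus approximates a node-local rank. */
#include <mpi.h>
#include <openacc.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ngpus = acc_get_num_devices(acc_device_nvidia);
    if (ngpus > 0)
        acc_set_device_num(rank % ngpus, acc_device_nvidia);

    /* ... accelerator regions on each rank now target its assigned GPU ... */

    MPI_Finalize();
    return 0;
}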

Hi Matt,

If you do have a K20 GPU is there anything special you need to do to make it possible for multiple MPI processes to make calls to the same GPU simultaneously? Or should it work automatically? (ANSWER EDITED IN BELOW!)
