I've made some modifications to src/dgetrf_gpu.cpp such that the explicit H2D and D2H data transfers are performed asynchronously using streams. With this change, the routine should be able to execute asynchronously across concurrently executing OpenMP threads (my matrices are moderately small, about 160x160 per thread).

The problem is that there are still some synchronous data transfers being performed, as revealed by the CUDA command-line profiler (COMPUTE_PROFILE=1):

The overhead from this one portion of the routine dominates the cost in terms of CPU time and prevents me from executing asynchronously on the other threads. I'm fairly certain the source of these transfers is the call to magmablas_dpermute_long2:

I have attempted to replace the call to dlaswp inside magmablas_dpermute_long2 with the magmablas_dlaswp2 routine, which uses a device copy of the pivots (ipiv), but I have not been able to implement this successfully, if it's even possible.

Does anyone have some suggestions on how to proceed, given that I need each CPU thread to be able to call dgetrf_gpu completely asynchronously from the other CPU threads?

The current code uses stream 0 for the GPU BLAS. This would not allow concurrent BLAS execution on the GPU from the different threads.

Related to the communications, I think magmablas_dpermute_long2 does not use synchronous communications. The routine has no explicit communication; only the arguments get sent implicitly by CUDA. In particular, we pack a number of pivots in a structure, and this structure is passed to the kernel. The structure, like any other argument, is passed from the CPU to the GPU asynchronously by CUDA. I believe that when a kernel is called from the CPU, the CUDA compiler inserts code that prepares the arguments (possibly packing them into some contiguous data), queues the task for execution, and exits. Thus the call is asynchronous: control is passed back "immediately" to the calling thread, and a CUDA background thread makes sure the arguments for the queued task are sent to the GPU and that execution is started on the GPU.
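To illustrate the mechanism: a kernel that takes a packed pivot structure by value has that structure copied into its argument buffer at launch time, and the launch itself returns immediately. (A sketch; the struct layout and names below are illustrative, not MAGMA's actual definitions.)

```cuda
#include <cuda_runtime.h>

// Hypothetical pivot structure, mirroring the idea of packing a batch
// of pivots into a kernel argument. Not MAGMA's actual struct.
struct pivots_t {
    int npivots;
    int ipiv[64];   // small, fixed-size batch of pivot indices
};

__global__ void apply_pivots(double *dA, int lda, pivots_t p)
{
    // ... permute rows of dA according to p.ipiv ...
}

void demo(double *dA, int lda, pivots_t p, cudaStream_t stream)
{
    // 'p' is passed by value: CUDA copies it into the kernel's
    // argument space as part of the launch, so no explicit
    // cudaMemcpy of the pivots is needed here.
    apply_pivots<<<1, 64, 0, stream>>>(dA, lda, p);
    // Control returns here before the kernel has necessarily run.
}
```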

We wrote magmablas_dlaswp2 so that there is only one kernel call. The pivots here, though, are assumed to be on the GPU, so you have to copy them from the CPU to the GPU first. This is possible to do (we have done it, but have not released a version with it yet).
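A minimal sketch of that staging might look like the following. The dlaswp2 prototype here is assumed from this discussion (pivots on the device), not copied from the MAGMA headers, so check magmablas.h for the exact signature.

```cuda
#include <cuda_runtime.h>

// Assumed prototype for a dlaswp2-style routine with device pivots;
// verify against magmablas.h before use.
extern "C" void magmablas_dlaswp2(int n, double *dAT, int lda,
                                  int i1, int i2, const int *d_ipiv);

void swap_rows(double *dAT, int n, int lda, int i1, int i2,
               const int *ipiv /* host, pinned */, int *d_ipiv,
               cudaStream_t stream)
{
    // Asynchronous H2D copy of the pivot indices. For the copy to be
    // truly asynchronous, ipiv must be in pinned (page-locked) memory.
    cudaMemcpyAsync(d_ipiv, ipiv + i1, (i2 - i1 + 1) * sizeof(int),
                    cudaMemcpyHostToDevice, stream);
    // Single kernel call, reading the pivots already on the device.
    magmablas_dlaswp2(n, dAT, lda, i1, i2, d_ipiv);
}
```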

I will be interested to see some performance results using this approach. Thanks.

Thanks for your reply. I'll have to make my response brief, as I wrote a very detailed response, only to have it lost when I attempted to submit it.

In the implementation I speak of, I have replaced the call to magma_dgetrf_gpu with magma_dgetrf_gpu_v2, which accepts a unique CUBLAS handle and its associated CUDA stream as arguments from each calling CPU thread. Within the routine, all magmablas routines are replaced with variants that accept the stream as an argument and perform the GPU operations on that stream. All CUBLAS routines are replaced with their _v2 equivalents and passed the necessary handle, so that they should all execute asynchronously. All memory transfers are replaced with _async equivalents as well.
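The per-thread setup amounts to something like this (a sketch using standard cuBLAS v2 calls; the dgemm is just a stand-in for the BLAS calls inside the routine):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Each CPU thread owns a cuBLAS handle bound to its own stream, so
// BLAS calls issued from different threads can run concurrently.
void thread_setup_and_gemm(int n, const double *dA, const double *dB,
                           double *dC)
{
    cudaStream_t stream;
    cublasHandle_t handle;
    cudaStreamCreate(&stream);
    cublasCreate(&handle);
    cublasSetStream(handle, stream);   // all _v2 calls now use 'stream'

    const double one = 1.0, zero = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, dA, n, dB, n, &zero, dC, n);  // asynchronous launch

    cudaStreamSynchronize(stream);     // wait only for this thread's work
    cublasDestroy(handle);
    cudaStreamDestroy(stream);
}
```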

Perhaps the arguments implicitly sent by CUDA are transferred asynchronously only under certain conditions. From what I could gather, it seems the memory transfer can be performed asynchronously only if the data on the host is allocated in pinned memory. Maybe something like the following could work:
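For a plain array such as ipiv, pinned allocation is done with cudaMallocHost (or cudaHostAlloc) from the CUDA runtime API; for example:

```cuda
#include <cuda_runtime.h>

// Allocate the pivot array in pinned (page-locked) host memory so that
// a later cudaMemcpyAsync on it can overlap with GPU computation.
int *alloc_pinned_ipiv(int n)
{
    int *ipiv = nullptr;
    cudaMallocHost((void **)&ipiv, n * sizeof(int));  // pinned allocation
    return ipiv;   // release later with cudaFreeHost(ipiv)
}
```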

I'm a bit of a novice with C, so forgive my ignorance, but I do not know how to allocate the structure as pinned memory, since there is no explicit "malloc". Is it just that whatever is passed to the structure must be pinned (in this case ipiv)?

I am very interested in the use of the magmablas_dlaswp2 routine. At this point, it may just be best to wait for this implementation, or develop my own. Do you know when/if this is planned to be released?

As far as performance is concerned, I am seeing significant improvement in my code using the non-default streams, but I am waiting for a fully functional implementation before I do any extensive performance testing.

Austin, I see. This sounds good. We have been asked by users to provide this type of stream interface, so any experimental results on performance would be very useful for us to know.

I can check with NVIDIA developers if routine arguments are always sent asynchronously or if there are cases that they get sent synchronously.

To use magmablas_dlaswp2, you can allocate ipiv in pinned memory on the CPU, and precede the magmablas_dlaswp2 call with an asynchronous CPU-to-GPU copy of the corresponding pivoting indexes from ipiv (on the CPU) to some dipiv (on the GPU). Regarding putting arguments (in this case the structure with pivots) in pinned memory, I don't think it matters, because CUDA does not use that memory directly to send data to the GPU; the arguments get copied into an intermediate buffer (which I assume is allocated in pinned memory at CUDA initialization time).

Stan

At this point, I have a working modification to dgetrf_gpu.cpp which uses streams to allow asynchronous execution between OpenMP threads (i.e., multicore). Each thread creates a unique context (i.e., handle) and associates its stream with its handle. It's undoubtedly a naive implementation using streams, and I have not yet optimized the magmablas routines to account for the change in the availability of GPU resources. As a first attempt at further optimization, I also managed to implement a version with each thread spawning two streams, using this to overlap some of the communication/computation on each CPU core.
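Schematically, the per-thread driver looks roughly like this. Note that magma_dgetrf_gpu_v2 with a handle/stream argument list is my modified interface, not stock MAGMA, so the prototype below is a sketch:

```cuda
#include <omp.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch of the modified, stream-aware interface described above;
// this prototype is hypothetical, not part of stock MAGMA.
extern "C" void magma_dgetrf_gpu_v2(int m, int n, double *dA, int ldda,
                                    int *ipiv, int *info,
                                    cublasHandle_t handle,
                                    cudaStream_t stream);

void factor_many(int nmat, int n, double **dA_array, int **ipiv_array)
{
    #pragma omp parallel
    {
        // One stream and one cuBLAS handle per OpenMP thread.
        cudaStream_t stream;
        cublasHandle_t handle;
        cudaStreamCreate(&stream);
        cublasCreate(&handle);
        cublasSetStream(handle, stream);

        #pragma omp for
        for (int i = 0; i < nmat; i++) {
            int info;
            // Each factorization runs on this thread's private stream.
            magma_dgetrf_gpu_v2(n, n, dA_array[i], n, ipiv_array[i],
                                &info, handle, stream);
        }

        cudaStreamSynchronize(stream);  // only this thread's work
        cublasDestroy(handle);
        cudaStreamDestroy(stream);
    }
}
```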

As a quick test of performance, I modified the testing_dgetrf_gpu routine to parallelize the inner "niter" loop with OpenMP. Allocating/setting device or pinned memory forces device synchronization, so I moved those operations outside of the dgetrf_gpu routine itself and pass the dAT, dAP, ipiv, and work variables as arguments. Even with my naive implementation, the desired effect is achieved.

(Tests were performed with an AMD "Interlagos" Opteron 6274 and an NVIDIA Tesla K20X "Kepler" GK110, with one thread per FPU. The problem size was chosen to be 150 to mirror the application, and a block size of 128 was used.)

There are diminishing returns as I increase the number of threads, which is something I hope I can remedy, maybe through batching or something of the sort.