I'm testing a CUDA framework that I'm developing, and I'd like to use MAGMA to implement a blocked Cholesky factorization (so I need the dpotrf, dtrsm, dgemm and dsyrk kernels). For my experiments, these functions need to be asynchronous, and I need to be able to choose the CUDA stream on which they are launched (I will later synchronize using CUDA events). Ideally, the kernels should run exclusively on the GPU.

However, I read in a previous post that the dpotrf function is synchronous because it runs partially on the CPU, and that setting the CUDA stream in MAGMA is not thread-safe. Is this still true for the latest MAGMA release?

I know I can use CUBLAS for dtrsm, dgemm and dsyrk (they run asynchronously in the CUDA stream that I set), but I also need dpotrf. Is it possible to use the MAGMA kernels the way I need?

Yes, most MAGMA functions, including dpotrf, are hybrid: they do some work on the CPU (namely the panel, dpotf2), so they won't be asynchronous.

There is a Cholesky panel in MAGMA, magma_dpotf2_gpu, which runs completely on the GPU. You could use that to build a dpotrf that runs completely on the GPU. Panel operations tend to be slow on the GPU, though, so it may cause the entire factorization to be slower.

You can set MAGMA's stream beforehand, but if you have other threads that are also setting MAGMA's stream, currently you would need to modify dpotf2 to pass in a stream to be thread-safe. Eventually, the MAGMA API will change to have a stream passed into each function.

-mark
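To make the structure discussed above concrete, here is a minimal CPU-only sketch in C++ of a right-looking blocked Cholesky. It is not MAGMA or CUBLAS code, and it makes no GPU calls. It only shows where the panel step (the role of dpotf2) and the dtrsm and dsyrk/dgemm steps sit in the loop. All names (Matrix, potf2_block, trsm_block, update_trailing, blocked_cholesky, reconstruction_error) are hypothetical and chosen for illustration. In a GPU-only version, each step would presumably become a kernel or library call enqueued on the chosen stream.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper type: column-major n x n matrix; only the lower triangle is used.
struct Matrix {
    int n;
    std::vector<double> a;
    explicit Matrix(int n_) : n(n_), a(static_cast<std::size_t>(n_) * n_, 0.0) {}
    double& operator()(int i, int j) { return a[static_cast<std::size_t>(j) * n + i]; }
    double operator()(int i, int j) const { return a[static_cast<std::size_t>(j) * n + i]; }
};

// Panel step (role of dpotf2): unblocked Cholesky of the diagonal block [k, k+nb).
// Returns false if the block is not positive definite.
bool potf2_block(Matrix& A, int k, int nb) {
    for (int j = k; j < k + nb; ++j) {
        double d = A(j, j);
        for (int p = k; p < j; ++p) d -= A(j, p) * A(j, p);
        if (d <= 0.0) return false;
        d = std::sqrt(d);
        A(j, j) = d;
        for (int i = j + 1; i < k + nb; ++i) {
            double s = A(i, j);
            for (int p = k; p < j; ++p) s -= A(i, p) * A(j, p);
            A(i, j) = s / d;
        }
    }
    return true;
}

// Triangular solve (role of dtrsm): rows below the diagonal block, X * L11^T = A21.
void trsm_block(Matrix& A, int k, int nb) {
    for (int i = k + nb; i < A.n; ++i)
        for (int j = k; j < k + nb; ++j) {
            double s = A(i, j);
            for (int p = k; p < j; ++p) s -= A(i, p) * A(j, p);
            A(i, j) = s / A(j, j);
        }
}

// Trailing update A22 -= L21 * L21^T on the lower triangle. In a tiled version,
// diagonal tiles would use dsyrk and off-diagonal tiles would use dgemm.
void update_trailing(Matrix& A, int k, int nb) {
    for (int j = k + nb; j < A.n; ++j)
        for (int i = j; i < A.n; ++i) {
            double s = 0.0;
            for (int p = k; p < k + nb; ++p) s += A(i, p) * A(j, p);
            A(i, j) -= s;
        }
}

// Right-looking blocked Cholesky: panel, solve, update, one block column at a time.
bool blocked_cholesky(Matrix& A, int block) {
    for (int k = 0; k < A.n; k += block) {
        int nb = std::min(block, A.n - k);
        if (!potf2_block(A, k, nb)) return false;
        trsm_block(A, k, nb);
        update_trailing(A, k, nb);
    }
    return true;
}

// Factor the SPD test matrix A(i,j) = min(i,j) + 1 and return the largest
// entry-wise error of L * L^T against A (or -1.0 if the factorization fails).
double reconstruction_error(int n, int block) {
    Matrix A(n);
    for (int j = 0; j < n; ++j)
        for (int i = j; i < n; ++i) A(i, j) = std::min(i, j) + 1.0;
    Matrix L = A;
    if (!blocked_cholesky(L, block)) return -1.0;
    double err = 0.0;
    for (int j = 0; j < n; ++j)
        for (int i = j; i < n; ++i) {
            double s = 0.0;
            for (int p = 0; p <= j; ++p) s += L(i, p) * L(j, p);
            err = std::max(err, std::fabs(s - A(i, j)));
        }
    return err;
}
```

Calling reconstruction_error(7, 3) exercises a block size that does not divide the matrix size. The panel step is the one the answer warns is slow on the GPU, because it works on a small block with little parallelism, while the solve and update steps are the ones that map onto dtrsm, dsyrk and dgemm.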