I see that MAGMA has SGEEV for computing the eigenvalues/eigenvectors of a matrix; however, it uses a mix of CPU and GPU routines to achieve this.

I need to compute eigenvalues and eigenvectors entirely on the GPU, without assistance from the CPU, so as to avoid any device-host memory transfers (beyond what I already have). I am working on a real-time video application and have to calculate this every frame.

Could anyone suggest the easiest approach to do this?

I have tried using the CULA tools; however, they do not run asynchronously with the GPU memory copies, and I cannot control which stream the CULA functions use.

My current thinking is to adapt the SGEEV code by writing kernels to replace any calls to CPU-bound functions, but any suggestions would be appreciated, as this seems like a large undertaking.

We have generally found that hybrid CPU-GPU implementations outperform pure GPU implementations. The portions assigned to the CPU, e.g., the panel factorization and QR iteration, have more complicated control flow and less available parallelism, making them perform poorly on the GPU. Portions of the panel factorization are overlapped with updates on the GPU, to try to hide the CPU computation and CPU-GPU memory copies. This works well in LU factorization, but for the Hessenberg reduction used in the eigenvalue routines, it is much harder to hide all of the CPU computation.

If you are still intent on doing the entire computation on the GPU, for the geev I suggest first looking at the Hessenberg reduction in zgehrd, to see if you can implement the panel factorization (in zlahr2) completely on the GPU and achieve good performance. The Hessenberg reduction has its own testing routine in the testing directory.
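As a starting point, the math behind the Hessenberg reduction is easy to prototype on the CPU before attempting a kernel. Here is a minimal, unblocked NumPy sketch of the Householder reduction (my own illustration of what *gehrd computes; it has none of the blocked zlahr2 panel logic, and the matrix sizes are made up):

```python
import numpy as np

def hessenberg_reduce(A):
    """Unblocked Householder reduction to upper Hessenberg form.
    Each step zeroes column k below the subdiagonal with a reflector
    P = I - 2*v*v^T, applied from both sides to preserve eigenvalues."""
    H = np.array(A, dtype=float, copy=True)
    n = H.shape[0]
    for k in range(n - 2):
        x = H[k + 1:, k]
        s = np.sign(x[0]) if x[0] != 0 else 1.0
        v = x.copy()
        v[0] += s * np.linalg.norm(x)  # avoid cancellation in v[0]
        nv = np.linalg.norm(v)
        if nv < 1e-14:
            continue                   # column already reduced
        v /= nv
        # Two-sided similarity update: H <- P H P
        H[k + 1:, k:] -= 2.0 * np.outer(v, v @ H[k + 1:, k:])
        H[:, k + 1:] -= 2.0 * np.outer(H[:, k + 1:] @ v, v)
    return H

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H = hessenberg_reduce(A)
```

Because each update is a similarity transform, the trace and Frobenius norm of H match those of A, which makes a convenient correctness check before porting the update steps to GPU kernels.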

The eigenvalue calculation is only a small part of my algorithm, and I'm only doing it on a comparatively small matrix, say 256x256. I wanted to do it all on the GPU because the host-to-device transfer is already 100% occupied with my raw image data (512x512x256 per image).

Given that the eigenvalue calculation is very small in complexity and size relative to the rest of the algorithm, I was thinking I might get away with relatively low performance from a custom routine anyway. I just need the actual values on the GPU, without interrupting the rest of the processing chain with memory copies, since the rest of the calculation has to run on the GPU as it involves the full image data set.

It's also a symmetric problem (it's part of PCA; I probably should have mentioned that), so I realized I should be using the symmetric eigenvalue routine too (SSYEVD, I think?). Sadly, I'm not very familiar with BLAS and LAPACK functions.
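For what it's worth, the symmetric route is worth taking: the symmetric solvers exploit the structure, guarantee real eigenvalues, and return them sorted. A small NumPy sketch of the PCA setup (the data matrix here is made up; eigvalsh is NumPy's analog of LAPACK's SSYEVD, eigvals of SGEEV):

```python
import numpy as np

# Hypothetical PCA-style setup: a small covariance matrix stands in
# for the real 256x256 problem.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 6))
C = np.cov(X, rowvar=False)            # 6x6 symmetric covariance

# General non-symmetric solver (analog of SGEEV): unordered output.
w_general = np.sort(np.linalg.eigvals(C).real)

# Symmetric solver (analog of SSYEVD): real eigenvalues, ascending
# order, and roughly half the work by exploiting symmetry.
w_sym = np.linalg.eigvalsh(C)
```

Both agree on a symmetric input, but the symmetric path is the one to mimic on the GPU for PCA, since a covariance matrix is symmetric positive semidefinite by construction.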

I was thinking I might get away with writing a more naive QR algorithm kernel, as even if it were slower, it might still be small compared to the other work in the algorithm. Do you have any experience with how something like that might pan out?