I understand MAGMA algorithms are hybrid CPU + GPU. I have a realtime computer-vision application that needs to solve a linear system Ax = b (A is 6x6, symmetric positive definite, double precision) multiple times per frame at 30 fps. The application currently runs mostly on the GPU, but the linear system is solved on the CPU with Eigen, which involves hundreds of GPU -> CPU -> GPU memory copies every second.

I would like to see what the performance of the application would be like if I could solve the linear system exclusively on the GPU and completely eliminate the GPU-CPU memory copies. Even if each individual solve is much slower on the GPU than on the CPU, that might be made up for by eliminating the memory copies, kernel launches, etc.

If MAGMA can't help here, do you have ideas for another implementation that might, or suggestions for custom coding it with lower-level libraries (cuBLAS maybe?) for the 6x6 case? As you can probably tell, I'm not an expert in linear algebra, so if I haven't been clear enough, please let me know.

MAGMA does not currently have the capability to solve many small matrices entirely on the GPU.

Given that your matrix is SPD, you can use a Cholesky factorization, which is nice because it has simpler control flow than the general LU factorization: no pivoting is required. Depending on the size and number of matrices to be solved simultaneously, either a single thread or a single block could do each factorization.

When you say it is solved multiple times per frame, can those multiple times be in parallel, or does the result of one solve become an input for a subsequent solve?

Also, is the matrix changing for each solve, or do you just have different right-hand sides to solve with? If the matrix keeps changing, you have to re-factor each time, as I assumed above. If just the right-hand side changes, then you could factor once (even on the CPU) and just use cublasXtrsm() twice to solve entirely on the GPU.
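The two trsm calls amount to a forward substitution with L followed by a back substitution with L^T. A plain-C sketch of that solve step, assuming the factor L is already available; the name `cholSolve6` and macro `N` are mine, not cuBLAS:

```c
#define N 6  /* system size from the question */

/* Given the Cholesky factor L (lower triangle, row-major, A = L*L^T),
 * solve A x = b via two triangular solves:
 *   forward:  L y = b
 *   backward: L^T x = y
 * This mirrors calling a triangular solve (trsm) twice on the GPU. */
void cholSolve6(double L[N][N], const double b[N], double x[N])
{
    double y[N];
    /* forward substitution: L y = b */
    for (int i = 0; i < N; i++) {
        double s = b[i];
        for (int k = 0; k < i; k++)
            s -= L[i][k] * y[k];
        y[i] = s / L[i][i];
    }
    /* back substitution: L^T x = y (L^T[i][k] == L[k][i]) */
    for (int i = N - 1; i >= 0; i--) {
        double s = y[i];
        for (int k = i + 1; k < N; k++)
            s -= L[k][i] * x[k];
        x[i] = s / L[i][i];
    }
}
```

With A fixed, only these two small solves need to run per right-hand side, and the factor never has to leave GPU memory.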

"MAGMA does not currently have the capability to solve many small matrices entirely on the GPU."

Does it have the capability to solve a single small matrix entirely on the GPU? I'm not sure if I was clear enough in my original post: my application flow requires the matrices to be solved individually, one at a time. I don't need to solve many at a time on the GPU, just one at a time, where the input data resides on the GPU and the result is provided on the GPU without any memory copy to, or involvement of, the CPU.

No, it can't currently factor a matrix entirely on the GPU. The current hybrid code always uses both the CPU and the GPU to factor the matrix. For a matrix of this small size (n=6), it would actually end up factoring it entirely on the CPU, even if the GPU interface were used. -mark