Our application requires a large number of independent least-squares minimizations, which result in QR decompositions of varying size, generally no larger than 200x100. Matrices this small are at the low end of performance if we try to utilize the whole GPU for each one. We would like to invoke multiple magma_sgeqrf calls, each occupying one thread block. Can MAGMA be configured to run kernels in this way?
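For context, each individual problem is a standard least-squares solve via QR; a minimal NumPy sketch with arbitrary example sizes (the question above is about running many of these concurrently on the GPU):

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of independent least-squares problems of varying size,
# each no larger than ~200x100; sizes here are illustrative.
problems = [(rng.standard_normal((m, n)), rng.standard_normal(m))
            for m, n in [(200, 100), (150, 80), (120, 60)]]

xs = []
for A, b in problems:
    Q, R = np.linalg.qr(A)            # reduced QR: Q is m x n, R is n x n
    x = np.linalg.solve(R, Q.T @ b)   # solve R x = Q^T b
    xs.append(x)

# At the minimizer the residual is orthogonal to the columns of A.
for (A, b), x in zip(problems, xs):
    assert np.allclose(A.T @ (A @ x - b), 0, atol=1e-8)
```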

Will,

This is a type of problem that I have also been wondering how to resolve. We have been asked about similar problems, e.g., many small LU solvers, eigensolvers, etc. I find the idea of invoking multiple magma_sgeqrf calls, each occupying one thread block, feasible, but it may require some work. In particular, the BLAS involved has to be configured to use one thread block. This would not be that difficult because we have the sources for the BLAS needed. Also, some experimentation may be needed to see whether the panels can still be factored on the CPU (as in MAGMA). It may turn out that the GPU would not have enough work and some panels may have to be scheduled on the GPU, which would require that a GPU panel factorization (for one thread block) be developed.

These are my initial thoughts on this. I would be interested to know if you have tried something and how it worked. Meanwhile I can also check with our Berkeley collaborators whether they have something ready to distribute through MAGMA - I know they have a paper on a communication-avoiding QR which would have required the development of kernels similar to what you need (http://www.cs.berkeley.edu/~mjanders/files/IPDPS.pdf).

Regards,
Stan

I am only now getting back to this subject. A student wrote a CUDA implementation to perform QR on multiple (small) matrices with variable size, with each matrix mapped to a separate thread block. The performance increases dramatically from 1 to 10 matrices and reaches an asymptote at about 50 matrices. So we know the concept is good for our case.

How much work would be involved to modify MAGMA to assign independent matrices to thread blocks? Would it be viable to give the problem to a graduate student as a thesis project? We might be able to find a willing student.

There is some overlap with the Tall-Skinny QR mentioned in the Berkeley paper. However, in their case the matrices at least have a constant number of columns. It would be interesting to know whether their TSQR could somehow be adapted to the variable-size case.

I am also thinking of formulating the problem as a block-diagonal matrix, filling the rest of the matrix with zeros and using the existing MAGMA QR. This is very wasteful, but given MAGMA's great performance for large matrices, it might still serve as an intermediate solution.
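The block-diagonal idea can be prototyped on the CPU to check correctness; a hedged NumPy sketch (block sizes are arbitrary), showing that one QR-based solve on the padded block-diagonal system reproduces the independent per-block solutions:

```python
import numpy as np

rng = np.random.default_rng(1)
blocks = [rng.standard_normal((m, n)) for m, n in [(8, 5), (10, 4), (6, 3)]]
rhs = [rng.standard_normal(A.shape[0]) for A in blocks]

# Assemble one large block-diagonal matrix; everything outside the
# diagonal blocks is zero padding (the wasteful part).
M = sum(A.shape[0] for A in blocks)
N = sum(A.shape[1] for A in blocks)
big = np.zeros((M, N))
r = c = 0
for A in blocks:
    m, n = A.shape
    big[r:r+m, c:c+n] = A
    r += m
    c += n
b = np.concatenate(rhs)

# One QR-based least-squares solve on the big system ...
x_big, *_ = np.linalg.lstsq(big, b, rcond=None)

# ... matches the independent per-block solves.
x_ref = np.concatenate([np.linalg.lstsq(A, y, rcond=None)[0]
                        for A, y in zip(blocks, rhs)])
assert np.allclose(x_big, x_ref)
```

The padding cost grows quadratically with the number of blocks, which is why this only makes sense as a stopgap that trades wasted flops for the large-matrix performance of the existing kernels.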

The problem can also be formulated as solving n independent normal-equation systems A^T A x = A^T b. But I suppose the Cholesky kernels cannot run on a single thread block any more than the QR kernels can.

What performance did you get with your student's implementation? Is the source code public?

I have been wondering how to get a "multi-batch" version of eigensolvers (this is one topic of my thesis). The matrix sizes range from 60x60 to 128x128, but I have thousands of them to compute. As this operation comes after many FFTs that perform really well on the GPU, all these matrices are already in the GPU's memory, so I don't have to worry about transfer times.

I haven't gone deeply into this yet, but I was thinking of using the concurrent-kernel feature, with each kernel computing an EVD on a different matrix.

Will,

This sounds good! I will be interested to hear more about it, especially since more and more users are requesting support for this type of problem.

How much work would be involved to modify MAGMA to assign independent matrices to thread blocks? Would it be viable to give the problem to a graduate student as a thesis project? We might be able to find a willing student.

One way to modify MAGMA quickly would be to add interfaces where the user provides as input the computational streams that a routine should use. The user can then create several streams and run the independent factorizations through them. This would not involve changes to the MAGMA sources, but a single routine would not be constrained to just a single multiprocessor - based on the matrix size it may use several multiprocessors; the rest, though, would not be idle, as they would pick up work from the other streams.

Another way would be to enforce the use of a single multiprocessor for a single problem. This may require some coding. I guess the first approach would be better for "medium" size problems, and the latter for "small" ones (relative to the default blocking size for gemm, which is, for example, 64 for dgemm on Fermi; the problem that Jeff mentions would probably be considered small). I find this would be interesting for a thesis project - it can involve a combination of algorithm design and code optimization, experimentation, and it can probably expand into the problem area the matrices come from.

Stan

So that the extreme case might be taken into account in any planning and design that is done: the problem I am working on involves dsyevd and dsygvd on quite a large number (up to 10,000 in the near future) of 5x5 and 3x3 matrices, which can be done in parallel. For problems of this size, a thread block per matrix would seem too large a processing unit. One GPU thread per matrix would seem a better match, though a new data-layout scheme might be needed for coalesced access.
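The coalesced-layout point can be illustrated on the host side; a hedged NumPy sketch of the interleaved, batch-innermost storage that one-thread-per-matrix kernels typically want, where the (i, j) entries of all matrices sit contiguously so that consecutive threads (one per matrix) touch consecutive addresses:

```python
import numpy as np

rng = np.random.default_rng(3)
batch, n = 10000, 5
mats = rng.standard_normal((batch, n, n))   # natural matrix-major layout

# Interleaved layout with index order (i, j, k): for a fixed matrix
# entry (i, j), consecutive matrices k occupy consecutive addresses -
# the pattern that coalesces when GPU thread k owns matrix k.
interleaved = np.ascontiguousarray(mats.transpose(1, 2, 0))

# Entry (i, j) of all matrices is now one contiguous run of `batch` floats.
assert interleaved[2, 3].flags['C_CONTIGUOUS']
assert np.array_equal(interleaved[2, 3], mats[:, 2, 3])
```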

I just finished integrating a blocking factor into the QR code, so now we can look at systems of very large size.

The code is part of CUSP, and is therefore publicly available. It was, however, written by a student and is not very clean. The performance reaches an asymptote of about 35 GFlop/s for 60 matrices (each 600x600, which is too large to be a realistic test case). So the absolute performance is not high, probably because the underlying QR is not optimized. This is exactly why we would like to use MAGMA instead. But with all these caveats, if you are interested, let me know.

Thanks for your answer. The first part (the user provides the computational streams as input) would be an entirely reasonable way to go. We only constrained the problem to one multiprocessor for simplicity's sake. So if I understand correctly, since cublas_v2 already supports streams, it is a matter of upgrading MAGMA to cublas_v2 and then passing a streams argument through MAGMA. Or is there more to it?

As for the second possibility of enforcing a single multiprocessor per matrix, we may still pursue this. I will add this to the list of possible student projects.