Hi, I'm using CUDA 5 with MAGMA on a K20 GPU. I have a 1024x1024 matrix on which I'm calling magma_zgesv_gpu. I'm trying to see if I can speed up performance using multiple streams. I guess I have a couple of questions: - How do I know if the GPU is already busy with this setup, such that the additional streams will not help? - I'm doing the following:

magma_zgesv is a synchronous function, as are most magma functions, whereas magmablas functions are asynchronous. (I assume magma_zgesv is what you call inside magmaWrapper.SolveAXequB.) So it won't currently run multiple solves in parallel. Since part of the factorization happens in CPU code, you would actually need multiple CPU threads to even possibly run multiple solvers in parallel. Even that is not yet supported. -mark

Yes indeed, I call magma_zgesv inside the SolveAXequB function. Is changing zgesv so that it can run concurrently (CPU and GPU) something that is planned? How hard do you think it would be to do this myself? And maybe a more basic/important question: do you think it would benefit performance to do so? i.e., if I were able to run multiple solvers on a K20 card with 1024x1024 matrices, would I see performance gains over serial runs?

Oh, and will I be able to run 2 solvers if I use, for example, a two-GPU system? Or would the CPU part again prevent the code from truly running in parallel and solving two sets of equations at the same time?

I'm not sure whether or when support for running multiple parallel solves will be added. Partly this is because a large enough matrix will completely occupy the GPU, so there would be no performance benefit. For smaller matrices, there may be some benefit, but it may need a more specialized interface to pipeline the operations effectively.

For multiple GPUs, if you call magma_zgesv from separate CPU threads, they ought to run in parallel. But I've never tried it. There may also be issues with multi-threaded BLAS on the CPU side. For instance, if you have 12 cores and set MKL number of threads to 12, then will the multiple panel factorizations conflict with each other and over-subscribe the CPU cores?

So, basically, it's on our agenda to look at it, but we don't as yet have all the answers. -mark

As you can see, for small n like 512, doing multiple gemms in parallel using streams does improve the overall performance. But as n increases, it quickly attains nearly the whole peak speed, especially for double-complex. So I would not expect significant improvements using streams to solve multiple problems in parallel for n > 1000, even less for double and complex.

Hi Mark, Thanks a lot for the detailed benchmark. I do have another followup question though.

I'm using the magma_zgesv_gpu function. As far as I understand, it first calls magma_zgetrf_gpu (LU factorization) and then magma_zgetrs_gpu, which finally calls some version of zgemm. If I run the code for one call in one stream via the profiler, I see the attached image.

magma_zgesv_gpu profiler output

MagmaOut.jpg (281.99 KiB)

As far as I understand, the left side (with the memcopies) is the magma_zgetrf_gpu code, which has many host-to-device and device-to-host copies and a lot of code running on the CPU, all happening synchronously. Then come the synced memcopies, and then on the right is the magma_zgetrs_gpu code, which seems to utilize the GPU better (at least from how it looks in the profiler: no memcopies in the middle, and the kernels seem to run one right after the other).

Now I have the following questions (these should be read more as thoughts, and obviously not as complaints whatsoever :) ):
- I guess that if you did the measurements from your previous post on the whole process (zgetrf_gpu and zgetrs_gpu), we'd see that the peak performance for one stream drops significantly, and hence maybe multiple streams (though not possible today, for the reasons you explained in previous posts) would boost performance.
- Since the code synchronizes the device, it will probably also be hard to run "other" (MAGMA/custom) kernels while the MAGMA code runs, to achieve concurrency for other tasks.
- The kernels, for both zgetrf and zgetrs, seem to use a low number of blocks per kernel launch. This might also indicate low utilization of the GPU (at least for the LU part, though the solver itself also uses a small grid).
- For Kepler, my target GPU, the GPU seems extremely idle during this flow; it would have been nice if we could use streams and/or dynamic parallelism to boost it.