I have been exploring why my FORTRAN routine using zgetrs_gpu is so slow and I have some interesting results.

I decided to start in C++. I adapted testing_zgetrf_gpu.cpp so that it goes on to call magma_zgetrs_gpu, and timed the back substitution. I did the same for testing_dgetrf_gpu.cpp.

I have also put in comparative calls to the LAPACK routines lapackf77_zgetrs and lapackf77_dgetrs (which I had to add to the MAGMA headers as they were not there).

Bear in mind that I am using gotoblas2 and running 4 cores on my CPU.

In each case I have wrapped the call in timer calls and report the elapsed time, so the figures are times, not flop rates.
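The kind of timing wrapper I mean can be sketched roughly as below. This is only an illustration, not the code from the MAGMA testers: the name time_call and the reps parameter are made up for this example. For a GPU routine the wrapped call should also wait for the device to finish (for example with cudaDeviceSynchronize()) before returning, otherwise only the launch is timed.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `call` reps times and returns the average
// wall-clock time per call in seconds.
template <typename F>
double time_call(F&& call, int reps = 1) {
    assert(reps > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        call();  // the routine being timed, e.g. a lambda around the solve
    }
    const auto stop = std::chrono::steady_clock::now();
    const double total = std::chrono::duration<double>(stop - start).count();
    return total / reps;
}
```

It would be used by passing a lambda that makes the solver call, e.g. time_call([&] { /* call the solve here */ }).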

This case is the one of interest in my other work. Again there is something odd about the 9024 result on LAPACK.

I have also done this including the transfer times, but they do not make much difference. The MAGMA routine is about 10 times worse than the LAPACK routine, which explains my problems with the case I have been working on.

zgetrs spends most of its time in ztrsm, which I think is a CUBLAS routine, whereas for double precision you have written your own dtrsm.

The double precision results are not too bad, but the double complex ones are amazingly unhelpful.

Is there something in your work programme on this? I think there should be a warning somewhere.

I will continue to look at zero copy to see where it can help, but in terms of my other problem I am back to what I called strategy 1: use MAGMA for zgetrf but not for zgetrs, unless I am missing something here.

I have continued this work as follows. I have converted the timings to GFlops for the solve step, that is, for calculating the solutions for the right-hand sides.
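For anyone wanting to do the same conversion, here is a minimal sketch. It assumes the usual operation count for the two triangular solves in getrs (n*n*nrhs multiplies and n*(n-1)*nrhs adds, with a complex multiply counted as 6 real flops and a complex add as 2). The function names are my own for illustration and are not taken from the MAGMA sources.

```cpp
#include <cassert>

// Approximate flop count for getrs on an n x n factored matrix with nrhs
// right-hand sides (assumed convention, see the note above).
double getrs_flops(double n, double nrhs, bool is_complex) {
    const double muls = nrhs * n * n;
    const double adds = nrhs * n * (n - 1.0);
    // Complex: 6 real flops per multiply, 2 per add.
    return is_complex ? 6.0 * muls + 2.0 * adds : muls + adds;
}

// Convert a flop count and an elapsed time in seconds to GFlop/s.
double gflops(double flops, double seconds) {
    assert(seconds > 0.0);
    return flops / seconds / 1e9;
}
```

Under this count the work grows linearly with nrhs, so any difference in GFlops between one and many right-hand sides is a difference in how efficiently the hardware is being used.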

I will report some values below but first the general conclusions.

1. The MAGMA dtrsm is much better than the CUBLAS dtrsm in all cases.
2. There is not yet a MAGMA ztrsm so CUBLAS ztrsm is used.
3. All the routines perform better when there are more right hand sides to be solved.

In the problem I am working with, the case is complex and the right-hand sides are defined one at a time, so I am in the worst of the worst cases, for which LAPACK on the CPU is the quicker solution, unless I can find some way to reorganise the calculations.

I am wondering whether the one RHS case needs to be specially handled via ztrsv/dtrsv. This turns out to be a VERY good thought - see the end for the comparative results using CUBLAS dtrsv.
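To show what the single right-hand-side case reduces to, here is a plain CPU reference sketch: a row interchange step followed by two triangular vector solves, which on the GPU would correspond to calls such as cublasDtrsv or cublasZtrsv. It assumes the factors are stored as getrf leaves them (column-major, unit lower L below the diagonal, U on and above it, 1-based pivots). The function name lu_solve_single_rhs is hypothetical, and this is not MAGMA or CUBLAS code.

```cpp
#include <cassert>
#include <complex>
#include <utility>
#include <vector>

// Solve A x = b for ONE right-hand side from the LU factors of A.
// b is overwritten with the solution x.
template <typename T>
void lu_solve_single_rhs(int n, const std::vector<T>& lu, int lda,
                         const std::vector<int>& ipiv, std::vector<T>& b) {
    assert(lda >= n);
    // Apply the row interchanges recorded by the factorisation.
    for (int i = 0; i < n; ++i) {
        std::swap(b[i], b[ipiv[i] - 1]);
    }
    // Forward substitution with unit lower L (the first trsv).
    for (int j = 0; j < n; ++j) {
        for (int i = j + 1; i < n; ++i) {
            b[i] -= lu[i + j * lda] * b[j];
        }
    }
    // Back substitution with upper U (the second trsv).
    for (int j = n - 1; j >= 0; --j) {
        b[j] /= lu[j + j * lda];
        for (int i = 0; i < j; ++i) {
            b[i] -= lu[i + j * lda] * b[j];
        }
    }
}
```

The template parameter lets the same sketch serve for double and std::complex<double>. Each sweep touches a vector rather than a matrix of right-hand sides, which is the situation the trsv routines are written for.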

Please would you add dtrsv and ztrsv as well as ztrsm to your todo list.

Now some results. These show that the MAGMA routine is fine for the case of many right-hand sides.

P.S. There are lots of other places where calls to Xtrsv can replace Xtrsm when nrhs==1. There is almost always a gain. The only exception I have found is sgemv, where the MAGMA strsm is so good that CUBLAS strsv cannot better it. The best gains are in the complex cases, where trsm is still the CUBLAS one.

I will check this out when I get the new release, as it is central to one of my applications. It is complex, so I expect it will show a big gain if you have done zgetrsm, as we are still using the CUBLAS one for that in RC4.