I have seen a strange error with zgetrf_gpu (a wrong answer, or a segfault in memory calls after the routine) on a Cray XK (Fermi+). The error occurs when (1) M=N, (2) M is a multiple of 32, and (3) LDA is a multiple of 32 but not equal to M. I thought this could be a bug in zinplace_transpose, so I changed the source to call magmablas_ztranspose2 instead, and the problem went away. I think the zinplace_transpose source might have some bugs, but I haven't figured it out yet.
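For anyone trying to reproduce this, the three conditions above can be written as a small predicate. This is only a sketch of the reported trigger, not code taken from MAGMA; the function name `hits_reported_case` is made up for illustration.

```cpp
#include <cassert>

// Hypothetical helper (not part of MAGMA): returns true when the matrix
// dimensions match the three conditions reported in this thread.
//   m, n : rows and columns of the matrix being factorized
//   lda  : leading dimension of the array holding it
bool hits_reported_case(int m, int n, int lda)
{
    assert(m >= 0 && n >= 0 && lda >= m);  // basic sanity check on the inputs
    return m == n          // (1) square matrix
        && m % 32 == 0     // (2) M is a multiple of 32
        && lda % 32 == 0   // (3) LDA is a multiple of 32 ...
        && lda != m;       //     ... but not equal to M
}
```

A test driver could loop over a few (M, LDA) pairs and compare the results of the cases where this returns true against the cases where it returns false.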

Hi Keita,
Thank you for this bug report. We were able to reproduce it and to fix the bug. The fix will be in the next release (beginning of May). It is for a case where we do an in-place matrix transpose. There are two calls to the transpose routine (in file zgetrf_gpu.cpp). The first one is:

Keita,
Yes, we fixed the other versions as well. Actually, we generate them from the double complex version. I will go through the other LU versions as well, in particular the CPU interface ones, to see if they also need this fix.
Stan

I am still seeing a wrong answer if A and X are allocated in contiguous memory, as described below. I suspect that zinplace_transpose is making out-of-bounds memory accesses when M!=LDA. In my application, I finally get the correct answer after changing the source to apply zinplace_transpose only when LDA=M.

The transpose routine appears to be OK. I am also looking into magmablas_zpermute_long2 for any out-of-bounds memory access.

Just for clarification, the error occurs when:
M=3584
LDA=3616
Input matrix starts at A(32,32).
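To make the layout concrete, here is a rough sketch of the column-major index arithmetic for a submatrix stored inside a larger array. It assumes zero-based indices, which the post does not state (A(32,32) may well be one-based Fortran notation), and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Offset (in elements) of entry (i, j) in a column-major array with leading
// dimension lda. Indices are zero-based here; that is an assumption, since
// the post does not say whether A(32,32) is zero- or one-based.
std::size_t colmajor_offset(std::size_t i, std::size_t j, std::size_t lda)
{
    assert(i < lda);  // row index must fit inside one stored column
    return i + j * lda;
}

// True when an m-row submatrix starting at row i0 fits within lda rows.
bool rows_fit(std::size_t i0, std::size_t m, std::size_t lda)
{
    return i0 + m <= lda;
}
```

Under the zero-based reading, the submatrix starts at element offset 32 + 32*3616 = 115744 and its rows fill each column exactly up to the leading dimension (32 + 3584 = 3616). Every column therefore also holds rows that are not part of the matrix being factorized, so a routine that indexes rows from the wrong base would read or write that data.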

keitat wrote:
I am still seeing a wrong answer if A and X are allocated in contiguous memory, as described below. I suspect that zinplace_transpose is making out-of-bounds memory accesses when M!=LDA. In my application, I finally get the correct answer after changing the source to apply zinplace_transpose only when LDA=M.

Keita,
I managed to reproduce this bug as well. Thanks for reporting it. It was in magmablas_zpermute_long2. It is now fixed in SVN and we will make the fix available soon. Meanwhile, the fix is to call zpermute as

The bug was that, in this particular case, the code was permuting user data outside of the submatrix being factorized. We didn't have this issue in the CPU interface code, and for this case we were only checking the correctness of the factorization.
Stan
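A check for this kind of bug could compare the data outside the submatrix before and after the call. The following is a minimal sketch of such a check on host copies of the array; it is not taken from the MAGMA testers, and the name `outside_unchanged` and its parameters are hypothetical.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using Z = std::complex<double>;

// Hypothetical test helper (not part of MAGMA). `before` and `after` are
// host copies of the whole lda-by-ncols column-major array, taken before and
// after the factorization. The factorized submatrix is m-by-n with its
// top-left entry at zero-based position (i0, j0).
// Returns true if every entry outside the submatrix compares equal.
bool outside_unchanged(const std::vector<Z>& before, const std::vector<Z>& after,
                       std::size_t lda, std::size_t ncols,
                       std::size_t i0, std::size_t j0,
                       std::size_t m, std::size_t n)
{
    if (before.size() != lda * ncols || after.size() != lda * ncols)
        return false;  // both copies must cover the full array

    for (std::size_t j = 0; j < ncols; ++j) {
        for (std::size_t i = 0; i < lda; ++i) {
            const bool inside = i >= i0 && i < i0 + m && j >= j0 && j < j0 + n;
            if (!inside && before[i + j * lda] != after[i + j * lda])
                return false;  // data outside the submatrix was modified
        }
    }
    return true;
}
```

In a tester, one would copy the full array back from the GPU before and after zgetrf_gpu and require this check to pass in addition to the usual residual check on the factorization.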