Continuing the discussion here, although there are now some other issues:

I am currently running a large random single-precision real matrix problem -- M=71710, N=71710, LDA=71712 -- and the computation proceeds until, at some point, the program segfaults in a call to cudaStreamCreate():

No errors are indicated by any of the check_error calls in the other stream-handling functions. Could it be that there is a memory leak in the CUDA library that is presenting itself here? What, in your experience, is the limit on the number of streams that can be used with one device?
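One way to see whether stream handles are simply piling up before the crash is to route every create/destroy through a small counting wrapper, along these lines (a sketch of the idea only, not MAGMA code; the wrapper names are made up):

    #include <stdio.h>
    #include <cuda_runtime.h>

    static int live_streams = 0;

    /* Hypothetical wrapper: report creation failures and count live streams. */
    cudaError_t dbg_stream_create( cudaStream_t *s, const char *where )
    {
        cudaError_t err = cudaStreamCreate( s );
        if ( err != cudaSuccess )
            fprintf( stderr, "[%s] cudaStreamCreate failed: %s\n",
                     where, cudaGetErrorString( err ) );
        else
            fprintf( stderr, "[%s] stream %p created, live=%d\n",
                     where, (void*) *s, ++live_streams );
        return err;
    }

    cudaError_t dbg_stream_destroy( cudaStream_t s, const char *where )
    {
        cudaError_t err = cudaStreamDestroy( s );
        if ( err == cudaSuccess )
            fprintf( stderr, "[%s] stream %p destroyed, live=%d\n",
                     where, (void*) s, --live_streams );
        return err;
    }

If the live count keeps growing, streams are being created without being destroyed; if it stays small, the crash is more likely memory corruption elsewhere rather than running out of stream handles.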

I have added debug output to the magma queue creation functions, and get the following output:

We've used at least 16 streams before. (CUDA lets you create more than 16, but effectively uses only 16 at a time, according to NVIDIA's presentation.)

To speed up debugging, you could try commenting out the magma_sgemm and other MAGMA BLAS calls. That eliminates the computation but keeps the same memory and stream allocations. Of course, if the error is a bad memory reference in a BLAS call, that will hide the error (which may also be a clue as to which kernel is stepping on memory).
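Concretely, the idea is something like the following (a hypothetical snippet, not the actual MAGMA source): leave the call site in place so the surrounding allocations, transfers, and stream usage are unchanged, and only skip the kernel launch.

    #if 0   /* debugging: skip the compute kernel, keep memory/stream traffic */
        magma_sgemm( MagmaNoTrans, MagmaNoTrans, m, n, k,
                     alpha, dA, ldda, dB, lddb,
                     beta,  dC, lddc );
    #endif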

After disabling the sgemm and strsm calls, I end up with a launch failure in magmablas_spermute_long3. I know that this is supposed to swap the columns of the transposed matrix, but the exact workings are a little tricky to piece together. Is there a paper or something that describes the algorithm being used here, so that I can try to find the exact source of the error?
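For reference, the reports below come from the standalone cuda-memcheck tool and from cuda-gdb with its memcheck mode enabled, roughly like this (the tester name and size flag are only illustrative and depend on the MAGMA version and harness being used):

    cuda-memcheck ./testing_sgetrf_gpu -N 71710

    cuda-gdb --args ./testing_sgetrf_gpu -N 71710
    (cuda-gdb) set cuda memcheck on
    (cuda-gdb) run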

========= Invalid __global__ write of size 4
=========     at 0x00000d10 in ../../magmablas/stranspose-v2.cu:50:stranspose3_32
=========     by thread (0,5,0) in block (13,1,0)
=========     Address 0x20682d280 is out of bounds
=========

Running cuda-gdb with cuda memcheck enabled shows that the error is on the following line:

When I use the MAGMA_NGR_NB environment variable to ensure that the memory for dAT (which I have moved into a separate allocation) does not exceed 4 GB, the cuda-memcheck errors disappear. Could it be that 32-bit pointers are being used somewhere internally? I am compiling with the -m64 nvcc flag, so I would not expect that to be the case.
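If it is not a 32-bit pointer, it could just as well be a 32-bit index: even with -m64, an offset computed in plain int wraps once a single array passes 2^31 elements, and adding such an index to a perfectly good 64-bit base pointer gives exactly the kind of out-of-bounds address shown above. A minimal illustration of that failure mode (the numbers are just plausible values from this run, not taken from the MAGMA source):

    #include <stdio.h>

    int main( void )
    {
        int lda = 71712, j = 35000, i = 40000;

        int    bad  = j*lda + i;             /* 32-bit product overflows (typically wraps negative) */
        size_t good = (size_t) j*lda + i;    /* promote to 64-bit before multiplying                */

        printf( "32-bit offset: %d\n",  bad );
        printf( "64-bit offset: %zu\n", good );   /* 2509960000 elements, well past 2^31 */
        return 0;
    }

This is only a guess at the mechanism, but it would be consistent with the errors disappearing as soon as the dAT allocation is kept under 4 GB.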