I'm using magma_sgeqrf_gpu to do QR factorization on a large matrix, and the function does not appear to free CUDA memory after it finishes. When my code tries to allocate more GPU memory after the magma_sgeqrf_gpu call, the allocation fails with an out-of-memory error. I tried calling cudaMemGetInfo to see how much memory is left after the call, but it gives a segmentation fault; the same call works fine when made before magma_sgeqrf_gpu. Has anyone encountered this GPU memory leak in magma_sgeqrf_gpu, and is there a solution?
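
For reference, this is roughly how I'm querying the memory (a minimal standalone sketch, not my exact code):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        size_t free_bytes = 0, total_bytes = 0;

        /* cudaMemGetInfo fills in the free and total device memory in bytes. */
        cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("GPU memory: %zu bytes free of %zu total\n", free_bytes, total_bytes);
        return 0;
    }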

I just learned that the magma_sgeqrf_gpu function is returning the error MAGMA_ERR_HOST_ALLOC. This occurs where the function tries to allocate pinned memory. I might try changing this to a regular host memory allocation and see if that allows the allocation to succeed.
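
Roughly the change I have in mind (a sketch only, using MAGMA's magma_smalloc_pinned / magma_smalloc_cpu allocators; alloc_host_work and lhwork are just illustrative names, and the actual variable names inside sgeqrf_gpu will differ):

    #include <magma.h>

    /* Hypothetical helper: try pinned host memory first, fall back to
     * pageable memory. lhwork is the workspace length in floats.
     * Free with the matching routine, as reported in *pinned:
     * magma_free_pinned() if pinned, magma_free_cpu() otherwise. */
    magma_int_t alloc_host_work(float **hwork, magma_int_t lhwork, int *pinned)
    {
        *pinned = 1;
        if (magma_smalloc_pinned(hwork, lhwork) == MAGMA_SUCCESS)
            return MAGMA_SUCCESS;

        /* This is where magma_sgeqrf_gpu is hitting MAGMA_ERR_HOST_ALLOC
         * for me; fall back to a plain pageable allocation instead. */
        *pinned = 0;
        return magma_smalloc_cpu(hwork, lhwork);
    }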

How big is your matrix? How much CPU and GPU RAM do you have? It's very surprising to have malloc_pinned fail -- it seems to imply that you are running out of physical RAM to hold the CPU workspace, which is only (m + n + nb)*nb, not the whole m*n matrix. It would be helpful to try to replicate the problem using the MAGMA tester, testing/testing_sgeqrf_gpu (the testers take -N m,n to set the dimensions), and to post the complete input and output of the tester here.

From a cursory examination, magma_sgeqrf_gpu does not appear to allocate any GPU memory (other than for streams/queues), since the matrix was already passed in on the GPU.

For workl = (m + n + nb)*nb I'm getting (4847595 + 24 + 512)*512 = 2482243072; however, a regular 32-bit int can't hold a value that large, so it overflows and the result comes out negative (-1812724224). I changed "workl" from an int to an unsigned long int, but the call still doesn't get through the function and I get a segmentation fault.
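
A tiny standalone reproduction of the overflow (the point being that the product has to be computed and stored in a 64-bit type end to end -- changing only my local variable doesn't help if the library still uses 32-bit integers internally):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int64_t m = 4847595, n = 24, nb = 512;

        /* Computed in 64-bit: fine. */
        int64_t workl64 = (m + n + nb) * nb;   /* 2482243072 */

        /* Truncated to 32-bit: wraps to a negative value. */
        int workl32 = (int) workl64;

        printf("64-bit workl = %lld\n", (long long) workl64);
        printf("32-bit workl = %d\n", workl32);
        return 0;
    }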

Several points.

1) You can make magma_int_t 64-bit. You need to link with an ILP64 BLAS and LAPACK library; see make.inc.mkl-ilp64.
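
Once you rebuild MAGMA that way, a quick sanity check that the integer width actually changed (a trivial sketch):

    #include <stdio.h>
    #include <magma.h>

    int main(void)
    {
        /* With an ILP64 build, magma_int_t should be 8 bytes. */
        printf("sizeof(magma_int_t) = %zu\n", sizeof(magma_int_t));
        return 0;
    }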

2) MAGMA will probably not help with such a tall skinny matrix. MAGMA does a panel factorization on the CPU, followed by an update of the trailing matrix on the GPU. The panel width depends on the matrix size, but is always >= 32. Since your entire matrix has fewer columns than that, MAGMA will do the entire factorization on the CPU and no work on the GPU. You could change nb to something small like 8, but I think it would get poor performance. See control/get_nb.cpp.

You could transpose the matrix and then do QR, yielding an L Q^T factorization of the original matrix: if A^T = Q R, then A = R^T Q^T = L Q^T, where L = R^T is lower trapezoidal. That should be fast with MAGMA, since the transposed matrix is wide and most of the work ends up in the GPU trailing-matrix updates. (Sadly, doing LQ of the transposed matrix won't help; MAGMA's LQ is implemented as a transpose followed by QR.)
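
Roughly like this (untested, written against MAGMA 1.x-style signatures -- no queue arguments, magma_get_sgeqrf_nb taking one argument -- with error checks omitted; qr_of_transpose is just an illustrative name, and check the magma_sgeqrf_gpu documentation for the exact dT workspace size in your version):

    #include <stdlib.h>
    #include <magma.h>

    /* Sketch: QR-factor AT = A^T, so that A = R^T Q^T = L Q^T.
     * A is m x n, column-major, tall and skinny (m >> n).
     * Assumes magma_init() has already been called. */
    magma_int_t qr_of_transpose(magma_int_t m, magma_int_t n,
                                const float *A, magma_int_t lda, float *tau)
    {
        magma_int_t info = 0;

        /* Form AT = A^T (n x m, wide) on the CPU for simplicity;
         * magmablas_stranspose can do this on the GPU instead. */
        float *AT = (float*) malloc(sizeof(float) * (size_t) m * n);
        for (magma_int_t j = 0; j < n; ++j)
            for (magma_int_t i = 0; i < m; ++i)
                AT[j + i*n] = A[i + j*lda];

        /* dT workspace size per the magma_sgeqrf_gpu docs:
         * (2*min(rows,cols) + ceil(cols/32)*32) * nb.
         * NB: with a 32-bit magma_int_t these size computations can
         * overflow, which is exactly the ilp64 issue in point 1. */
        magma_int_t nb    = magma_get_sgeqrf_nb(n);   /* rows of AT */
        magma_int_t minmn = (n < m ? n : m);
        magma_int_t ldat  = n;
        float *dAT = NULL, *dT = NULL;
        magma_smalloc(&dAT, n*m);
        magma_smalloc(&dT, (2*minmn + ((m + 31)/32)*32) * nb);

        magma_ssetmatrix(n, m, AT, ldat, dAT, ldat);

        /* QR of the wide n x m matrix: A^T = Q R, hence A = L Q^T. */
        magma_sgeqrf_gpu(n, m, dAT, ldat, tau, dT, &info);

        /* ... retrieve or apply the factors as needed ... */

        magma_free(dT);
        magma_free(dAT);
        free(AT);
        return info;
    }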

3) PLASMA might be a better option for a tall skinny matrix, using multi-core CPUs. It has a hierarchical QR function to achieve parallelism: http://icl.cs.utk.edu/plasma/