Thanks for your reply! It is working now and I am really amazed by the speed. I attached the final code.

Quick question: I want to solve a linear system, but my matrix X is huge and rectangular and does not fit on the GPU, so I factor it with magma_dgeqrf(). I could reconstruct Q myself from the elementary reflectors, but I only need its first N columns. How should I call magma_dorgqr() in this case? Its dT argument confuses me, since it seems to expect data produced by magma_dgeqrf_gpu(), which I did not use.