Using the time and difftime functions, I found that for a 5000x5000 matrix magma_dgetmatrix takes 9 seconds to execute. Is it always this slow, or is the problem my video card, an NVidia GeForce GT 424M? The distro is Debian Wheezy, and I used the distro's drivers.

magma_dgetmatrix is a thin wrapper around cublasGetMatrix, mainly for platform independence and type checking, so it adds essentially no overhead of its own. The performance issue may be your PCIe bus. A setmatrix of the same size should take about the same time as the getmatrix, so timing both is a quick check.
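As a rough sanity check on the numbers, here is the arithmetic for a 5000x5000 double-precision matrix. The bandwidth $B \approx 3$ GB/s is only an assumed ballpark for a PCIe transfer, not a measured value for your card:

```latex
\begin{align*}
\text{size} &= 5000 \times 5000 \times 8\ \text{bytes} = 2 \times 10^{8}\ \text{bytes} = 200\ \text{MB} \\
t_{\text{expected}} &\approx \frac{2 \times 10^{8}\ \text{bytes}}{3 \times 10^{9}\ \text{bytes/s}} \approx 0.07\ \text{s}
  \quad \text{(assuming } B \approx 3\ \text{GB/s)} \\
B_{\text{observed}} &= \frac{2 \times 10^{8}\ \text{bytes}}{9\ \text{s}} \approx 2.2 \times 10^{7}\ \text{bytes/s} \approx 22\ \text{MB/s}
\end{align*}
```

An effective rate of about 22 MB/s is far below what even a slow bus would normally give, which suggests the 9 seconds probably includes something other than the copy itself.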

Also be careful when timing asynchronous functions. getri_gpu may be asynchronous (i.e., it may return before the GPU has finished), in which case the getmatrix would appear to take much longer because it has to wait for getri to finish. It is best to call cudaDeviceSynchronize() before each timer call if you're not sure whether calls are asynchronous. For example:
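Here is a minimal sketch of that timing pattern using only the CUDA runtime API. The kernel busy_kernel is a hypothetical stand-in for an asynchronous routine such as getri_gpu, and a plain cudaMemcpy stands in for magma_dgetmatrix; the names, the iteration count, and the helper now_sec are illustrative assumptions, not part of MAGMA. Note also that time/difftime only resolve whole seconds, so the sketch uses a finer-grained clock.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical stand-in for an asynchronous GPU routine (e.g. getri_gpu).
__global__ void busy_kernel(double *a, size_t n, int iters) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        double x = a[i];
        for (int k = 0; k < iters; ++k) x = x * 1.0000001 + 1e-9;
        a[i] = x;
    }
}

// Wall-clock time in seconds from a monotonic clock.
static double now_sec() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(
               clock::now().time_since_epoch()).count();
}

// Time the kernel and the device-to-host copy, with or without
// synchronizing before each timer read.
static void run(double *hA, double *dA, size_t n, bool sync) {
    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);

    double t0 = now_sec();
    busy_kernel<<<blocks, threads>>>(dA, n, 200);  // returns immediately
    CHECK(cudaGetLastError());
    if (sync) CHECK(cudaDeviceSynchronize());      // wait for the kernel
    double t1 = now_sec();

    CHECK(cudaMemcpy(hA, dA, n * sizeof(double), cudaMemcpyDeviceToHost));
    if (sync) CHECK(cudaDeviceSynchronize());
    double t2 = now_sec();

    printf("%s sync: kernel %.4f s, copy %.4f s\n",
           sync ? "with" : "without", t1 - t0, t2 - t1);
}

int main() {
    const size_t n = (size_t)5000 * 5000;          // 5000x5000 doubles
    double *hA = (double *)malloc(n * sizeof(double));
    double *dA = NULL;
    if (hA == NULL) return EXIT_FAILURE;
    CHECK(cudaMalloc(&dA, n * sizeof(double)));
    CHECK(cudaMemset(dA, 0, n * sizeof(double)));

    run(hA, dA, n, false);  // kernel time gets charged to the copy
    run(hA, dA, n, true);   // each phase is timed on its own

    CHECK(cudaFree(dA));
    free(hA);
    return 0;
}
```

Without the synchronization, the kernel launch should look nearly free and its run time gets charged to the copy, because the copy has to wait for the kernel to finish. With the synchronization, the copy time reflects only the transfer, which is the number to compare against your PCIe bandwidth.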