No, I wouldn't expect it to take that long. magma_init queries the CUDA devices for their architecture (Fermi, Kepler, ...) and possibly makes a queue (stream) on each GPU. If you include magma_v2.h instead of magma.h, you can try adding -DMAGMA_NO_V1 to CFLAGS in make.inc and recompiling MAGMA. That disables creating any streams in magma_init.
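For reference, the make.inc change might look like the fragment below (a sketch; the exact variable names and layout in make.inc vary between MAGMA versions):

```
# make.inc fragment: define MAGMA_NO_V1 so magma_init skips
# creating the v1 per-device default queues (streams).
# Requires that your code uses the v2 API (magma_v2.h) exclusively.
CFLAGS    += -DMAGMA_NO_V1
NVCCFLAGS += -DMAGMA_NO_V1
```

After editing make.inc, do a clean rebuild of MAGMA so the flag takes effect everywhere.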

What GPUs do you have? E.g., what is the output of a MAGMA tester (as below)?

I was surprised these were that large. On any particular machine, the results varied a bit from run to run, perhaps 20%. Looking deeper, nearly all the time in magma_init is taken by cudaGetDeviceCount. If I put a different CUDA call, such as cudaGetDeviceProperties, before cudaGetDeviceCount, then cudaGetDeviceProperties takes all the time and cudaGetDeviceCount takes negligible time. I surmise it is the overhead of loading and initializing the CUDA runtime, which happens on the first CUDA call, whichever call that is. You can probably use CUDA's profiler to see this overhead.
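A minimal sketch to see this yourself, assuming a CUDA toolkit is installed (compile with nvcc; the timing helper uses POSIX clock_gettime):

```c
// Sketch: the first CUDA runtime call absorbs the context-initialization
// cost, whatever that call happens to be; repeat calls are cheap.
// Compile: nvcc -o first_call first_call.cu
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

static double wall_time( void )
{
    struct timespec ts;
    clock_gettime( CLOCK_MONOTONIC, &ts );
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main( void )
{
    int count;
    double t;

    t = wall_time();
    cudaGetDeviceCount( &count );   // first CUDA call: pays the init cost
    printf( "1st cudaGetDeviceCount: %.3f sec\n", wall_time() - t );

    t = wall_time();
    cudaGetDeviceCount( &count );   // subsequent call: negligible time
    printf( "2nd cudaGetDeviceCount: %.3f sec\n", wall_time() - t );

    return 0;
}
```

Running the same program under a CUDA profiler (nvprof or Nsight Systems) should likewise attribute the startup overhead to the first runtime call.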