I am using MAGMA to do QR factorization on multiple GPUs. The problem I am facing is that factorizing on 4 GPUs takes the same time as factorizing on 1 GPU. I am using a system with 4 Tesla GPUs. Any hints about what I may be doing wrong? I am using the function "magma_dgeqrf2_mgpu". I also got the same performance when I used the example in the testing folder, "testing_dgeqrf_mgpu.cpp".

Can you post the complete input & output of the tester, including what the environment variables OMP_NUM_THREADS, MKL_NUM_THREADS, and MAGMA_NUM_GPUS are set to, if you set any of those? For instance, see below.
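If it helps when collecting that information, here is a minimal sketch in Python that prints the three variables exactly as the tester's process would see them. The helper name report_env is made up for illustration and is not part of MAGMA; it only reads the environment and changes nothing.

```python
import os

# Variables named in the request above; adjust the tuple if you set others.
NAMES = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "MAGMA_NUM_GPUS")

def report_env(names=NAMES):
    """Return one 'NAME=value' line per variable, or 'NAME=(unset)' if it is missing."""
    lines = []
    for name in names:
        value = os.environ.get(name)
        lines.append(f"{name}={value if value is not None else '(unset)'}")
    return lines

if __name__ == "__main__":
    print("\n".join(report_env()))
```

Run it from the same shell you launch the tester from, so it reports the environment the tester actually inherits, and paste the output along with the tester's.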

Also, your make.inc file would be helpful, and any environment variables you set for that, such as GPU_TARGET.

If by "ubuntu blas" you mean the libblas3 Ubuntu package, that is the unoptimized reference BLAS and will be exceedingly slow. Try libopenblas-base, which is an optimized BLAS (linear algebra) library based on GotoBLAS2, or get ATLAS (libatlas3-base), which is another optimized BLAS library. Ideally, you would use a multi-threaded BLAS library.
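To see which BLAS a dynamically linked tester actually picks up, one option is to look at the output of ldd. The following is only a sketch, assuming a Linux system with ldd available; the function name blas_libraries and the keyword list are my own choices for illustration, not anything MAGMA provides. Resolving the real path matters on Ubuntu because libblas.so.3 may be a symlink to whichever implementation is currently selected.

```python
import os
import subprocess
import sys

# Substrings that suggest a BLAS/LAPACK library (an assumption; extend as needed).
KEYWORDS = ("blas", "atlas", "mkl", "lapack")

def blas_libraries(ldd_output):
    """Return (name, path) pairs for ldd lines that look like BLAS/LAPACK libraries."""
    found = []
    for line in ldd_output.splitlines():
        if "=>" not in line:
            continue
        name, _, rest = line.partition("=>")
        name = name.strip()
        rest = rest.strip()
        # ldd prints "not found" instead of a path for unresolved libraries.
        path = "" if rest.startswith("not found") or not rest else rest.split()[0]
        if any(k in name.lower() for k in KEYWORDS):
            found.append((name, path))
    return found

if __name__ == "__main__" and len(sys.argv) > 1:
    out = subprocess.run(["ldd", sys.argv[1]], capture_output=True, text=True).stdout
    for name, path in blas_libraries(out):
        # realpath follows symlinks to the library file that is really loaded.
        print(name, "->", os.path.realpath(path) if path else "not found")
```

Pass the path to the tester binary as the only argument. If the BLAS was linked statically, nothing will show up here and the make.inc file is the place to look.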

Again, it would be helpful to have your make.inc file and your complete input (command line and relevant environment variables), as well as the output, as I showed.