I am calling the "zgesv" subroutine in CULA, namely cula_zgesv, with MPI. If I use one MPI process with one K20c card, it takes 0.7435620 seconds for a 3000*3000 matrix. However, if I use two MPI processes with two K20c cards, it takes 7.932089 seconds, which is more than ten times longer. For a 2000*2000 matrix, it takes 0.28 seconds and 5.6 seconds respectively. Why is there such a big difference? Is it a terrible bug, or have I just done something wrong? In the following, I paste my code (it is a very simple code):

Since there is no communication between the different MPI processes, I think the time should be approximately the same whether one or two processes are used. In fact, the time is the same when I call the "zgemm" subroutine in CUBLAS with one or two MPI processes.
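To put numbers on the gap, here is a small sketch that computes the slowdown factor from the timings quoted above. The names `timings` and `slowdown` are only illustrative and are not part of CULA or MPI; the values are the ones reported in this post.

```python
# Wall-clock timings from this post, in seconds, keyed by matrix size N (N*N matrix).
# Each entry is (one MPI process / one K20c, two MPI processes / two K20c).
timings = {
    3000: (0.7435620, 7.932089),
    2000: (0.28, 5.6),
}


def slowdown(one_proc, two_proc):
    """Return how many times slower the two-process run is than the one-process run."""
    return two_proc / one_proc


# Slowdown factor per matrix size.
factors = {n: slowdown(*t) for n, t in timings.items()}

for n, f in factors.items():
    print(f"{n}*{n}: {f:.1f} times slower with two processes")
```

If the two processes were truly independent, both factors would be close to 1, so the gap is well outside ordinary timing noise.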

So, could anybody tell me why I see such a problem in CULA but not in CUBLAS and how to deal with it? I have been confused about it for several months.

Last edited by xhsh on Sun Dec 22, 2013 12:55 am, edited 2 times in total.