I am trying to execute 2 totally independent matrix calculations at the same time using magmablas_sgemm. I am thinking of using CUDA streams; however, for that to work I believe the magmablas_sgemm calls must be asynchronous, otherwise control won't return to the CPU and it won't be able to start the second magmablas_sgemm. Am I right? Would it be possible?

Both cublas and magmablas gemm calls are asynchronous. In fact, nearly all cublas and magmablas functions are asynchronous (but not higher-level magma algorithms such as getrf). With recent performance improvements in CUDA 5.0, I actually recommend using cublas gemm.

Yes, you can use streams to execute multiple gemms simultaneously. This is helpful for small gemms -- specifically, where the output matrix (C) is small. For large gemms, each gemm will basically fill up the whole GPU, so there is no benefit to attempting to execute gemms simultaneously.

Thanks for the info, very useful. If it is possible to execute gemm with streams, how can I indicate which stream sgemm should execute on? The definition of magmablas_sgemm does not have any parameter for streams. I have to say that in our case, with matrices of 9000x24000 float elements, magma behaves similarly to cublas, achieving similar speedups.

With the new cublas interface (cublas_v2.h), you set the stream on the cublas handle using cublasSetStream(), then pass the handle to the cublas gemm function. With the old cublas interface (cublas.h), you set the stream globally using cublasSetKernelStream(); subsequent cublas calls use that stream. With magma, you set the stream globally using magmablasSetKernelStream(); subsequent magmablas and cublas calls use that stream.
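A minimal sketch of the v2-interface approach (the pointers dA1, dB1, dC1, dA2, dB2, dC2 are placeholders assumed to be allocated and filled on the device already; error checking is omitted for brevity):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Two independent sgemms, each bound to its own stream via the handle.
 * Both cublasSgemm calls return immediately; the kernels may overlap
 * only if each one leaves GPU resources to spare. */
void two_gemms(int m, int n, int k,
               const float *dA1, const float *dB1, float *dC1,
               const float *dA2, const float *dB2, float *dC2)
{
    cublasHandle_t handle;
    cudaStream_t s1, s2;
    const float alpha = 1.0f, beta = 0.0f;

    cublasCreate(&handle);
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    /* first gemm issued on stream s1 */
    cublasSetStream(handle, s1);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA1, m, dB1, k, &beta, dC1, m);

    /* second gemm issued on stream s2 */
    cublasSetStream(handle, s2);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA2, m, dB2, k, &beta, dC2, m);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cublasDestroy(handle);
}
```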

but I have not managed to make them run simultaneously, i.e. they don't overlap. Ideally I would like them to overlap in time, since they don't depend on each other. How could I do it? Do you have any ideas?

Each of your gemms looks like it fills up the whole GPU for over 1 second, which I would expect for 9000x24000 matrices. So I would not expect them to overlap. Even if they did overlap, the total time would not decrease: each gemm would get half the GPU for twice as long, together still taking 2 seconds. In other words, for this problem size, I see no advantage to overlapping the gemms, nor any way to force them to overlap.

What if I use multiple GPUs? In theory, using 2 GPUs would divide the time by two; is there any reason I couldn't take advantage of that? Could magmablas_sgemm access another GPU's memory addresses, as would happen with a CUDA kernel? Cheers,

Yes, you can divide the matrix in half and compute the gemm in parallel. For instance,

A [ B0 B1 ] = [ C0 C1 ]

becomes A B0 = C0 and A B1 = C1. Note that while B and C are split, the matrix A must be duplicated on the two GPUs. Or you can split the other way:

[ A0 ]       [ C0 ]
[ A1 ] B  =  [ C1 ]

For most linear algebra algorithms, distributing the matrix in a block-column cyclic fashion, or sometimes block-row cyclic, is efficient for a small number of GPUs.