I obtained some (possibly meaningless) GFLOPS values for a C2050 with a testing_dgemm run where small 32 x 32 matrices were used. However, a comparable dgemm call takes only about 1E-5 sec on one Nehalem/2.7 GHz core.

But my question was a bit simpler: what exactly does the get_current_time call measure, and what does the difference between the end and start times therefore include (does it include the PCI transfer time, as I believe)? And what is the resolution of get_current_time?

Function get_current_time calls gettimeofday, so the resolution is a microsecond. Before calling gettimeofday there is a call to cudaThreadSynchronize() to make sure previous GPU tasks have completed. Thus one can measure the time of a particular GPU kernel by surrounding it with calls to get_current_time. If there are functions transferring data between two get_current_time calls, the measured time will include the time for the transfer.

There are also cublasSetMatrix calls before the magma_dgemm call. Do I understand correctly that these calls perform array allocation in GPU global memory, and are therefore negligible in the total execution time even for 32x32 matrices?

It looks like magma_dgemm itself includes host-device data exchanges (right?).

No. We measure the time for dgemm on the GPU, i.e., we assume the data and the result are in GPU memory.

There are also cublasSetMatrix calls before the magma_dgemm call. Do I understand correctly that these calls perform array allocation in GPU global memory, and are therefore negligible in the total execution time even for 32x32 matrices?

This call is not allocating memory; the memory allocation happens earlier. This call only sets up the matrix values in GPU memory by copying them from CPU memory. The transfer of a 32x32 matrix will take a significant fraction of the magma_dgemm execution time.

It will depend on what you need to accelerate. If you have the matrix on the CPU, want the result on the CPU as well, and want to check whether a GPU can accelerate this, you must modify the testing_dgemm code to include the memory transfers. The current MAGMA GEMM is an optimized implementation of DGEMM for the GPU, where the inputs and the output are in GPU memory. A CPU-interface GEMM must be hybrid, taking into account the transfer times and the CPU and GPU computational power, e.g., see