The paper probably compares 1 CPU core/thread against 1 GPU. It is a common practice: most GP-GPU work compares GPU performance against single CPU core/thread performance. Anyway, are you using a multithreaded LAPACK library on the CPU side? This could explain the discrepancy you see in the CPU performance.
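A quick way to check is to pin MKL to a single thread and time the same routine both ways. A minimal sketch (mkl_set_num_threads is MKL's standard service routine; the matrix size and workspace here are arbitrary):

  program baseline_check
    implicit none
    integer, parameter :: n = 4000
    double precision, allocatable :: a(:,:), d(:), e(:), tau(:), work(:)
    integer :: info, lwork
    integer(8) :: c0, c1, rate

    call mkl_set_num_threads(1)   ! single-core baseline; raise to use all cores

    allocate(a(n,n), d(n), e(n-1), tau(n-1))
    call random_number(a)
    a = a + transpose(a)          ! symmetrize so dsytrd input is valid

    lwork = 64*n                  ! generous workspace for dsytrd
    allocate(work(lwork))

    call system_clock(c0, rate)
    call dsytrd('U', n, a, n, d, e, tau, work, lwork, info)
    call system_clock(c1)
    print '(a,f8.3,a,i0)', 'dsytrd time: ', dble(c1-c0)/dble(rate), ' s, info = ', info
  end program baseline_check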

mtacconi wrote: [...] The paper probably compares 1 CPU core/thread against 1 GPU. [...] Anyway, are you using a multithreaded LAPACK library on the CPU side?

The paper claims they are using "MKL's parallel BLAS" with "MKL 10.0". I'm using MKL as well. I'd think that hindering the multi-core BLAS would hurt the GPU performance as well, considering MAGMA is a hybrid CPU+GPU implementation.

I see your point, but some time ago I tried the Hessenberg factorization routine dgehrd of MAGMA 0.2, and I was able to see the claimed 20x speed-up only against a single core/thread. Here are the speed-ups I recorded:

From those results I concluded at the time that they were comparing against 1 CPU thread. I have to say I didn't investigate any further, because I was, and still am, mainly interested in the tridiagonalization routine and the Hermitian eigenproblem.

The testing code is written in Fortran 95 plus some Fortran 2003 extensions (the iso_c_binding module); PM me if interested. A sketch of the kind of binding involved is below.
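A minimal sketch of such a binding — here declaring the legacy (v1) CUBLAS dsymv via iso_c_binding. The signature follows my reading of the v1 cublas.h of that era; verify it against your CUDA headers before relying on it:

  module cublas_iface
    use iso_c_binding
    implicit none
    interface
       subroutine cublasDsymv(uplo, n, alpha, A, lda, x, incx, beta, y, incy) &
            bind(c, name='cublasDsymv')
         import :: c_char, c_int, c_double, c_ptr
         character(kind=c_char), value :: uplo
         integer(c_int), value :: n, lda, incx, incy
         real(c_double), value :: alpha, beta
         type(c_ptr), value :: A, x, y   ! device pointers, e.g. from cublasAlloc
       end subroutine cublasDsymv
    end interface
  end module cublas_iface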

Thanks for reporting this and trying to figure out the reasons for these discrepancies. They are due to the GPU BLAS implementation used: in the paper we used customized BLAS kernels that are not yet in the release. The high-level algorithms, though, as described in the paper, are in the release.

Speaking specifically about the SSYTRD routine, its performance critically depends on the speed of SSYMV (as 50% of the flops are in SYMV). Theoretically SSYMV can run at up to 142 GFlop/s on a GTX 280 (bus speed 142 GB/s), so if that were available, the SSYTRD from MAGMA 1.0 RC3 would run asymptotically at above 142 GFlop/s. In reality, though, this SYMV performance is not achievable. CUBLAS SSYMV runs at below 10 GFlop/s, and as a result the MAGMA SSYTRD, using CUBLAS SSYMV, runs at about that speed as well.

The paper used an SSYMV kernel running at up to ~80 GFlop/s, so the MAGMA SSYTRD using that kernel reaches about that speed. Although this may sound impressive, there is obviously a lot of room for improvement. Indeed, we developed another SSYMV (shortly after the paper was submitted) that reached a little above 100 GFlop/s, and along with other optimizations the SSYTRD actually reached close to 120 GFlop/s.
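For reference, the back-of-the-envelope arithmetic behind that 142 GFlop/s bound (assuming only the stored triangle of A is read and the vector traffic is negligible): SSYMV performs 2n^2 flops and must move at least (n^2/2) * 4 = 2n^2 bytes in single precision, i.e. 1 flop per byte, so a 142 GB/s bus caps it at 142 GB/s * 1 flop/byte = 142 GFlop/s.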

The development of BLAS consumes a lot of effort, especially with GPU changes coming frequently. For example, on Fermi we had to redesign some BLAS algorithms. Level 2 BLAS on Fermi is also slow, as the bus bandwidth was not increased while ECC was added, further reducing the bandwidth available to users. Therefore we may even consider dropping MAGMA BLAS support from MAGMA. The CUBLAS GEMM is based on the MAGMA GEMM, so similarly, we would be happy to provide highly optimized MAGMA BLAS to NVIDIA to be incorporated and maintained in CUBLAS.

That's correct. The currently released MAGMA 1.0 RC3 SSYMV allocates and frees work memory inside the kernel, just to be compliant with the standard BLAS interface. This by itself reduces performance by about 20 GFlop/s (compared to an expert interface where the workspace is passed in from outside the routine). What is released is still much faster than CUBLAS SSYMV, but it is not the 100 GFlop/s kernel.
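To make the interface difference concrete, here is a generic Fortran sketch of the pattern (the names are hypothetical, not MAGMA's actual API; in the GPU kernel the allocate/free is presumably a cudaMalloc/cudaFree pair, which is where the ~20 GFlop/s goes):

  module workspace_sketch
    implicit none
  contains
    ! Expert interface: workspace supplied by the caller, so repeated
    ! calls pay no allocation cost.
    subroutine op_expert(n, x, y, work)
      integer, intent(in) :: n
      double precision, intent(in)  :: x(n)
      double precision, intent(out) :: y(n)
      double precision, intent(inout) :: work(n)
      work = 2.0d0*x            ! stand-in for the real computation
      y = work + x
    end subroutine op_expert

    ! Standard-compliant wrapper: allocates and frees the workspace on
    ! every call, just to keep the conventional argument list.
    subroutine op_simple(n, x, y)
      integer, intent(in) :: n
      double precision, intent(in)  :: x(n)
      double precision, intent(out) :: y(n)
      double precision, allocatable :: work(:)
      allocate(work(n))
      call op_expert(n, x, y, work)
      deallocate(work)
    end subroutine op_simple
  end module workspace_sketch

Stan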

In this way it is actually possible to squeeze some more GFlop/s out of the current MAGMA release: from a somewhat disappointing 19 GFlop/s for dsytrd using cublasDsymv to a more comfortable sustained 29 GFlop/s using magma_dsymv6_fermi :)