You are not measuring the time the right way. The time you get is user + system time, summed over the 8 threads, so you have to divide by 8 to get an approximation of the elapsed time. We use this function to measure the time:
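To illustrate the difference between the two kinds of measurement, here is a minimal sketch assuming a POSIX system. The helper names wall_time and cpu_time are hypothetical and are not PLASMA's own timer; the point is only that clock() accumulates CPU time over all threads of the process, while gettimeofday() reports elapsed wall-clock time.

```c
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

/* Hypothetical helper: elapsed wall-clock time in seconds.
 * This is the quantity to use when timing a multithreaded call. */
static double wall_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* Hypothetical helper: CPU time consumed by the whole process, in seconds.
 * On POSIX systems this is summed over all threads, so it grows roughly
 * N times faster than wall-clock time when N threads are busy. */
static double cpu_time(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}
```

Calling wall_time() just before and just after the multiplication and subtracting the two values gives the elapsed time. The same difference taken with cpu_time() would come out roughly 8 times larger when 8 threads are busy, which matches what you observed.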

By the way, CORE_dgemm is less than 8 times slower than PLASMA_dgemm because PLASMA_dgemm has to convert the matrices to block data layout before applying the gemm, and then convert the result back to LAPACK layout to return it to you.

The command-line interface of time_dgemm is not the most intuitive. Your K dimension was set to 1 in all cases. You need to use the --nrhs=X option to set it to a more reasonable size. Also, if your system is a NUMA system, using numactl --interleave=all usually improves performance. Try the following call: