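So you probably know that dgemm_tile is faster than dgemm because it skips the layout translation.

So your question is about the drop-off when exceeding 12 cores.

The first thing that comes to mind is a NUMA effect: if the extra cores sit on a different NUMA node, their threads may be reaching for memory that was allocated on the first node. Try running with numactl --interleave=all, which spreads memory pages across all NUMA nodes.

Jakub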