Whitepaper: Efficient Parallel In-place Square Matrix Transposition

Colfax Research has released a new whitepaper by Andrey Vladimirov entitled: Multithreaded Transposition of Square Matrices with Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors.

In-place matrix transposition, a standard operation in linear algebra, is a memory bandwidth-bound operation. The theoretical maximum performance of transposition is the memory copy bandwidth. However, due to non-contiguous memory access in the transposition operation, practical performance is usually lower. The ratio of the transposition rate to the memory copy bandwidth is a measure of the transposition algorithm efficiency. This paper demonstrates and discusses an efficient C language implementation of parallel in-place square matrix transposition. For large matrices, it achieves a transposition rate of 49 GB/s (82% efficiency) on Intel Xeon CPUs and 113 GB/s (67% efficiency) on Intel Xeon Phi coprocessors.

Resource Links:

Latest Video

Industry Perspectives

"Exascale computers are going to deliver only one or two per cent of their theoretical peak performance when they run real applications; and both the people paying for, and the people using, such machines need to have realistic expectations about just how low a percentage of the peak performance they will obtain." [Read More...]