Technical Report 2016-010

Afshin Zafari, Elisabeth Larsson, and Martin Tillenius

June 2016

Abstract:

Current high-performance computer systems used for scientific computing typically combine shared-memory compute nodes in a distributed-memory environment. Extracting high performance from these complex systems requires tailored approaches. Task-based parallel programming has been successful both in simplifying the programming and in exploiting the available hardware parallelism. We have previously developed a task library for shared-memory systems that performs well compared with other libraries. Here we extend this approach to distributed-memory architectures. We use a hierarchical decomposition of tasks and data to accommodate the different levels of hardware. Our experiments with a distributed Cholesky factorization show that our framework, DuctTeip, has low overhead and scales to at least 800 cores. A comparison with related frameworks shows that DuctTeip is highly competitive in its class.