Technical Report, 2016

Single-threaded tasks are the basic unit of scheduling in modern runtimes targeting multicore hardware. However, as energy and performance become increasingly dominated by communication costs, the basic computational building block of applications should move away from this thread-centric view; instead, cache-centric views should become a first-class abstraction.
This technical report describes a parallel execution model called TAO, and its supporting runtime, designed to handle computations with strict caching requirements. The central component of the model is the self-managed parallel computation called a task assembly object (TAO). A task assembly aggregates i) tasks, ii) caches and iii) a private scheduler into an atomic unit that is scheduled in bulk onto a cache partition of the multicore topology. Such cache-centric scheduling allows TAO to better utilize the resources of communication-constrained compute platforms.
TAO applications follow a flow-of-parallel-patterns programming model, which encapsulates semantic knowledge about locality-aware computing. We evaluate the prototype TAO runtime (go:tao) on three applications with common parallel patterns: the Unbalanced Tree Search benchmark (mixed-mode parallelism), a 2D Jacobi iteration (stencil pattern) and a parallel sorting code (reduction pattern). On a system with eight NUMA nodes, three levels of cache and 48 cores in total, go:tao shows excellent scalability and reduced communication compared to contemporary thread-centric runtimes.