Publication

Modern NUMA multi-core machines exhibit complex latency
and throughput characteristics, making it hard to
allocate memory optimally for a given program’s access
patterns. However, sub-optimal allocation can significantly
impact performance of parallel programs.
We present an array abstraction that allows data placement
to be automatically inferred from program analysis,
and implement the abstraction in Shoal, a runtime library
for parallel programs on NUMA machines. In Shoal,
arrays can be automatically replicated, distributed, or
partitioned across NUMA domains based on annotating
memory allocation statements to indicate access patterns.
We further show how such annotations can be automatically
provided by compilers for high-level domain-specific
languages (for example, the Green-Marl graph
language). Finally, we show how Shoal can exploit additional
hardware such as programmable DMA copy engines
to further improve parallel program performance.
We demonstrate significant performance benefits from
automatically selecting a good array implementation
based on memory access patterns and machine characteristics.
We present two case studies: (i) Green-Marl,
a graph analytics workload using automatically annotated
code based on information extracted from the high-level
program, and (ii) a manually annotated version of
the PARSEC Streamcluster benchmark.
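
To make the idea of choosing an array implementation from access-pattern annotations concrete, the following is a minimal sketch of one plausible selection rule. It is not Shoal's API or its actual policy: the names `AccessHint`, `Placement`, and `choose_placement`, the two hint fields, and the capacity check are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    REPLICATED = "replicated"    # one full copy of the array per NUMA domain
    PARTITIONED = "partitioned"  # each domain holds the chunk its threads use
    DISTRIBUTED = "distributed"  # memory spread evenly across all domains


@dataclass
class AccessHint:
    """Hypothetical annotation attached to a memory allocation statement."""
    read_only: bool            # array is not written after initialization
    thread_local_chunks: bool  # each thread touches only its own index range


def choose_placement(hint: AccessHint, array_bytes: int,
                     free_bytes_per_domain: int) -> Placement:
    """Pick a placement from an access hint (assumed heuristic)."""
    # Read-only data can be copied into every domain, provided a copy fits.
    if hint.read_only and array_bytes <= free_bytes_per_domain:
        return Placement.REPLICATED
    # Disjoint per-thread ranges can live next to the threads that use them.
    if hint.thread_local_chunks:
        return Placement.PARTITIONED
    # Otherwise spread the array to balance load on the memory controllers.
    return Placement.DISTRIBUTED


if __name__ == "__main__":
    hint = AccessHint(read_only=True, thread_local_chunks=False)
    print(choose_placement(hint, array_bytes=1 << 20,
                           free_bytes_per_domain=1 << 30).value)
```

In this sketch, machine characteristics enter only through the per-domain free-memory check; a real runtime would likely weigh more factors, such as interconnect topology and the number of domains.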