Abstract

It is expected that the first exascale supercomputer will be deployed within the next ten years; however, neither its CPU architecture nor its programming model is known yet. Multicore CPUs are not expected to scale to the required number of cores per node, but hybrid multicore CPUs consisting of different kinds of processing elements are expected to solve this issue. They come at the cost of increased software development complexity, e.g., due to missing cache coherency and on-chip NUMA effects. It is unclear whether MPI and OpenMP will scale to exascale systems and support the easy development of scalable and efficient programs. One of the programming models considered as an alternative is the so-called partitioned global address space~(PGAS) model, which targets ease of development by providing one common memory address space across all cluster nodes. In this paper we first outline current and possible future hardware and introduce a new abstract hardware model able to describe hybrid clusters. We discuss how current shared memory, GPU and PGAS programming models can deal with the upcoming hardware challenges, and describe how synchronization can generate unneeded inter- and intra-node transfers when the memory consistency model is not optimal. As a major contribution, we introduce our variation of the PGAS model, which allows implicit fine-grained pair-wise synchronization among nodes and among the different kinds of processors. It furthermore offers easy deployment of RDMA transfers and provides the communication algorithms commonly used in MPI collective operations, but lifts the requirement that the operations be collective. Our model is based on single-assignment variables and uses a data-flow-like synchronization mechanism: reading an uninitialized variable blocks the reading thread until the data are made available by another thread. That way, synchronization happens implicitly when data are read.
Explicit tiling is used to reduce synchronization overhead and to increase cache and network utilization. Broadcast, scatter and gather are modeled based on the data distribution among the nodes, whereas reduction and scan follow a combining-PRAM approach in which multiple threads write to the same memory location. We discuss the Gauß-Seidel stencil, bitonic sort, FFT and a manual scan implementation in our model. We have implemented a proof-of-concept library that demonstrates the usability and scalability of the model. With this library, the Gauß-Seidel stencil scaled well in initial experiments on an 8-node machine, and we show that it is easy to keep two GPUs and multiple cores busy when computing a scan.
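For illustration only, the single-assignment, data-flow-style synchronization described above can be sketched in a few lines of Python. The class name and API below are assumptions made for this sketch, not the interface of the paper's library; the point is merely that a read of an uninitialized variable blocks until some other thread performs the (single) write:

```python
import threading

class SingleAssignmentVar:
    """Hypothetical sketch of a single-assignment variable:
    reads block until a value has been written exactly once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ready = threading.Event()
        self._value = None

    def write(self, value):
        with self._lock:
            if self._ready.is_set():
                raise RuntimeError("single-assignment variable written twice")
            self._value = value
            self._ready.set()  # wake all blocked readers

    def read(self):
        self._ready.wait()  # block until data are made available
        return self._value

# A reader thread blocks in read() until the main thread writes;
# synchronization happens implicitly through the data access.
x = SingleAssignmentVar()
result = []
reader = threading.Thread(target=lambda: result.append(x.read()))
reader.start()
x.write(42)
reader.join()
```

In a real PGAS setting the write would of course originate on a remote node (e.g., via an RDMA transfer) rather than a local thread, but the blocking-read semantics are the same.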