Project Summary

Understanding how fluids and plasmas behave under complex physical conditions underlies some of the most important questions that researchers try to answer. These range from practical solutions to engineering problems to cosmic structure formation and evolution. Numerical simulations of fluids in astrophysics and computational fluid dynamics (CFD) are among the most computationally demanding calculations in terms of sustained floating-point operations per second (FLOP/s). They are expected to benefit greatly from future Exascale computing infrastructures, which will perform 10^18 FLOP/s. Scenarios of this type push the computational astrophysics and CFD fields well into sustained Exascale computing. Today, they can only be tackled by reducing the scale, the resolution, and/or the dimensionality of the problem, or by using approximate versions of the physics involved. How this affects the outcome of the simulations, and therefore our knowledge of the problem, is still not well understood.

The simulation codes used in numerical astrophysics and CFD (hydrocodes, hereafter) are numerous and varied. Most of them rely on a hydrodynamics solver that calculates the evolution of the system under study along with all the coupled physics. Among these hydrodynamics solvers, the Smoothed Particle Hydrodynamics (SPH) technique is a purely Lagrangian method with no underlying mesh, in which the fluid can move freely in a practically boundless domain, which is very convenient for astrophysics and CFD simulations. SPH codes are very important in astrophysics because they couple naturally with the fastest and most efficient gravity solvers, such as tree codes and fast multipole methods. Nevertheless, parallelizing SPH codes is not straightforward: their boundless nature and the lack of a structured grid cause the interactions between fluid elements, or between fluid elements and mechanical structures, to change continuously from one time-step to the next. This adds a layer of complexity to parallelizing SPH codes, yet it also makes them an attractive and challenging application for the computer science community in view of the parallelization and scalability challenges of upcoming Exascale computing systems.
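To make the mesh-free character of SPH concrete, the following minimal sketch computes the standard SPH density estimate, rho_i = sum_j m_j W(|r_i - r_j|, h), by direct summation over particles. It is only an illustration under stated assumptions: the cubic spline kernel, the single global smoothing length h, and the function names are choices made here, not a description of SPHYNX, ChaNGa, or SPH-flow, which may use different kernels, adaptive smoothing lengths, and tree-based neighbour searches instead of the O(N^2) loop shown.

```python
import math


def cubic_spline_kernel(r, h):
    """3D cubic spline (M4) kernel with compact support of radius 2h."""
    q = r / h
    sigma = 1.0 / (math.pi * h**3)  # 3D normalization constant
    if q < 1.0:
        return sigma * (1.0 - 1.5 * q**2 + 0.75 * q**3)
    if q < 2.0:
        return sigma * 0.25 * (2.0 - q) ** 3
    return 0.0  # particles farther than 2h do not interact


def sph_density(positions, masses, h):
    """Density at every particle by direct summation over all particles.

    Assumption: one global smoothing length h (hypothetical simplification).
    The set of contributing neighbours depends only on current particle
    positions, so it changes whenever the particles move.
    """
    densities = []
    for xi in positions:
        rho = 0.0
        for xj, mj in zip(positions, masses):
            rho += mj * cubic_spline_kernel(math.dist(xi, xj), h)
        densities.append(rho)
    return densities
```

Because the neighbour set of each particle is rebuilt from positions at every step, there is no fixed stencil to partition once and reuse, which is the source of the parallelization difficulty described above.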

In this project we aim to develop a scalable and fault-tolerant SPH kernel in the form of a mini/proxy co-design application. The SPH mini-app will be incorporated into current production codes in the fields of astrophysics (SPHYNX, ChaNGa) and CFD (SPH-flow), producing what we call the SPH-EXA versions of those codes.

The SPH-EXA project has the following main objectives:

Design parallelization methods targeted at SPH codes that can be ported to other codes in the scientific community. Parallelization of the SPH technique will involve both automatic and manual methods. For automatic parallelization, we will use advanced compilation options as well as stencil compilers. This will require rewriting parts of the SPH codes to enable vectorization and other loop transformations, as well as adapting existing stencil compilers to the SPH codes. For manual parallelization, we will employ shared-memory, accelerator-based, and task-based programming for on-node multi-threaded execution, as well as distributed-memory programming for multi-process execution across computing nodes. The goal is to expose all available parallelism at both the node and across-node levels.

Enable the scalability and dynamic load balancing of the SPH hydrodynamics technique within single compute nodes and across massive numbers of nodes. The scalable execution of SPH codes builds on the massive software parallelism exposed and expressed during (automatic and/or manual) parallelization. This will require hierarchical and/or distributed dynamic load balancing techniques to exploit the massive hardware parallelism at run time. We will employ algorithms, techniques, and tools that address the load imbalance factors arising from the (problem and algorithmic) characteristics of the three SPH codes (e.g., individual time-steps per particle) as well as from the software environments (processor speed variations, resource sharing). The goal is to minimize the load imbalance between synchronous parts of the code (e.g., gravity calculations) by dynamically distributing the load to the processors, using methods such as those described in [2,3,4,5].

Design fault-tolerance mechanisms to sustain the scalable execution of massively parallel SPH codes. The fault-tolerance mechanisms will combine dynamic fault-tolerant scheduling algorithms with methods to determine the optimal checkpointing frequency for the SPH technique on given architectures. Most importantly, we will explore the algorithm-based fault tolerance opportunities within the SPH technique to achieve fault tolerance that is portable across architectures and independent of checkpointing mechanisms. We envision providing fault-tolerant and non-fault-tolerant versions of the SPH codes, to maintain flexibility between high performance at smaller scales and fault-tolerant high performance at larger scales.

Build a repository of experiments to enable verification, reproducibility, and portability of the execution and simulation results of SPH-EXA codes. To enable verification and reproducibility of the SPH simulations, as well as to support parallel performance studies, we will use reproducibility tools to configure and run the codes. Such tools will aid in resolving software dependencies, facilitate environment configuration, automate the software build process, provide support for creating execution and post-processing scripts, and visualize the results. At present, prova! [23] (our reproducibility tool of choice) automatically provides graphs for the performance analysis, relying on Gnuplot, and will be extended to provide support for tools such as VisIt and ParaView, since visualization is extremely important when working with a huge number of particles.
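As a rough illustration of what determining a checkpointing frequency involves in the fault-tolerance objective above, the sketch below uses Young's classical first-order approximation, interval = sqrt(2 * C * M), where C is the time to write one checkpoint and M is the mean time between failures (MTBF) of the whole system. This is not the method proposed for SPH-EXA, which the summary does not specify. The function names, the assumption of independent node failures, and all numbers are hypothetical, and the approximation only holds when C is much smaller than M.

```python
import math


def system_mtbf(node_mtbf_s, n_nodes):
    """System MTBF assuming independent, identically reliable nodes
    (hypothetical simplification): it shrinks linearly with node count."""
    if node_mtbf_s <= 0 or n_nodes <= 0:
        raise ValueError("node_mtbf_s and n_nodes must be positive")
    return node_mtbf_s / n_nodes


def young_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order estimate of the compute time between checkpoints.

    Valid only when the checkpoint cost is small compared with the MTBF.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint_cost_s and mtbf_s must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Made-up numbers: 50 s per checkpoint, 10,000 s system MTBF.
interval = young_checkpoint_interval(50.0, 10_000.0)  # 1000.0 seconds
```

Since the system MTBF in this simple model falls as the node count grows, the estimated interval shrinks with the square root of the machine size, which is one reason checkpointing alone becomes costly at larger scales.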

Current SPH codes have: (1) no multi-level scheduling approaches (connecting thread- or task-level schedulers with process-level schedulers), (2) no (or limited) algorithm-based fault tolerance, and (3) very limited scalability within and across nodes (existing work achieves one or the other, but not both). As an example, the largest, most recent high-resolution simulations of galaxy formation, such as GigaERIS (Mayer, Quinn, et al., in preparation), which employ more than a billion resolution elements with individual time-steps, do not scale to more than 8,000 compute cores on state-of-the-art architectures, such as the Cray platforms. High-impact work published in the last few years employed SPH simulations, such as ERIS [22], that did not scale even to 1,000 compute cores. This degree of scalability is clearly below what we need to exploit upcoming Exascale supercomputers.
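The effect of load imbalance from individual time-steps, and what dynamic distribution of work can recover, can be seen in a toy model. The sketch below compares a static partition (equal particle counts per worker) with simple self-scheduling, in which whichever worker becomes idle first takes the next chunk. It is a simulation under assumed per-particle costs, not an implementation of the methods in [2,3,4,5] or of any scheduler in the three codes; the function names and cost values are hypothetical.

```python
import heapq


def static_makespan(costs, n_workers):
    """Split particles into contiguous equal-count blocks, one per worker,
    and return the time taken by the slowest worker."""
    block = -(-len(costs) // n_workers)  # ceiling division
    return max(sum(costs[i:i + block]) for i in range(0, len(costs), block))


def dynamic_makespan(costs, n_workers, chunk=1):
    """Simulate self-scheduling: the first worker to become idle takes the
    next chunk of particles. Returns the time at which the last one ends."""
    free_at = [0.0] * n_workers  # min-heap of times when workers are free
    heapq.heapify(free_at)
    for i in range(0, len(costs), chunk):
        t = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, t + sum(costs[i:i + chunk]))
    return max(free_at)


# Hypothetical costs: four particles on short time-steps (expensive),
# clustered together, followed by twelve cheap ones.
costs = [8, 8, 8, 8] + [1] * 12
print(static_makespan(costs, 4))   # 32
print(dynamic_makespan(costs, 4))  # 11.0
```

In this example the static partition leaves three workers idle for most of the step, while self-scheduling reaches the ideal time of total work divided by worker count. A multi-level approach would apply the same idea both across processes and across threads within a node.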

The methodology that we will use to achieve these goals is a unique combination of (1) state-of-the-art parallelization and fault-tolerance methods from computer science, (2) state-of-the-art SPH techniques and expertise from physics, and (3) expertise in high-performance computing on state-of-the-art computing architectures.

The expected outcome of this project is an open-source SPH mini-app that will enable highly parallelized, scalable, and fault-tolerant production SPH codes in different scientific domains (represented via SPHYNX, ChaNGa, and SPH-flow in its very first application). Addressing the performance and scalability challenges of SPH codes requires a versatile collaboration with, and support from, supercomputing centers such as CSCS, so that our results can be taken into account in the design of next-generation HPC infrastructures.

The success of the project will be measured by the improvements achieved, over their current levels, in the speed-up, fault tolerance, flexibility (in terms of numerical techniques), and portability of the SPH-EXA codes.

In summary, this project addresses the challenge of rendering the SPH technique and SPH-based simulation codes scalable to future Exascale computing systems. We target the performance, portability, scalability, and fault tolerance of three SPH codes on next-generation supercomputers, such as those at CSCS, which are expected to contain hybrid CPU-MIC-accelerator architectures and a high-end interconnection fabric.