Programming Challenges Workshop

Challenges To Be Addressed:

Exascale systems may have one million nodes, each with up to one thousand cores, for roughly one billion cores in total. Such systems present both opportunities and challenges to scientific applications and to the software stack, which must be able to express and manage up to a billion separate threads, a factor of 10,000 more than on current platforms. For these systems, weak scaling alone will not achieve these levels of concurrency, since the available memory will not increase by the same factor. We will witness a radical switch from bulk-synchronous to asynchronous computing, driven mostly by the need to operate on irregular data structures and by the opportunities offered by asynchronous execution environments. Asynchronous tasks will also be needed to hide latency and variability among cores, which may increase the required level of concurrency by a further factor of 10 for general-purpose cores and 100 for accelerator cores. Energy-efficiency constraints and growth in explicit on-chip parallelism will require a mass migration to new algorithms and software architectures that is as broad and disruptive as the migration from vector to parallel computing systems 15 years ago.

Programmability of Exascale systems must be substantially improved. Historically, the DOE community has not demanded programmability of HPC machines as a top priority: each machine generation has required extensive performance tuning, algorithmic changes, and code restructuring. With the significant increase in complexity of Exascale platforms due to energy-constrained billion-way parallelism and heterogeneity, we must avoid repeating this trend and commit to a radical change in our abstractions for machines, their execution models, programming models, language constructs, and compilers, so as to obtain software that is portable and scalable across multiple generations of future HPC hardware.

The current MPI+Fortran/C/C++ programming environment has sustained HPC application software development for the past decade, but it was designed in an era of single-threaded nodes and architected for coarse-grained concurrency dominated by bulk-synchronous algorithms. While it is theoretically possible to execute a separate MPI process on each core, that will not be a viable approach on Exascale hardware. With the end of clock-frequency scaling and with memory capacity lagging processor growth, applications and algorithms will increasingly need to rely on fine-grained parallelism, strong scaling, and improved support for fault resilience on increasingly heterogeneous and hierarchical hardware.

Existing classes of programming models, such as message passing, global address space (GAS), and global task space (GTS), will undergo a metamorphosis as they are modified to support new hardware architectures. There is an urgent need both to extend existing approaches and to identify new programming models and language constructs that can express fine-grained asynchronous parallelism, achieving performance, programmability, and efficiency in the face of these disruptive technology changes while simultaneously meeting the needs of an evolving and increasingly unstructured application base.

Enabling scientists to focus on their science rather than on the fine details of a complex exascale system is essential. Exascale applications will become a collaborative effort, requiring a hierarchy of developers for the different levels of programming focus. Programs at a given level of abstraction will be optimized without knowledge of how lower layers are implemented. Mapping the high-level constructs devised by domain scientists onto machine-optimized constructs will require novel work on programming-language components and on compilers.

Programs for exascale computing should be portable by design ("write once, run everywhere") and written to scale across multiple hardware generations. This requires a machine abstraction for exascale computing such that compilers (or whatever mechanisms automatically translate semantic operations) can explore performance alternatives.