The design of parallel programming models that achieve a good trade-off between productivity and efficiency, while maintaining performance portability and cost transparency, remains a challenging task. Similarly, parallel runtime cost modeling is essential for application and architecture design, as well as performance optimization; however, cost accuracy remains limited when modeling the effect of bandwidth bottlenecks for globally unbalanced communication.
This dissertation proposes a hybrid dataflow model (CGD) that combines the simplicity and elegance of dataflows with the good performance scalability of Single Program Multiple Data (SPMD) computations. Benchmark analysis shows that the CGD model increases productivity while matching or exceeding the performance of the MPI and pthreads models. The dissertation also presents a hierarchical bandwidth machine model (&alpha;DBSP) that can estimate the execution time of CGD collective communication by naturally extending and improving the Decomposable Bulk Synchronous Parallel (DBSP) model.
The CGD model is a dataflow graph with SPMD computation nodes and data-structure decomposition data nodes. It exploits dataflow semantics to express data and task parallelism at a high level, and relies on imperative languages to express efficient sequential computations. Data and computation partitioning and assignment are explicit, while communication, synchronization, and machine-specific optimizations are handled automatically.
This dissertation introduces a coordination language with dataflow semantics that implements the CGD model, and presents several applications and their optimizations implemented in this language. The CGD runtime supports MPI, SHMEM, and pthreads, running on both shared-memory and cluster machines. Results from a 128-processor SGI Altix 4700 system show that the optimized CGD FT outperforms NPB2.3 MPI by 27%, the optimized CGD stencil is 41% faster than handwritten MPI, and the CGD Barnes-Hut particle simulation improves on SPLASH2 by 14%.
The &alpha;DBSP model extends DBSP by associating a bandwidth growth factor &alpha; with message patterns, improves on DBSP in terms of execution time, and helps machine bandwidth budgeting by estimating the hierarchical bandwidth that an application requires. Consequently, for some globally unbalanced problems the &alpha;DBSP analysis is more accurate, and sometimes simpler. For example, the single-element nearest-neighbor message exchange running on a pruned butterfly requires O(log^{3}(p)) on &alpha;DBSP vs. O(\sqrt{p}) on DBSP, while optimally modeling the one-to-all broadcast requires a single communication step on &alpha;DBSP vs. O(log(p)) steps on DBSP. We present three scientific computing kernels that illustrate the differences between &alpha;DBSP and DBSP analysis.