Abstract. P-RISC graphs are parallel, structured, executable control-flow
graphs that model locality properties of distributed memory machines.
P-RISC graphs have coarse-grain parallelism for distributing
work to processors, and fine-grain parallelism to overlap the latencies of
remote accesses and synchronization. These properties support dynamic
and irregular parallelism. In this paper, we describe P-RISC graphs, how
they are used to implement the implicitly parallel programming language
Id, and how they are implemented on distributed memory machines. We
also contrast our approach to Berkeley's TAM, a similar system.

1 Introduction

HPF [4] is an example of a trend back towards shared memory programming
models, even though massively parallel machines are implemented with distributed
memories. This approach is regaining popularity because it is now
widely recognized that writing explicit message-passing programs is extremely
hard even for a fixed configuration, and even harder if it is to be portable and
tunable. However, for good performance on distributed memory systems, locality
is very important. Today, it is infeasible to achieve completely automatic
distribution of data and processes for good locality. The HPF solution is to rely
on the programmer to specify the data distribution of arrays. The compiler analyzes
this information and uses it to distribute loop iterations close to their data,
to batch communications (reducing overhead), and to overlap communication with
computation (reducing idling) [5].
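As a concrete illustration (ours, not taken from the paper or the HPF specification), the HPF strategy of distributing loop iterations close to their data can be sketched as an "owner computes" rule over a BLOCK-distributed array; the function names and parameters below are hypothetical:

```python
# Sketch: HPF-style BLOCK distribution with "owner computes" scheduling.
# (Illustrative only; names and layout are our assumptions.)

def block_owner(i, n, p):
    """Processor that owns element i of an n-element array over p processors."""
    block = (n + p - 1) // p          # ceiling(n / p) elements per block
    return i // block

def local_iterations(proc, n, p):
    """Loop iterations assigned to 'proc' so that each touches only local data."""
    block = (n + p - 1) // p
    lo = proc * block
    hi = min(n, lo + block)
    return range(lo, hi)

# Example: 10 elements distributed over 4 processors.
n, p = 10, 4
print(block_owner(9, n, p))             # element 9 lives on processor 3
print(list(local_iterations(1, n, p)))  # processor 1 runs iterations 3..5
```

Because the distribution is declared statically, the compiler can compute such ownership functions at compile time; the rest of this section argues why that breaks down for irregular programs.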

We are interested in more general shared memory programs where data structures
may be irregular (trees, graphs, sparse arrays, ...), may be dynamically allocated,
and may have data-dependent shapes and sizes. Further, such programs
obtain much parallelism from producer-consumer situations, where a data structure
is consumed by one process even as it is being built by another. In this
environment, the compiler cannot orchestrate process and data placement in the
manner of HPF compilers. Almost any individual data access may be remote,
and may need synchronization.
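The producer-consumer situation described above can be sketched with per-element synchronization in the style of I-structures, where a read of an element blocks until some producer has written it. This is our illustration (the `IStructure` class and its methods are hypothetical, not an API from the paper):

```python
# Sketch: a data structure consumed by one thread even as it is being
# built by another, with per-element synchronization (I-structure style).
import threading

class IStructure:
    """Write-once array: get(i) blocks until put(i, v) has occurred."""
    def __init__(self, n):
        self._vals = [None] * n
        self._full = [threading.Event() for _ in range(n)]

    def put(self, i, v):
        self._vals[i] = v
        self._full[i].set()       # wake any consumers waiting on slot i

    def get(self, i):
        self._full[i].wait()      # block until a producer fills slot i
        return self._vals[i]

xs = IStructure(5)
out = []

def producer():
    for i in range(5):
        xs.put(i, i * i)

def consumer():
    out.extend(xs.get(i) for i in range(5))

t1 = threading.Thread(target=consumer)
t2 = threading.Thread(target=producer)
t1.start(); t2.start()
t1.join(); t2.join()
print(out)   # [0, 1, 4, 9, 16]
```

Note that the consumer may reach an element before the producer has written it; in this sketch the consumer's thread simply blocks, whereas the multithreading approach described next multiplexes the processor to other work instead.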

Our general approach is based on multithreading. Since it is generally not
known until runtime whether a particular data access is local or remote, or
whether it will block (consumer ahead of producer), the processor must be prepared
to multiplex efficiently to another thread instead of idling. But this raises