Compilation for Scalable, Paged Virtual Hardware

We are investigating automatic compilation for a hybrid processing architecture based on the SCORE compute model [1]. SCORE targets scalable reconfigurable hardware, enabling applications to exploit the additional hardware of next-generation devices automatically, without recompilation. SCORE implements a paged virtualization model, in which a spatial computation is partitioned into compute pages (analogous to virtual memory pages) that are automatically loaded into available physical pages by a run-time scheduler. We have developed a parallel programming model for SCORE that supports arbitrarily large spatial computations, and we are presently developing a compiler to map this programming model to a paged hardware execution model.

We have developed an intermediate language, called TDF, for describing SCORE computations as dataflow graphs of communicating finite state machines (FSMs). The graph form provides an explicit specification of task-level parallelism and stream-based communication. The programming model is equivalent in power to dynamic dataflow models such as BDF [2]. We have developed a SCORE compiler, presently consisting of a TDF front-end, limited compiler optimizations, and a back-end that emits executable C++ simulation code. We are extending the compiler to include more sophisticated optimizations (to exploit instruction-level parallelism) and automatic page generation.
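The following sketch suggests the flavor of the C++ simulation code the back-end might emit for a single TDF operator: an FSM that fires when an input token is available on its stream, producing an output token and taking a state transition. The operator, its states, and the stream representation are all hypothetical, not the compiler's actual output.

```cpp
// Illustrative model of a TDF operator: an FSM communicating over FIFO
// streams. This toy operator alternates between two states, doubling
// inputs in one state and negating them in the other.
#include <queue>

using Stream = std::queue<int>;  // FIFO stream of integer tokens

struct ToggleFsm {
    enum State { DOUBLE, NEGATE } state = DOUBLE;

    // Fire once if an input token is available; returns true if it fired.
    bool step(Stream& in, Stream& out) {
        if (in.empty()) return false;
        int v = in.front();
        in.pop();
        out.push(state == DOUBLE ? 2 * v : -v);
        state = (state == DOUBLE) ? NEGATE : DOUBLE;  // state transition
        return true;
    }
};
```

Because firing is conditioned on token availability and behavior depends on FSM state, such operators can express the data-dependent rates that place the model beyond static dataflow, consistent with its BDF-class expressiveness.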

The goal of automatic page generation is to partition and/or cluster TDF FSM operators into fixed-size hardware pages under area and I/O constraints, in a manner that is robust to hardware virtualization. Under virtualization, a pair of communicating pages may not execute simultaneously, requiring stream communication to be buffered in memory. Partitioning for efficient execution under virtualization involves a tradeoff among several desirable properties: (1) minimized FSM latency, (2) minimized stream I/O bandwidth, and (3) avoidance of inter-page feedback loops. Our approach involves first hoisting control-independent datapath operations out of FSMs, where possible, and then partitioning/clustering the resulting pipelines using known techniques for synchronous circuits. We are presently exploring techniques for partitioning the remaining control-intensive FSMs and their associated datapaths.
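To illustrate one corner of the tradeoff space, here is a deliberately simple greedy clustering sketch: operators are assigned to fixed-capacity pages in topological order, preferring a page that already holds a predecessor so that fewer streams cross page boundaries (and hence need memory buffering under virtualization). The graph representation, area model, and heuristic are all assumptions for illustration; our actual partitioner is still under development.

```cpp
// Hedged sketch: greedy clustering of a dataflow graph into fixed-capacity
// pages, minimizing cut (inter-page) streams as a proxy for stream I/O
// bandwidth. Node ids are assumed to be in topological order.
#include <utility>
#include <vector>

struct Graph {
    int n;                                   // number of operators
    std::vector<std::pair<int, int>> edges;  // streams (src, dst), src < dst
};

// Returns a page index for each node; `page_area` is the per-page
// capacity in operators (a stand-in for a real area model).
std::vector<int> cluster(const Graph& g, int page_area) {
    std::vector<int> page(g.n, -1);
    std::vector<int> used;  // operators placed in each page so far
    for (int v = 0; v < g.n; ++v) {
        int best = -1;
        for (auto [s, d] : g.edges)  // prefer a predecessor's page
            if (d == v && page[s] >= 0 && used[page[s]] < page_area)
                best = page[s];
        if (best < 0) {              // open a fresh page
            used.push_back(0);
            best = (int)used.size() - 1;
        }
        page[v] = best;
        ++used[best];
    }
    return page;
}

// Count streams crossing page boundaries (buffered under virtualization).
int cut_streams(const Graph& g, const std::vector<int>& page) {
    int cut = 0;
    for (auto [s, d] : g.edges) cut += (page[s] != page[d]);
    return cut;
}
```

This sketch ignores latency and feedback avoidance entirely; in practice, a greedy cut-minimizing pass can create exactly the inter-page feedback loops the tradeoff warns against, which is why balancing all three properties is the harder, open part of the problem.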