Harnessing Adaptivity Analysis for the Automatic Design of Efficient Embedded and HPC Systems

Supervisor:

FERRANDI FABRIZIO

Thesis abstract:

Embedded Systems (ESs) and High-Performance Computing (HPC) systems belong to two distinct areas of Information Technology (IT). On one side, embedded systems are usually designed as hardware implementations of a specific functionality, with the aim of either reducing power consumption or accelerating the most frequently executed portion of an application. On the other side, high-performance computing systems are designed as massively parallel supercomputers with tens of thousands of processors, usually employed to solve complex, highly parallel scientific problems. Because of the distinct nature of the two domains, these systems have historically evolved along different trends. However, many issues and design challenges have recently arisen that affect both domains. For example, increasing computational demands and higher degrees of parallelism, together with the growing capabilities of silicon technology, have led to the design of increasingly complex architectures and applications. As a consequence, modern embedded systems exploit the potential of hundreds or thousands of processing units, often heterogeneous and physically distributed, which run in parallel on many-core platforms. These same factors have led to massively parallel systems in the HPC domain. As another example, power constraints, which have always been a critical requirement in embedded systems design, are becoming increasingly relevant in the high-performance domain as well. As a consequence, communication minimization techniques are now also required in the design of modern HPC systems. This trend suggests that HPC design techniques can be exploited, at a different abstraction level, by embedded systems designers to address shared issues, and vice versa. In other words, designing modern supercomputers, as well as modern embedded systems, requires a holistic approach that relies on tightly coupled hardware-software co-design methodologies.
Among the numerous issues shared between the ES and HPC domains, this thesis focuses on unknown, uncertain and unpredictable behaviors. Modern computing systems have to deal with such behaviors, which are often related to incomplete information at design or compile time. Unknown, uncertain and unpredictable information can originate from different sources. For example, the interaction with external modules (e.g., sensors, IPs, memories) generates variable and possibly unpredictable communication latency. Moreover, high-level applications are inherently uncertain (e.g., unknown number of loop iterations, unknown values of the incoming arguments of a function, unknown outcome of the evaluation of conditional instructions, or unpredictable memory accesses). Incomplete information at design or compile time often leads to suboptimal designs: to prevent system failures, the designer must either consider all possible scenarios, with a consequent increase in the area and complexity of the circuit, or take a conservative approach, with a consequent decrease in performance. Moreover, such static approaches are error prone, since correctness cannot be guaranteed in all situations (such as run-time events that the designer cannot predict, or technology parameters that change unpredictably during the life cycle of the device, for example due to component degradation). In this scenario, systems able to dynamically adjust their behavior at run-time appear to be good candidates for the next generation of computing in both domains. This work aims at defining efficient design methodologies for modern embedded systems and high-performance computing systems that are able to deal with unknown, uncertain and unpredictable behaviors. For this purpose, this thesis introduces the concept of Adaptivity Analysis, which provides a formal approach to studying the adaptivity properties of applications.
Given the different scale of the problem in the ES and HPC domains, Adaptivity Analysis is defined at two distinct abstraction levels. In the embedded systems domain, Adaptivity Analysis addresses the problem of designing efficient adaptive hardware cores. More specifically, this analysis identifies, at design-time, the conditions that each instruction (or group of instructions) must satisfy in order to execute, according to its dependences. Such conditions, called Activating Conditions (ACs), are logic formulas. Since some information may be unknown at design-time, the ACs provide parametric results, which depend on conditions that will be resolved at run-time (e.g., the outcome of the evaluation of a conditional instruction). At run-time, as soon as the unknown information is resolved, the ACs are evaluated. Once an AC is satisfied, the corresponding instruction can safely execute. The ACs thus provide an explicit activation mechanism for the instructions, which enables their dynamic auto-scheduling. This scheduling technique, called dynamic AC-scheduling, provides support for the High-Level Synthesis (HLS) of adaptive hardware cores. Moreover, this thesis defines a prototype for this kind of system. Furthermore, this work defines a suitable Intermediate Representation (IR), called Extended Program Dependence Graph (EPDG), able to provide a parallel execution model for adaptive hardware cores. Finally, the dynamic AC-scheduling has been implemented as part of the PandA framework, thus providing a fully automated design methodology. In the HPC domain, Adaptivity Analysis is applied to the automatic, compiler-based generation of parallel irregular applications. In HPC design, run-time systems are responsible for managing unknown, uncertain and unpredictable behaviors. Hence, Adaptivity Analysis is used in this context as a compiler technique that supports run-time systems.
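To make the Activating Condition mechanism described above for the embedded domain more concrete, the following is a minimal software sketch, not the thesis implementation: the thesis targets hardware cores generated through HLS in the PandA framework, whereas this is a plain Python simulation. The names `Instruction`, `ac_schedule` and `run_example`, as well as the four-instruction example program, are assumptions introduced only for illustration. Each instruction carries an AC (a Boolean formula over completed instructions and resolved branch outcomes), and at every step all instructions whose AC holds are fired together.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Instruction:
    name: str
    ac: Callable[[], bool]      # Activating Condition, evaluated at run time
    action: Callable[[], None]  # effect of the instruction on the data values


def ac_schedule(program: List[Instruction], done: Set[str]) -> List[List[str]]:
    """Fire, step by step, every pending instruction whose AC holds.

    Each inner list of the result models one step in which all ready
    instructions execute together. Instructions whose AC never becomes
    true (e.g. the branch not taken) are simply never executed.
    """
    pending = list(program)
    steps: List[List[str]] = []
    while pending:
        ready = [ins for ins in pending if ins.ac()]
        if not ready:
            break  # no AC can be satisfied any more
        for ins in ready:
            ins.action()
        for ins in ready:
            done.add(ins.name)
            pending.remove(ins)
        steps.append([ins.name for ins in ready])
    return steps


def run_example(x: int) -> Tuple[List[List[str]], Dict[str, int]]:
    """Hypothetical program: c = x > 0; y = 2*x if c else -x; z = y + 1."""
    v: Dict[str, int] = {"x": x}
    done: Set[str] = set()
    # Deliberately listed in reverse textual order: the execution order
    # is derived from the ACs alone, not from the position in the list.
    program = [
        Instruction("add", lambda: "then" in done or "else" in done,
                    lambda: v.update(z=v["y"] + 1)),
        Instruction("else", lambda: "cmp" in done and not v["c"],
                    lambda: v.update(y=-v["x"])),
        Instruction("then", lambda: "cmp" in done and bool(v["c"]),
                    lambda: v.update(y=v["x"] * 2)),
        Instruction("cmp", lambda: True,
                    lambda: v.update(c=v["x"] > 0)),
    ]
    return ac_schedule(program, done), v


if __name__ == "__main__":
    for x in (3, -4):
        steps, values = run_example(x)
        print(x, steps, values["z"])
```

The ACs of `then` and `else` are parametric in the outcome `c` of the comparison, which is unknown until `cmp` has executed; this mirrors the idea that ACs are computed at design-time and evaluated only once the run-time information is resolved.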
Within the HPC domain, this work focuses on the automatic parallelization of a particular class of applications, known as Irregular Applications. Because of their irregular nature (i.e., irregular data structures, irregular control flow, irregular accesses to the communication network), parallelizing irregular applications is particularly challenging. This thesis introduces the Yet Another Parallel Programming Approach (YAPPA) compilation framework. YAPPA aims at efficiently parallelizing irregular applications running on commodity clusters. The novel parallel programming approach supported by YAPPA is defined by the underlying run-time library, called Global Memory and Threading (GMT). GMT combines different parallel programming approaches that are usually exploited separately in the literature: it integrates a Global Address Space (GAS) across cluster nodes, lightweight multithreading for tolerating memory and network latencies, and support for a fork/join programming model. YAPPA extends the LLVM compiler with a set of transformations and optimizations, which first instrument the sequential code to invoke GMT primitives (parallelization phase), and then apply a novel set of transformations to improve the efficiency of the generated parallel code (optimization phase).
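As a rough illustration of what an irregular application expressed in a fork/join style can look like, the sketch below runs a level-synchronous breadth-first search over a graph stored as adjacency lists. It does not use GMT or YAPPA and makes no claim about their APIs: it relies only on Python's standard `concurrent.futures` thread pool, runs on a single node without any global address space, and the function name `parallel_bfs` is an assumption made for this example. The irregularity comes from the data-dependent frontier size and the unpredictable accesses to the adjacency lists.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def parallel_bfs(graph: Dict[int, List[int]], source: int,
                 workers: int = 4) -> Dict[int, int]:
    """Level-synchronous BFS: fork one task per frontier vertex, join per level."""
    dist: Dict[int, int] = {source: 0}
    frontier: List[int] = [source]
    level = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Fork: one task per frontier vertex fetches its neighbours.
            # The number of tasks is only known at run time.
            futures = [pool.submit(graph.get, v, []) for v in frontier]
            # Join: wait for every task, then merge the results sequentially
            # so that the distance map is updated without data races.
            next_frontier: List[int] = []
            for fut in futures:
                for w in fut.result():
                    if w not in dist:
                        dist[w] = level + 1
                        next_frontier.append(w)
            frontier = next_frontier
            level += 1
    return dist


if __name__ == "__main__":
    g = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
    print(parallel_bfs(g, 0))
```

In a cluster setting the neighbour lookups would become remote memory accesses, which is where lightweight multithreading for latency tolerance, as described above, becomes relevant.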