"... Thread-Level Speculation (TLS) allows us to automatically parallelize general-purpose programs by supporting parallel ex-ecution of threads that might not actually be independent. In this paper, we show that the key to good performance lies in the three different ways to communicate a value between ..."

Thread-Level Speculation (TLS) allows us to automatically parallelize general-purpose programs by supporting parallel execution of threads that might not actually be independent. In this paper, we show that the key to good performance lies in the three different ways to communicate a value between speculative threads: speculation, synchronization, and prediction. The difficult part is deciding how and when to apply each method. This paper shows how we can apply value prediction, dynamic synchronization, and hardware instruction prioritization to improve value communication and hence performance in several SPECint benchmarks that have been automatically transformed by our compiler to exploit TLS. We find that value prediction can be effective when properly throttled to avoid the high costs of misprediction, while most of the gains of value prediction can be more easily achieved by exploiting silent stores. We also show that dynamic synchronization is quite effective for most benchmarks, while hardware instruction prioritization is not. Overall, we find that these techniques have great potential for improving the performance of TLS.

... speculative threads is through the prediction of memory values. But which loads should we predict? A simple approach would be to predict every load for which the predictor is confident. Previous work [4, 11] shows that focusing prediction on critical path instructions is important for uniprocessor value prediction when modeling realistic misprediction penalties. Similarly, the cost of misprediction in TL...
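The throttling idea described above — predict only when a confidence counter clears a threshold, and reset hard on a misprediction — can be sketched as follows. This is an illustrative last-value predictor, not the mechanism from the cited paper; the table organization, threshold, and counter width are assumptions.

```python
class ThrottledValuePredictor:
    """Last-value predictor gated by a saturating confidence counter."""

    def __init__(self, threshold=3, max_conf=7):
        self.table = {}              # load PC -> (last_value, confidence)
        self.threshold = threshold   # minimum confidence to emit a prediction
        self.max_conf = max_conf     # saturation point of the counter

    def predict(self, pc):
        """Return a predicted value, or None when not confident enough."""
        value, conf = self.table.get(pc, (None, 0))
        return value if conf >= self.threshold else None

    def update(self, pc, actual):
        """Train on the architecturally correct value after the load retires."""
        value, conf = self.table.get(pc, (None, 0))
        if value == actual:
            conf = min(conf + 1, self.max_conf)  # value is stable: strengthen
        else:
            conf = 0                             # value changed: throttle hard
        self.table[pc] = (actual, conf)
```

With `threshold=3`, a load must return the same value several times before the predictor is willing to speculate on it, which is one way to keep misprediction costs bounded.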

"... This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized control. SC circuits are optimized for wires at the ..."

This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized control. SC circuits are optimized for wires at the expense of computation units. In this paper we investigate a particular implementation of SC: ASH (Application-Specific Hardware). Under the assumption that computation is cheaper than communication, ASH replicates computation units to simplify interconnect, building a system which uses very simple, completely dedicated communication channels. As a consequence, communication on the datapath never requires arbitration; the only arbitration required is for accessing memory. ASH relies on very simple hardware primitives, using no associative structures, no multiported register files, no scheduling logic, no broadcast, and no clocks. As a consequence, ASH hardware is fast and extremely power efficient.

...scalar processors, since there are virtually no resource constraints (except the LSQ bandwidth). We have developed two methods for discovering and visualizing the dynamic critical path, both based on [FRB01]. • The first method is based on capturing and post-processing a complete execution trace of a single selected procedure. A simulator is instrumented to dump all relevant events affecting operations i...
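The post-processing step described above can be sketched in miniature: given dumped events with their completion times and dependence edges, the dynamic critical path falls out of a backward walk from the last-finishing event, following the last-arriving incoming edge at each step. The event names and timings below are invented for illustration and are not taken from any trace in the cited work.

```python
def critical_path(finish_time, preds):
    """Recover the dynamic critical path from a trace.

    finish_time: event -> completion time
    preds:       event -> list of predecessor events (dependence edges)
    """
    # start from the event that completed last
    node = max(finish_time, key=finish_time.get)
    path = [node]
    while preds.get(node):
        # the predecessor that finished last is the one that
        # actually determined when `node` could fire
        node = max(preds[node], key=lambda p: finish_time[p])
        path.append(node)
    return list(reversed(path))
```

For example, if an execute event waited on two fetches and one of them arrived last, only that fetch appears on the recovered path.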

"... We show how the global critical path can be used as a practical tool for understanding, optimizing and summarizing the behavior of highly concurrent self-timed circuits. Traditionally, critical path analysis has been applied to DAGs, and thus was constrained to combinatorial sub-circuits. We formall ..."

We show how the global critical path can be used as a practical tool for understanding, optimizing and summarizing the behavior of highly concurrent self-timed circuits. Traditionally, critical path analysis has been applied to DAGs, and thus was constrained to combinatorial sub-circuits. We formally define the global critical path (GCP) and show how it can be constructed using only local information that is automatically derived directly from the circuit. We introduce a form of Production Rules, which can accurately determine the GCP for a given input vector, even for modules which exhibit choice and early termination. The GCP provides valuable insight into the control behavior of the application, which helps in formulating new optimizations and re-formulating existing ones to use the GCP knowledge. We have constructed a fully automated framework for GCP detection and analysis, and have incorporated this framework into a high-level synthesis tool-chain. We demonstrate the effectiveness of the GCP framework by re-formulating two traditional CAD optimizations to use the GCP—yielding efficient algorithms which improve circuit power (by up to 9%) and performance (by up to 60%) in our experiments.

...GCP) that generalizes the critical path to encompass the entire execution of an arbitrarily complex circuit for a given input data set. Our work is based on the methodology proposed by Fields et al. [12], which analyzes the performance of superscalar processors. To extrapolate the notion of local criticality, Fields decomposes a circuit into black-box modules and shows informally how an approximation...

Keywords: high-level synthesis, spatial computing, slack matching, operation chaining, asynchronous latch circuits, transaction level modeling, timing update

This dissertation presents a System-Level Timing Analysis (SLTA) methodology and a micro-architectural optimization framework for use within hardware compilation. As the EDA abstraction layer of preference is raised to Electronic System Level (ESL), the focus is on describing systems using Transaction Level Modeling (TLM) [CG03, Pas02, Ede06], which is amenable to high-level synthesis. The proposed SLTA methodology and ESL optimization framework is designed to complement TLM-based synthesis flows by analyzing the sequential dependency behavior of system-level transactions. Using this knowledge, control-path-altering microarchitecture optimizations are applied iteratively on a well-defined hardware Intermediate Representation (IR). There are two over-arching contributions in this dissertation. First, we describe an Intermediate Representation (IR) as a valuable addition to the infrastructure of a hardware compiler. The IR captures data/control dependencies in the source program as well as resource dependencies of the underlying circuit architecture. The IR is an abstraction of transaction events in the TLM but is also

...ying control-path altering optimizations within a hardware compiler or ESL toolflow. A further difference is in the granularity of the event models used to compute the critical path. The Fields model [FRB01] captures the entire microprocessor architecture using three essential components: fetch, execute and commit, although the micro-architectural and RTL model is much more complex. In contrast, the EBM ...

"... Abstract. Exposing more instruction-level parallelism in out-of-order superscalar processors requires increasing the number of dynamic in-flight instructions. However, large instruction windows increase power consumption and latency in the issue logic. We propose a design called Hybrid Dataflow Grap ..."

Abstract. Exposing more instruction-level parallelism in out-of-order superscalar processors requires increasing the number of dynamic in-flight instructions. However, large instruction windows increase power consumption and latency in the issue logic. We propose a design called Hybrid Dataflow Graph Execution (HeDGE) for conventional Instruction Set Architectures (ISAs). HeDGE explicitly maintains dependences between instructions in the issue window by modifying the issue, register renaming, and wakeup logic. The HeDGE wakeup logic notifies only consumer instructions when data values arrive. Explicit consumer encoding naturally leads to the use of Random Access Memory (RAM) instead of Content Addressable Memory (CAM) needed for broadcast. HeDGE is distinguished from prior approaches in part because it dynamically inserts forwarding instructions. Although these additional instructions degrade performance by an average of 3% to 17% for SPEC C and Fortran benchmarks and 1.5% to 8% for DaCapo Java benchmarks, they enable energy efficient execution in large instruction windows. The HeDGE RAM-based instruction window consumes on average 98% less energy than a conventional CAM as modeled in CACTI for 70nm technology. In conventional designs, this structure contributes 7% to 20% to total energy consumption. HeDGE allows us to achieve power and energy gains by using RAMs in the issue logic while maintaining a conventional instruction set.
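The explicit consumer encoding described in the abstract — wake only the recorded consumers of a completing producer, rather than broadcasting a tag to every window entry — can be sketched as follows. This is a behavioral illustration under assumed field names and window layout, not HeDGE's actual hardware organization.

```python
class ExplicitWakeupWindow:
    """Issue window where each producer carries an explicit consumer list,
    so completion touches only those slots (RAM-indexed) instead of
    broadcasting to every entry (CAM search)."""

    def __init__(self, size):
        self.pending = [0] * size                    # unready source operands per slot
        self.consumers = [[] for _ in range(size)]   # producer slot -> consumer slots

    def add(self, slot, num_sources):
        """Install an instruction waiting on num_sources operands."""
        self.pending[slot] = num_sources

    def link(self, producer, consumer):
        """Record at rename time that `consumer` reads `producer`'s result."""
        self.consumers[producer].append(consumer)

    def complete(self, slot):
        """Producer finishes: decrement only its consumers; return newly ready slots."""
        ready = []
        for c in self.consumers[slot]:
            self.pending[c] -= 1
            if self.pending[c] == 0:
                ready.append(c)
        return ready
```

The key property is that `complete` does work proportional to the consumer list, not to the window size, which is what makes a RAM implementation natural.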

...gic chooses candidates from ready instructions for execution based on availability of execution units and other policy considerations such as age of the instruction and criticality of the instruction [7]. Instructions selected for execution read values from the register file in the next clock cycle. The hardware forwards values read from the physical register to appropriate functional units in the EX...

"... A current trend in high-performance superscalar processors is toward simpler designs that attempt to strike a balance between clock frequency, instruction-level parallelism, and power consumption. To achieve this goal, the thesis research reported here advocates a microarchitecture and design paradi ..."

A current trend in high-performance superscalar processors is toward simpler designs that attempt to strike a balance between clock frequency, instruction-level parallelism, and power consumption. To achieve this goal, the thesis research reported here advocates a microarchitecture and design paradigm that rely less on low-level speculation techniques and more on simpler, modular designs with distributed processing at the instruction level, i.e., instruction-level distributed processing (ILDP). This thesis shows that designing a hardware/software co-designed virtual machine (VM) system using an accumulator-oriented instruction set architecture (ISA) and microarchitecture is a good approach for implementing complexity-effective, high-performance out-of-order superscalar machines. The following three key points support this conclusion: • An accumulator-oriented instruction format and microarchitecture fit today’s technology constraints better than conventional design approaches: The ILDP ISA format assigns temporary values that account for most of the register communication to a small number of accumulators. As a result, the complexity of the register file and associated hardware

...s the value from the local accumulator. Another important point is that only those global value communications that happen to lie on the program critical path can affect the program execution cycles [82]. That is, many register value communications can be hidden by parallel execution of multiple strands. 2.2 Strand: A Single Chain of Dependent Instructions A strand is a single chain of dependent i...
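The strand idea above — keep a chain of dependent instructions on one accumulator, and pay communication cost only at inter-strand boundaries — can be sketched with a greedy partitioner. The heuristic below (extend a strand only while the producer is still the strand's tail) is an illustrative simplification, not the ILDP compiler's actual algorithm.

```python
def form_strands(insns):
    """Greedily group instructions into strands (chains of dependents).

    insns: list of (name, deps) in program order.
    Returns a list of strands, each a list of instruction names.
    Any dependence that crosses strands would be a global value
    communication; dependences inside a strand stay in the accumulator.
    """
    strands = []
    tail_of = {}    # producer name -> strand index, while still the tail
    for name, deps in insns:
        for d in deps:
            if d in tail_of:
                idx = tail_of.pop(d)       # producer is a live tail: extend
                strands[idx].append(name)
                tail_of[name] = idx
                break
        else:
            strands.append([name])         # no extendable producer: new strand
            tail_of[name] = len(strands) - 1
    return strands
```

In the diamond `a -> {b, c} -> d`, this yields the strand `[a, b, d]` plus a second strand `[c]`, so only the value from `c` to `d` needs global communication.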

"... I wish to thank the multitudes of people who helped me along the way to completing this dissertation. I thank my committee, Doug Burger, Kathryn McKinley, Steve Keckler, Calvin Lin, and Steve Reinhardt, for their valuable feedback on my research. I am particularly indebted to my advisor, Doug Burger ..."

I wish to thank the multitudes of people who helped me along the way to completing this dissertation. I thank my committee, Doug Burger, Kathryn McKinley, Steve Keckler, Calvin Lin, and Steve Reinhardt, for their valuable feedback on my research. I am particularly indebted to my advisor, Doug Burger, for his supervision of this work. I also thank several colleagues and friends who helped me along the way to completing this work including

...synchronization behaviors, which reflect application performance, and thread deadlocks, which reveal errors in applications. In Chapter 2, we study monitoring of thread microexecutions. Fields et al. [15, 16] proposed a directed acyclic graph (DAG) model for characterizing the microexecution of single-threaded programs on uniprocessors. Based on this model, they quantified the criticality of an instructio...
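One standard way to quantify criticality in such a DAG model is via slack: a node is on a critical path exactly when its earliest and latest possible finish times coincide. The sketch below illustrates this with a toy dependence graph; the node names, latencies, and the assumption that the node dictionary is listed in topological order are all illustrative, not taken from the cited model.

```python
def criticality(latency, preds):
    """Return the set of zero-slack (critical) nodes of a DAG.

    latency: node -> cost, with nodes listed in topological order
    preds:   node -> list of predecessor nodes
    """
    # forward pass: longest-path finish time at each node
    arrive = {}
    for n in latency:
        arrive[n] = max((arrive[p] for p in preds.get(n, [])), default=0) + latency[n]
    total = max(arrive.values())

    # invert the edges for the backward pass
    succs = {}
    for n, ps in preds.items():
        for p in ps:
            succs.setdefault(p, []).append(n)

    # backward pass: latest finish time that avoids stretching the total
    latest = {}
    for n in reversed(list(latency)):
        latest[n] = min((latest[s] - latency[s] for s in succs.get(n, [])),
                        default=total)

    # zero slack means the node lies on a critical path
    return {n for n in latency if arrive[n] == latest[n]}
```

In a diamond where one branch is slower, only the slow branch and the join show zero slack, which matches the intuition that speeding up an off-path instruction cannot change total execution time.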