The use of multi-level memory cells increases the probability of faults. Therefore, we co-design the weights and memories such that their properties complement each other and the faults result in no noticeable NN accuracy loss. In the extreme case, the weights in fully connected layers can be stored using a single transistor. With weight pruning and clustering, we show our technique reduces the memory area by over an order of magnitude compared to an SRAM baseline. In the case of VGG16 (130M weights), we are able to store all the weights in 4.9 mm$^2$, well within the area allocated to SRAM in modern NN accelerators.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

One of the biggest performance bottlenecks of today’s neural network (NN) accelerators is off-chip memory accesses. In this paper, we propose a method to use multi-level, embedded non-volatile memory (eNVM) to eliminate all off-chip weight accesses.

@article{Lok2018,
title = {A Low Mass Power Electronics Unit to Drive Piezoelectric Actuators for Flying Microrobots},
author = {Mario Lok and Elizabeth Farrell Helbling and Xuan Zhang and Robert Wood and David Brooks and Gu-Yeon Wei},
url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2018/04/lok-tpe-2018.pdf},
year = {2018},
date = {2018-04-01},
journal = {IEEE Transactions on Power Electronics},
volume = {33},
number = {4},
pages = {3180--3191},
abstract = {This paper presents a power electronics design for the piezoelectric actuators of an insect-scale flapping-wing robot, the RoboBee. The proposed design outputs four high-voltage drive signals tailored for the two bimorph actuators of the RoboBee in an alternating drive configuration. It utilizes fully integrated drive stage circuits with a novel highside gate driver to save chip area and meet the strict mass constraint of the RoboBee. Compared with previous integrated designs, it also boosts efficiency in delivering energy to the actuators and recovering unused energy by applying three power saving techniques, dynamic common mode adjustment, envelope tracking, and charge sharing. Using this design to energize four 15 nF capacitor loads with a 200 V and 100 Hz drive signal and tracking the control commands recorded from an actual flight experiment for the robot, we measure an average power consumption of 290 mW.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}

@unpublished{Reagen2017b,
title = {Weightless: Lossy Weight Encoding For Deep Neural Network Compression},
author = {Brandon Reagen and Udit Gupta and Robert Adolf and Michael M. Mitzenmacher and Alexander M. Rush and Gu-Yeon Wei and David Brooks},
url = {https://arxiv.org/abs/1711.04686},
year = {2017},
date = {2017-11-13},
abstract = {The large memory requirements of deep neural networks limit their deployment and adoption on many devices. Model compression methods effectively reduce the memory requirements of these models, usually through applying transformations such as weight pruning or quantization. In this paper, we present a novel scheme for lossy weight encoding which complements conventional compression techniques. The encoding is based on the Bloomier filter, a probabilistic data structure that can save space at the cost of introducing random errors. Leveraging the ability of neural networks to tolerate these imperfections and by re-training around the errors, the proposed technique, Weightless, can compress DNN weights by up to 496x with the same model accuracy. This results in up to a 1.51x improvement over the state-of-the-art.},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}

@inproceedings{Kodali2017,
title = {Applications of Deep Neural Networks for Ultra Low Power IoT},
author = {Sreela Kodali and Patrick Hansen and Niamh Mulholland and Paul Whatmough and David Brooks and Gu-Yeon Wei},
year = {2017},
date = {2017-11-05},
booktitle = {International Conference on Computer Design},
abstract = {IoT devices are increasing in prevalence and popularity, becoming an indispensable part of daily life. Despite the stringent energy and computational constraints of IoT systems, specialized hardware can enable energy-efficient sensor-data classification in an increasingly diverse range of IoT applications. This paper demonstrates seven different IoT applications using a fully-connected deep neural network (FC-NN) accelerator on 28nm CMOS. The applications include audio keyword spotting, face recognition, and human activity recognition. For each application, a FC-NN model was trained from a preprocessed dataset and mapped to the accelerator. Experimental results indicate the models retained their state-of-the-art accuracy on the accelerator across a broad range of frequencies and voltages. Real-time energy results for the applications were found to be on the order of 100nJ per inference or lower.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

This text serves as a primer for computer architects in a new and rapidly evolving field. We review how machine learning has evolved since its inception in the 1960s and track the key developments leading up to the emergence of the powerful deep learning techniques that emerged in the last decade. Next we review representative workloads, including the most commonly used datasets and seminal networks across a variety of domains. In addition to discussing the workloads themselves, we also detail the most popular deep learning tools and show how aspiring practitioners can use the tools with the workloads to characterize and optimize DNNs.

The remainder of the book is dedicated to the design and optimization of hardware and architectures for machine learning. As high-performance hardware was so instrumental in the success of machine learning becoming a practical solution, this chapter recounts a variety of optimizations proposed recently to further improve future designs. Finally, we present a review of recent research published in the area as well as a taxonomy to help readers understand how various contributions fall in context.},
keywords = {},
pubstate = {published},
tppubtype = {book}
}

Machine learning, and specifically deep learning, has been hugely disruptive in many fields of computer science. The success of deep learning techniques in solving notoriously difficult classification and regression problems has resulted in their rapid adoption in solving real-world problems. The emergence of deep learning is widely attributed to a virtuous cycle whereby fundamental advancements in training deeper models were enabled by the availability of massive datasets and high-performance computer hardware.

In this paper we propose using machine learning to improve the design of deep neural network hardware accelerators. We show how to adapt multi-objective Bayesian optimization to overcome a challenging design problem: optimizing deep neural network hardware accelerators for both accuracy and energy efficiency. DNN accelerators exhibit all aspects of a challenging optimization space: the landscape is rough, evaluating designs is expensive, the objectives compete with each other, and both design spaces (algorithmic and microarchitectural) are unwieldy. With multi-objective Bayesian optimization, the design space exploration is made tractable and the design points found vastly outperform traditional methods across all metrics of interest.

@article{Zhang2017,
title = {A Fully Integrated Battery-Powered System-on-Chip in 40-nm CMOS for Closed-Loop Control of Insect-Scale Pico-Aerial Vehicle},
author = {Xuan Zhang and Mario Lok and Tao Tong and Sae Kyu Lee and Brandon Reagen and Pierre-Emile J. Duhamel and Robert Wood and David Brooks and Gu-Yeon Wei},
url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2018/04/robobee-jssc.pdf},
year = {2017},
date = {2017-06-12},
journal = {IEEE Journal of Solid-State Circuits},
volume = {52},
number = {9},
abstract = {We demonstrate a fully integrated system-on-chip (SoC) optimized for insect-scale flapping-wing pico-aerial vehicles. The SoC is able to meet the stringent weight, power, and real-time performance demands of autonomous flight for a bee-sized robot. The entire integrated system with embedded voltage regulation, data conversion, clock generation, as well as both general-purpose and accelerated computing units, weighs less than 3 mg after die thinning. It is self-contained and can be powered directly off of a lithium battery. Measured results show open-loop wing flapping controlled by the SoC and improved energy efficiency through the use of hardware acceleration and supply resilience through the use of adaptive clocking.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}

@conference{Kanev2017,
title = {Mallacc: Accelerating Memory Allocation},
author = {Svilen Kanev and Sam (Likun) Xi and Gu-Yeon Wei and David Brooks},
url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2017/02/asplos17mallacc.pdf},
year = {2017},
date = {2017-04-08},
booktitle = {International Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
abstract = {Recent work shows that dynamic memory allocation consumes nearly 7% of all cycles in Google datacenters. With the trend towards increased specialization of hardware, we propose Mallacc, an in-core hardware accelerator designed for broad use across a number of high-performance, modern memory allocators. The design of Mallacc is quite different from traditional throughput-oriented hardware accelerators. Because memory allocation requests tend to be very frequent, fast, and interspersed inside other application code, accelerators must be optimized for latency rather than throughput and area overheads must be kept to a bare minimum. Mallacc accelerates the three primary operations of a typical memory allocation request: size class computation, retrieval of a free memory block, and sampling of memory usage. Our results show that malloc latency can be reduced by up to 50% with a hardware cost of less than 1500 μm$^2$ of silicon area, less than 0.006% of a typical high-performance processor core.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}

This paper presents a 16-core voltage-stacked system with adaptive frequency clocking (AFClk) and a fully integrated voltage regulator that demonstrates efficient on-chip power delivery for multicore systems. Voltage stacking alleviates power delivery inefficiencies due to off-chip parasitics but adds complexity to combat internal voltage noise. To address the corresponding issue of internal voltage noise, the system utilizes an AFClk scheme with an efficient switched-capacitor dc-dc converter to mitigate noise on the stack layers and to improve system performance and efficiency. Experimental results demonstrate robust voltage noise mitigation as well as the potential of voltage stacking as a highly efficient power delivery scheme. This paper also illustrates that augmenting the hardware techniques with intelligent workload allocation that exploits the inherent properties of voltage stacking can preemptively reduce the interlayer activity mismatch and improve system efficiency.

@inproceedings{Shao2016,
title = {Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin},
author = {Yakun Sophia Shao and Sam (Likun) Xi and Vijayalakshmi Srinivasan and Gu-Yeon Wei and David Brooks},
url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2016/08/shao_micro2016.pdf},
year = {2016},
date = {2016-10-17},
booktitle = {International Symposium on Microarchitecture (MICRO)},
abstract = {Increasing demand for power-efficient, high-performance computing has spurred a growing number and diversity of hardware accelerators in mobile and server Systems on Chip (SoCs). This paper makes the case that the co-design of the accelerator microarchitecture with the system in which it belongs is critical to balanced, efficient accelerator microarchitectures. We find that data movement and coherence management for accelerators are significant yet often unaccounted components of total accelerator runtime, resulting in misleading performance predictions and inefficient accelerator designs. To explore the design space of accelerator-system co-design, we develop gem5-Aladdin, an SoC simulator that captures dynamic interactions between accelerators and the SoC platform, and validate it to within 6% against real hardware. Our co-design studies show that the optimal energy-delay-product (EDP) of an accelerator microarchitecture can improve by up to 7.4x when system-level effects are considered compared to optimizing accelerators in isolation.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

@inproceedings{Adolf2016,
title = {Fathom: Reference Workloads for Modern Deep Learning Methods},
author = {Robert Adolf and Saketh Rama and Brandon Reagen and Gu-Yeon Wei and David Brooks},
url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2016/08/iiswc2016-final.pdf
http://arxiv.org/abs/1608.06581
https://rdadolf.github.io/fathom/},
year = {2016},
date = {2016-09-25},
booktitle = {IEEE International Symposium on Workload Characterization},
abstract = {Deep learning has been popularized by its recent successes on challenging artificial intelligence problems. One of the reasons for its dominance is also an ongoing challenge: the need for immense amounts of computational power. Hardware architects have responded by proposing a wide array of promising ideas, but to date, the majority of the work has focused on specific algorithms in somewhat narrow application domains. While their specificity does not diminish these approaches, there is a clear need for more flexible solutions. We believe the first step is to examine the characteristics of cutting edge models from across the deep learning community.
Consequently, we have assembled Fathom: a collection of eight archetypal deep learning workloads for study. Each of these models comes from a seminal work in the deep learning community, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook’s AI research group. Fathom has been released online, and this paper focuses on understanding the fundamental performance characteristics of each model. We use a set of application-level modeling tools built around the TensorFlow deep learning framework in order to analyze the behavior of the Fathom workloads. We present a breakdown of where time is spent, the similarities between the performance profiles of our models, an analysis of behavior in inference and training, and the effects of parallelism on scaling.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

This work presents a fully integrated 4-to-1 DC-DC symmetric ladder switched-capacitor converter (SLSCC) for voltage stacking applications. The SLSCC absorbs inter-layer load power mismatch to provide minimum voltage guarantees for the internal rails of a multicore system that implements four-way voltage stacking. A new hybrid feedback control scheme reduces the voltage ripple across stacked voltage layers for high levels of current mismatch, a condition that exacerbates voltage noise in conventional SC converters. Furthermore, the proposed SLSCC dynamically allocates valuable flying capacitor resources according to different load conditions, which improves conversion efficiency and supports more power mismatch between the layers. Implemented in TSMC’s 40G process, the SLSCC converts a 3.6 V input voltage down to four stacked output voltage layers, each nominally at 900 mV.

The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an order of magnitude improvement over general-purpose hardware, few look beyond an initial implementation. This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators. Compared to an established fixed-point accelerator baseline, we show that fine-grained, heterogeneous datatype optimization reduces power by 1.5×; aggressive, inline predication and pruning of small activity values further reduces power by 2.0×; and active hardware fault detection coupled with domain-aware error mitigation eliminates an additional 2.7× through lowering SRAM voltages. Across five datasets, these optimizations provide a collective average of 8.1× power reduction over an accelerator baseline without compromising DNN model accuracy. Minerva enables highly accurate, ultra-low power DNN accelerators (in the range of tens of milliwatts), making it feasible to deploy DNNs in power-constrained IoT and mobile devices.

This paper describes a power electronics unit (PEU) for an insect-scale flapping-wing robot. Three power saving techniques used in the actuator driver of the PEU — envelope tracking, dynamic common mode, and charge sharing — reduce power consumption while retaining weight benefits of an inductor-less linear driver. A pair of actuator driver ICs energize four 15nF capacitor loads, which represent the piezoelectric actuators of a flapping-wing robot. The PEU consumes 290mW, which translates to 37% lower power compared to a design without these power saving techniques.

A body-coupled symmetric wakeup transceiver is proposed for always-on device discovery in IoT applications requiring security and low power consumption. The wakeup transceiver (WTRx) is implemented in 65nm CMOS, using digital logic cells and operates at 0.6V. A directly-modulated open-loop DCO generates an OOK-modulated 10MHz carrier, with a frequency-locked loop for intermittent calibration. A passive receiver incorporates a digital IO cell as hysteretic comparator, with a two-phase correlator bank. A novel MAC scheme allows for duty-cycling in both transmitter and receiver. Measured power consumption is 3.54μW, with sensitivity of 88mV and maximum wakeup latency of 150ms.

@inproceedings{kanev15wsc,
title = {Profiling a Warehouse-Scale Computer},
author = {Svilen Kanev and Juan Pablo Darago and Kim Hazelwood and Parthasarathy Ranganathan and Tipp Moseley and Gu-Yeon Wei and David Brooks},
url = {http://www.eecs.harvard.edu/~skanev/papers/isca15wsc.pdf},
year = {2015},
date = {2015-06-15},
booktitle = {International Symposium on Computer Architecture (ISCA)},
abstract = {With the increasing prevalence of warehouse-scale (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three year period, and comprising thousands of different applications.

We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This ``datacenter tax'' can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and prefer memory latency over bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

@conference{Xi2015_hpca,
title = {Quantifying Sources of Error in McPAT and Potential Impacts on Architectural Studies},
author = {Sam Xi and Hans Jacobson and Pradip Bose and Gu-Yeon Wei and David Brooks},
url = {http://www.samxi.org/papers/xi_hpca2015.pdf},
year = {2015},
date = {2015-02-07},
booktitle = {International Symposium on High Performance Computer Architecture (HPCA)},
abstract = {Architectural power modeling tools are widely used by the computer architecture community for rapid evaluations of high-level design choices and design space explorations. Currently, McPAT is the de facto power model, but the literature does not yet contain a careful examination of its modeling accuracy. In addition, the issue of how greatly power modeling error can affect architectural-level studies has not been quantified before. In this work, we present the first rigorous assessment of McPAT’s core power and area models with a detailed, validated power modeling toolchain used in current industrial practice. We find that McPAT’s predictions can have significant error because some of the models are either incomplete, too high-level, or assume implementations of structures that differ from that of the core at hand. We demonstrate that large errors are possible when using McPAT’s dynamic power estimates in the context of voltage noise and thermal hotspots, but for steady-state properties, accurately modeling leakage power is more important. Based on our analysis, we are able to provide guidelines for creating accurate McPAT models, even without access to detailed industrial power modeling tools. We conclude that in spite of its accuracy gaps, McPAT is still a very useful tool for many architectural studies, and its limitations can often be adequately addressed for a given research study of interest.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}

@conference{Campanoni2015_cgo,
title = {HELIX-UP: Relaxing Program Semantics to Unleash Parallelization},
author = {Simone Campanoni and Glenn Holloway and Gu-Yeon Wei and David Brooks},
url = {http://www.eecs.harvard.edu/~xan/lib/exe/fetch.php?media=research:cgo2015_paper.pdf},
year = {2015},
date = {2015-02-07},
booktitle = {International Symposium on Code Generation and Optimization (CGO)},
abstract = {Automatic generation of parallel code for general-purpose commodity processors is a challenging computational problem. Nevertheless, there is a lot of latent thread-level parallelism in the way sequential programs are actually used. To convert latent parallelism into performance gains, users may be willing to compromise on the quality of a program's results. We have developed a parallelizing compiler and runtime that substantially improve scalability by allowing parallelized code to briefly sidestep strict adherence to language semantics at run time. In addition to boosting performance, our approach limits the sensitivity of parallelized code to the parameters of target CPUs (such as core-to-core communication latency) and the accuracy of data dependence analysis.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}

Increasing demand for power-efficient, high-performance computing has spurred a growing number and diversity of hardware accelerators in mobile Systems on Chip (SoCs) as well as servers and desktops. Despite their energy efficiency, fixed-function accelerators lack programmability, especially compared with general-purpose processors. Today’s accelerators rely on software-managed scratchpad memory and Direct Memory Access (DMA) to provide fixed-latency memory access and data transfer, which leads to significant chip resource and software engineering costs. On the other hand, hardware-managed caches with support for virtual memory and cache coherence are well-known to ease programmability in general-purpose processors, but these features are not commonly supported in today’s fixed-function accelerators. As a first step toward cache-friendly accelerator design, this paper discusses limitations of scratchpad-based memories in today’s accelerators, identifies challenges to support hardware-managed caches, and explores opportunities to ease the cache integration.

@conference{kanev14epb,
title = {Tradeoffs between Power Management and Tail Latency in Warehouse-Scale Applications},
author = {Svilen Kanev and Kim Hazelwood and Gu-Yeon Wei and David Brooks},
url = {http://www.eecs.harvard.edu/~skanev/papers/iiswc14ep.pdf},
year = {2014},
date = {2014-10-26},
booktitle = {International Symposium on Workload Characterization (IISWC)},
publisher = {IEEE},
abstract = {The growth in datacenter computing has increased the importance of energy-efficiency in servers. Techniques to reduce power have brought server designs close to achieving energy-proportional computing. However, they stress the inherent tradeoff between aggressive power management and quality of service (QoS) – the dominant metric of performance in datacenters. In this paper, we characterize this tradeoff for 15 benchmarks representing workloads from Google’s datacenters. We show that 9 of these benchmarks often toggle their cores between short bursts of activity and sleep. In doing so, they stress sleep selection algorithms and can cause tail latency degradation or missed potential for power savings of up to 10% on important workloads like web search. However, improving sleep selection alone is not sufficient for large efficiency gains on current server hardware. To guide the direction needed for such large gains, we profile datacenter applications for susceptibility to dynamic voltage and frequency scaling (DVFS). We find the largest potential in DVFS which is cognizant of latency/power tradeoffs on a workload-per-workload basis.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}

This paper introduces the ShrinkFit accelerator framework, which simplifies the design of systems combining multiple accelerators. A single ShrinkFit system design can be deployed to FPGAs large and small, without time-consuming architectural parameter surveys. We describe four ShrinkFit accelerators implemented for an FPGA-based robotic bee brain prototype and demonstrate the flexibility of ShrinkFit with low performance overheads (under 10% on average) and low resource overheads (0-8% for accelerators and under 2% for hard logic blocks).

@inproceedings{Reagen2014,
title = {MachSuite: Benchmarks for Accelerator Design and Customized Architectures},
author = {Brandon Reagen and Robert Adolf and Sophia Yakun Shao and Gu-Yeon Wei and David Brooks},
url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2016/09/machsuite.pdf},
year = {2014},
date = {2014-07-01},
booktitle = {IEEE International Symposium on Workload Characterization (IISWC)},
abstract = {Recent high-level synthesis and accelerator-related architecture papers show a great disparity in workload selection among projects and research groups. To provide standardization within the accelerator research community, we present MachSuite, a benchmark suite for high-level synthesis tools and accelerator-centric architectures. MachSuite is the compilation of carefully selected workloads to cover a diverse application space and algorithm choices. All the benchmarks in MachSuite are implemented to be well suited for high-level synthesis. A thorough characterization further demonstrates the diverse behaviors among benchmarks, representative of different customization challenges. MachSuite enables commensurability across research projects while mitigating the burden of accelerator implementation and workload selection.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

@inproceedings{shao2013isca,
title = {Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures},
author = {Yakun Sophia Shao and Brandon Reagen and Gu-Yeon Wei and David Brooks},
url = {http://www.eecs.harvard.edu/~shao/papers/shao2014-isca.pdf},
year = {2014},
date = {2014-06-13},
booktitle = {International Symposium on Computer Architecture (ISCA)},
abstract = {Hardware specialization, in the form of accelerators that provide custom datapath and control for specific algorithms and applications, promises impressive performance and energy advantages compared to traditional architectures. Current research in accelerator analysis relies on RTL-based synthesis flows to produce accurate timing, power, and area estimates. Such techniques not only require significant effort and expertise but are also slow and tedious to use, making large design space exploration infeasible. To overcome this problem, we present Aladdin, a pre-RTL, power-performance accelerator modeling framework and demonstrate its application to system-on-chip (SoC) simulation. Aladdin estimates performance, power, and area of accelerators within 0.9%, 4.9%, and 6.6% with respect to RTL implementations. Integrated with architecture-level core and memory hierarchy simulators, Aladdin provides researchers an approach to model the power and performance of accelerators in an SoC environment.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.

@conference{campanoni2014xml,
title = {Breaking Cyclic-Multithreading Parallelization with XML Parsing},
author = {Simone Campanoni and Svilen Kanev and Kevin Brownell and Gu-Yeon Wei and David Brooks},
url = {http://www.eecs.harvard.edu/~skanev/papers/prism14xml.pdf},
year = {2014},
date = {2014-06-13},
booktitle = {International Workshop on Parallelism in Mobile Platforms (PRISM)},
abstract = {HELIX-RC, a modern re-evaluation of the cyclic-multithreading (CMT) compiler technique, extracts threads from sequential code automatically. As a CMT approach, HELIX-RC gains performance by running iterations of the same loop on different cores in a multicore. It successfully boosts performance for several SPEC CINT benchmarks previously considered unparallelizable. However, this paper shows there are workloads with different characteristics, which even idealized CMT cannot parallelize. We identify how to overcome an inherent limitation of CMT for these workloads. CMT techniques only run iterations of a single loop in parallel at any given time. We propose exploiting parallelism not only within a single loop, but also among multiple loops. We call this execution model Multiple CMT (MCMT), and show that it is crucial for auto-parallelizing a broader class of workloads. To highlight the need for MCMT, we target a workload that is naturally hard for CMT -- parsing XML-structured data. We show that even idealized CMT fails on XML parsing. Instead, MCMT extracts speedups up to 3.9x on 4 cores.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}

@inproceedings{shao2013islped,
title = {Energy Characterization and Instruction-Level Energy Model of Intel's Xeon Phi Processor},
author = {Yakun Sophia Shao and David Brooks},
url = {http://www.eecs.harvard.edu/~shao/papers/shao2013-islped.pdf},
year = {2013},
date = {2013-09-01},
booktitle = {International Symposium on Low Power Electronics and Design (ISLPED)},
abstract = {Intel’s Xeon Phi is the first commercial many-core/multi-thread x86-based processor. Xeon Phi belongs to a new breed of high performance computing processors that seek high compute density as well as energy efficiency. However, no high-level energy model is available for Xeon Phi software developers to quickly evaluate and optimize energy efficiency. This work demonstrates an instruction-level energy model for the Xeon Phi processor to facilitate the development of energy-efficient software. In order to construct this model, we first characterize the energy consumption of the processor, identifying how energy per instruction scales with the number of cores, the number of active threads per core, and instruction types. Based on the energy characterization, we construct an instruction-level energy model and validate the accuracy of the model between 1% and 5% for real world benchmarks. We show that the energy model can be used to identify software inefficiencies for these benchmarks and find that Linpack code can be optimized to increase energy efficiency by as much as 10%.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

@inproceedings{reagen2013islped,
title = {Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware},
author = {Brandon Reagen and Yakun Sophia Shao and Gu-Yeon Wei and David Brooks},
url = {http://www.eecs.harvard.edu/~shao/papers/reagen2013-islped.pdf},
year = {2013},
date = {2013-09-01},
booktitle = {International Symposium on Low Power Electronics and Design (ISLPED)},
abstract = {As the traditional performance gains of technology scaling diminish, one of the most promising directions is building special purpose fixed function hardware blocks, commonly referred to as accelerators. Accelerators have become prevalent in industrial SoC designs for their low power, high performance potential. In this work we explore thousands of implementations of classical software workloads in hardware. This thorough, detailed design space search of hardware accelerators gives architects a quantitative way to reason about the differences in implementations. The exploration presented in this work shows that the space is full of poor design choices. By thoroughly analyzing each benchmark, we show which provide the most performance when implemented in hardware given a fixed power budget and explain which design techniques work best for each workload.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

@inproceedings{shao2013ispass,
title = {ISA-Independent Workload Characterization and its Implications for Specialized Architectures},
author = {Yakun Sophia Shao and David Brooks},
url = {http://www.eecs.harvard.edu/~shao/papers/shao2013-ispass.pdf},
year = {2013},
date = {2013-04-01},
booktitle = {International Symposium on Performance Analysis of Systems and Software (ISPASS)},
abstract = {Specialized architectures will become increasingly important as the computing industry demands more energy-efficient designs. The application-centric design style for these architectures is heavily dependent on workload characterization of intrinsic program characteristics, but at the same time these architectures are likely to be decoupled from legacy ISAs. In this work, we perform ISA-independent workload characterization for a variety of important intrinsic program characteristics relating to computation, memory, and control flow. The analysis is performed using a JIT compiler that emits ISA-independent instructions. We compare this analysis with an x86 trace and find that several of the analyses are highly sensitive to the ISA. We conclude that designers of specialized architectures must adopt ISA-independent workload characterization approaches.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

In this paper, we characterize the impact of compiler optimizations on voltage noise. While intuition may suggest that the better processor utilization ensured by optimizing compilers results in a small amount of voltage variation, our measurements on an Intel Core2 Duo processor show the opposite – the majority of SPEC 2006 benchmarks exhibit more voltage droops when aggressively optimized. We show that this increase in noise could be sufficient for a net performance decrease in a typical-case, resilient design.

Lowering the supply voltage to improve energy efficiency leads to higher load current and elevated supply sensitivity. In this paper, we provide the first quantitative analysis of voltage noise in multi-core near-threshold processors in a future 10nm technology across SPEC CPU2006 benchmarks. Our results reveal a larger guardband requirement and significant energy efficiency loss due to power delivery nonidealities at near threshold, and highlight the importance of accurate voltage noise characterization for design exploration of energy-centric computing systems using near-threshold cores.

@article{lyons2012accelerator,
title = {The Accelerator Store: a shared memory framework for accelerator-based systems},
author = {Michael J Lyons and Mark Hempstead and Gu-Yeon Wei and David Brooks},
url = {http://www.eecs.harvard.edu/~mjlyons/papers/lyons-accelerator-store-taco-2012.pdf},
year = {2012},
date = {2012-01-01},
journal = {Transactions on Architecture and Code Optimization (TACO)},
publisher = {ACM},
abstract = {This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip’s high performance and power-efficient hardware accelerators. In preparation for systems containing tens or hundreds of accelerators, we characterize a diverse pool of accelerators and find each contains significant amounts of SRAM memory (up to 90% of their area). We take advantage of this discovery and introduce the accelerator store, a scalable architectural component to minimize accelerator area by sharing its memories between accelerators. We evaluate the accelerator store for two applications and find significant system area reductions (30%) in exchange for small overheads (2% performance, 0%–8% energy). The paper also identifies new research directions enabled by the accelerator store and the many-accelerator architecture.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}

@inproceedings{campanoni2012helix,
title = {HELIX: Automatic parallelization of irregular programs for chip multiprocessing},
author = {Simone Campanoni and Timothy Jones and Glenn Holloway and Vijay Janapa Reddi and Gu-Yeon Wei and David Brooks},
url = {http://helix.eecs.harvard.edu/images/2/21/CGO2012_HELIX.pdf},
year = {2012},
date = {2012-01-01},
booktitle = {International Symposium on Code Generation and Optimization (CGO)},
organization = {ACM},
abstract = {We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel Core i7-980X, HELIX achieves speedups averaging 2.25, with a maximum of 4.12, for thirteen C benchmarks from SPEC CPU2000.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

@inproceedings{kanev2012xiosim,
title = {XIOSim: power-performance modeling of mobile x86 cores},
author = {Svilen Kanev and Gu-Yeon Wei and David Brooks},
url = {http://www.eecs.harvard.edu/~skanev/papers/islped12xiosim.pdf},
year = {2012},
date = {2012-01-01},
booktitle = {International Symposium on Low Power Electronics and Design (ISLPED)},
organization = {ACM},
abstract = {Simulation is one of the main vehicles of computer architecture research. In this paper, we present XIOSim, a highly detailed microarchitectural simulator targeted at mobile x86 microprocessors. The simulator execution model that we propose is a blend between traditional user-level simulation and full-system simulation. Our current implementation features detailed power and performance core models which allow microarchitectural exploration. Using a novel validation methodology, we show that XIOSim’s performance models manage to stay well within 10% of real hardware for the whole SPEC CPU2006 suite. Furthermore, we validate power models against measured data to show a deviation of less than 5% in terms of average power consumption.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

@inproceedings{campanoni2012helixb,
title = {The HELIX project: overview and directions},
author = {Simone Campanoni and Timothy Jones and Glenn Holloway and Gu-Yeon Wei and David Brooks},
url = {http://helix.eecs.harvard.edu/images/c/c4/DAC2012_Paper.pdf},
year = {2012},
date = {2012-01-01},
booktitle = {Design Automation Conference (DAC)},
organization = {ACM},
abstract = {Parallelism has become the primary way to maximize processor performance and power efficiency. But because creating parallel programs by hand is difficult and prone to error, there is an urgent need for automatic ways of transforming conventional programs to exploit modern multicore systems. The HELIX compiler transformation is one such technique that has proven effective at parallelizing individual sequential programs automatically for a real six-core processor. We describe that transformation in the context of the broader HELIX research project, which aims to optimize the throughput of a multicore processor by coordinated changes in its architecture, its compiler, and its operating system. The goal is to make automatic parallelization mainstream in multiprogramming settings through adaptive algorithms for extracting and tuning thread-level parallelism.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

@article{campanoni2012making,
title = {Making the Extraction of Thread-Level Parallelism Mainstream},
author = {Simone Campanoni and Timothy Jones and Glenn Holloway and Gu-Yeon Wei and David Brooks},
url = {http://helix.eecs.harvard.edu/images/c/c7/IEEEMICRO2012_Paper.pdf},
year = {2012},
date = {2012-01-01},
journal = {IEEE Micro},
publisher = {IEEE},
abstract = {Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still do not parallelize code automatically. Promising parallelization approaches have either required manual programmer assistance, depended on special hardware features, or risked slowing down programs they should have speeded up. HELIX is one such approach that automatically parallelizes general-purpose programs without requiring any special hardware. In this paper we show that in practice HELIX always avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers. We also show experimentally that HELIX outperforms the most similar historical technique that has been implemented in production compilers.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
