HPCwire » energy efficiency
Since 1986 - Covering the Fastest Computers in the World and the People Who Run Them

NERSC Highlights Exascale-Energy Connection
Thu, 11 Sep 2014

It's been almost six months since the National Energy Research Scientific Computing Center (NERSC) formally announced Cori, the new supercomputer set to be installed at the lab in the mid-2016 timeframe. Originally known as NERSC-8, the new machine will sport over 9,300 nodes equipped with next-generation Knights Landing chips housed within a Cray XC environment. Each of these chips is capable of delivering more than three teraflops of double-precision performance, which altogether should yield a ten-fold increase in application performance over Hopper (aka NERSC-6), for an estimated peak system performance in the neighborhood of 30 petaflops.

The DOE’s Advanced Scientific Computing Research (ASCR), which funds and manages the NERSC facility, recently published an article exploring how Cori will support research into exascale concepts.

Performance enhancement is only one part of the story. Equally important to this goal is energy efficiency. This means doing more computing with less energy – and it’s one of the essential design elements in the development of Cori.

“Our science users are telling us they need to get to the level of hundreds of petaflops (quadrillion calculations per second) within this decade, and the only way we can get to that level of computing is to make this transition to energy-efficient architectures,” says NERSC Director Sudip Dosanjh.

“With Cori we will begin transitioning the broad range of (DOE) Office of Science codes that run at NERSC to energy-efficient many-core computer architectures.”

The Knights Landing processor, with its emphasis on optimizing the FLOPS-per-watt equation, is at the heart of this strategy.

Knights Landing is expected to provide between 14 and 16 gigaflops per watt. Compare this to the reigning Green500 champ, Tsubame-KFC: so far this system from the Tokyo Institute of Technology is the only one to break the 4 gigaflops-per-watt mark.

Going forward, revving clock speed will no longer be an effective way to wring out more performance, because all those clock ticks draw energy and release heat, meaning additional energy is needed to power and cool the machine, adding up to as much as tens of millions of dollars each year.

Intel’s approach for Knights Landing is to put lots of energy-efficient cores (up to 72) on the same chip. Dosanjh says this translates to “a lot more computing using just a little more power.”

Memory was another design focus. Dosanjh explains that NERSC applications are hindered more by memory bandwidth and data movement than by floating point capabilities.

Cori will provide over 400 gigabits per second of I/O bandwidth and 28 petabytes of disk space. The upcoming Knights Landing chip employs three-dimensional on-package memory, developed by Intel and Micron, that is projected to deliver five times more bandwidth than DDR4 memory, helping to alleviate I/O bottlenecks. The system will also include a layer of Non-Volatile Random-Access Memory (NVRAM), called a Burst Buffer, that will move data more quickly between processors and disk, a boon for data-intensive science.
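
To put the Burst Buffer idea in concrete terms, here is a rough back-of-the-envelope sketch in Python. Only the 400 gigabits-per-second figure comes from the article; the checkpoint size and the burst-buffer bandwidth are assumptions chosen purely for illustration, not published Cori specifications.

# Rough, illustrative comparison of checkpoint drain times with and without a
# burst buffer. Only the 400 Gb/s figure comes from the article; the checkpoint
# size and the burst-buffer bandwidth are assumptions for the sake of example.

def drain_time_s(checkpoint_bytes, bandwidth_bytes_per_s):
    # Time to move one checkpoint at a given sustained bandwidth.
    return checkpoint_bytes / bandwidth_bytes_per_s

checkpoint_bytes = 100e12        # a hypothetical 100 TB application checkpoint
disk_bw          = 400e9 / 8     # 400 gigabits per second expressed in bytes/s
burst_buffer_bw  = 1.5e12        # assumed NVRAM burst-buffer bandwidth, bytes/s

print(f"direct to disk:   {drain_time_s(checkpoint_bytes, disk_bw) / 60:5.1f} minutes")
print(f"via burst buffer: {drain_time_s(checkpoint_bytes, burst_buffer_bw) / 60:5.1f} minutes")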

All told, Cori will provide NERSC's more than 5,000 users with approximately 30 petaflops of peak performance to support an estimated 650 applications geared to a wide range of science and engineering disciplines. The NERSC Exascale Science Applications Program (NESAP) will be instrumental in getting these codes to take advantage of the manycore architecture and the additional memory layers, says Dosanjh.

When Cori arrives in 2016, it will be housed inside Lawrence Berkeley National Laboratory's Computational Research and Theory building. This energy-efficient center, which relies on San Francisco Bay air for free cooling, is currently being constructed on the main Berkeley Lab campus.

Tackling the Power and Energy Wall for Future HPC Systems
Tue, 17 Dec 2013

As the cost of powering a supercomputer or a datacenter increases, next-generation exascale systems need to be considerably more power- and energy-efficient than current supercomputers to be of practical use. Constrained power consumption (20-25 MW for the entire system is the target the DOE Office of Science gave to the HPC community) is one of the limiting factors on the road to achieving sustainable performance at exascale. In fact, the power challenge is so fundamental that other challenges can be reduced to power limitations. For example, operating at near-threshold voltage (NTV) in order to perform computation within a given power budget may considerably increase the soft-error rate (the resilience challenge). Unlike petascale systems, where the primary concern was performance, exascale systems need to climb the power and energy walls in order to deliver sustainable exaflops performance. At the Pacific Northwest National Laboratory (PNNL) we are holistically exploring energy and power efficiency at all levels of granularity, from processor architecture to system integration. We are also tackling the power and energy problems from several angles, from system software and programming models to performance and power modeling of scientific applications and extreme-scale systems.

PNNL's computing facilities, such as its Institutional HPC system (PIC) and an earlier testbed, the Energy Smart Data Center (ESDC), provide research platforms to address what-if questions related to the use of suitable datacenter metrics that are meaningful to the HPC community. The measurement harness of the ESDC entailed over a thousand out-of-band sensors measuring power, flow, pressure and temperature at the machine room and at the IT equipment. PIC is another example that substantiates our integrated datacenter vision to drive energy efficiency research. This system is housed in a geothermally cooled datacenter with rear-door heat exchangers. The facility is instrumented at the machine-room and system level, providing insight into macro-level machine-room power efficiencies and micro-level energy efficiencies at the server and motherboard component levels.

Despite its importance for future exascale systems, power is still not considered a first-class citizen, which complicates the development of power-aware software algorithms. In PNNL's vision, power should be considered a resource, just like processing elements or memory modules, and should be managed as such by the system software. System software must be able to precisely measure (in-band) power resource utilization, i.e., how much power is consumed by each system component at any given time. More importantly, the system software should adapt the application to the contingent execution environment, e.g., by allocating sustained power to threads on the application's critical path or promptly moving idle cores to low-power states. The design and development of such self-aware/self-adaptive system software is an active research area at PNNL. We recently analyzed the power characteristics of scientific applications from the DOE ASCR Exascale Co-Design Center and, more generally, from the HPC community to identify opportunities for power savings. Given the lack of in-band, fine-grained (both in space and time) power sensors, we developed an accurate per-core proxy power sensor model that estimates the active power of each core by inspecting the core's activity. We use statistical regression techniques to formulate closed-form expressions for the estimated core and system power consumption. These techniques enable us to develop power-aware algorithms and characterize applications running even on non-instrumented compute nodes. Our experiments show that processes in the same application may not have the same power profile and/or may alternate high-power and low-power phases independently of one another. These alternating behaviors create opportunities for shifting power towards compute-demanding processes, thereby saving power without diminishing performance.
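
As a rough illustration of what such a regression-based proxy looks like, here is a minimal Python sketch using ordinary least squares. The activity metrics, sample values and per-core power readings are invented for the example; this is not PNNL's actual model or its coefficients.

# Minimal sketch of a regression-based per-core power proxy, in the spirit of the
# approach described above. Counter names and values are illustrative assumptions;
# a real model would be calibrated against instrumented nodes.
import numpy as np

# Training data: rows = samples, columns = per-core activity metrics
# (e.g., instructions retired, last-level cache misses, core frequency).
activity = np.array([
    [2.1e9, 1.2e6, 2.4e9],
    [0.3e9, 0.1e6, 1.2e9],
    [3.5e9, 4.0e6, 2.4e9],
    [1.0e9, 0.8e6, 1.6e9],
])
measured_power = np.array([14.0, 5.5, 18.2, 9.1])   # watts per core (assumed)

# Fit a closed-form linear model: power ~ w . activity + idle_power
X = np.hstack([activity, np.ones((activity.shape[0], 1))])
coeffs, *_ = np.linalg.lstsq(X, measured_power, rcond=None)

def estimate_core_power(sample):
    # Estimate the active power of one core from its activity counters.
    return float(np.dot(coeffs[:-1], sample) + coeffs[-1])

print(estimate_core_power([1.8e9, 1.0e6, 2.0e9]))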

There is strong agreement among researchers on the increasing cost of data movement with respect to computation. This ratio will further increase in future systems that approach NTV operation levels: the energy consumption of a double-precision register-to-register floating-point operation is expected to decrease by 10x by 2018. The energy cost of moving data from memory to processor is not expected to follow the same trend, hence the relative energy cost of data movement with respect to performing a register-to-register operation will increase (the energy wall, analogous to the memory wall). In a recent study we modeled the energy cost of moving data across the memory hierarchy of current systems and analyzed the energy cost of data movement for scientific applications. In this study, we answer several important questions, such as what fraction of an application's total energy consumption is spent on data movement, and what the dominant component of data-movement energy is for current and future parallel applications. Our results show that the energy cost of data movement affects each application differently, ranging from 18% to 40% of total energy. This percentage might increase in the future, as the energy cost of performing computation decreases. To avoid such a scenario, new technologies, such as Processing-In-Memory, Non-Volatile RAM and 3D-stacked memory, become essential for the development of sustainable exascale computing. We also noticed that the energy spent in resolving data dependencies, speculation and out-of-order scheduling of instructions accounts for a considerable part of the total dynamic energy, between 22% and 35%. This cost can be reduced with simpler processor core designs that are more energy-efficient.
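
A back-of-the-envelope sketch of how such a data-movement share can be estimated from per-event energy costs and event counts; all of the numbers below are assumed values for illustration, not measurements from the study.

# Toy model of the data-movement share of an application's dynamic energy.
# The per-access energies and event counts are assumptions for illustration only.

PJ = 1e-12
energy_per_event = {            # assumed energy per event, in joules
    "flop":        20 * PJ,     # double-precision register-to-register op
    "l1_access":   10 * PJ,
    "l2_access":   40 * PJ,
    "dram_access": 1500 * PJ,
}
event_counts = {                # assumed counts for a hypothetical run
    "flop":        5e12,
    "l1_access":   2e12,
    "l2_access":   2e11,
    "dram_access": 1e10,
}

total = sum(energy_per_event[e] * event_counts[e] for e in event_counts)
movement = sum(energy_per_event[e] * event_counts[e]
               for e in event_counts if e != "flop")
print(f"data movement share of dynamic energy: {movement / total:.0%}")   # ~30%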

Given the increasing complexity of future exascale applications and systems, designers need new, sophisticated tools to navigate the design space. These tools must capture a range of metrics that are of interest to system and application designers, including performance and power consumption. PNNL has historically developed application-specific performance tools that model the evolution of parallel applications. While these models have shown themselves to be powerful tools for understanding the mapping of applications to complex system architectures, the metrics of interest are expanding to include power consumption as well. To this end, PNNL researchers have developed a methodology for modeling performance and power in concert that builds upon the lab's experience of co-designing systems and applications. This modeling capability has been developed along three axes. The first is the deployment of a workload-specific quantitative power modeling capability. Such power models accurately capture workload phases, their impact on power consumption, and how they are impacted by system architecture and configuration (e.g., processor clock speed). The second axis is the integration of the performance and power modeling methodologies. To this end, it is critical that both modeling methods operate at the same conceptual level. In other words, application phases or components that are captured in one model must also be reflected in the other, so that trade-offs between power and performance may be captured and quantified. The last axis of development involves integrating these models with our self-aware/self-adaptive software system that will provide mechanisms for dynamically optimizing ongoing application execution. We have developed the concept of Energy Templates, which are a mechanism for passing application-specific behavioral information to the underlying runtime layers. Energy Templates capture per-core idle/busy states, as well as the amount of time each core expects to remain in each state, allowing runtime software to determine appropriate opportunities to exercise power-saving features provided by the hardware/software platform (e.g., Dynamic Voltage and Frequency Scaling, or DVFS) without negatively impacting performance. By proactively using application-specific information, Energy Templates are able to exploit energy-saving opportunities that are not available to mechanisms that are not application-aware.
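
The following minimal Python sketch illustrates the Energy Template idea described above: the application declares, per core, the phases it expects (busy or idle, with an expected duration), and the runtime uses that hint to decide when a low-power state is worth the transition overhead. The data structure, the overhead value and the decision rule are simplifications assumed for the example, not PNNL's implementation.

# Sketch of an Energy Template and a runtime decision rule built on it.
# The switching overhead and the threshold are assumed values.
from dataclasses import dataclass

DVFS_SWITCH_OVERHEAD_S = 0.002   # assumed cost of one frequency/voltage switch

@dataclass
class Phase:
    state: str          # "busy" or "idle"
    expected_s: float   # how long the core expects to stay in this state

def plan_power_states(template):
    # Return a per-phase power-state decision for one core's template.
    decisions = []
    for phase in template:
        if phase.state == "idle" and phase.expected_s > 2 * DVFS_SWITCH_OVERHEAD_S:
            decisions.append("low-power")   # long enough to amortize the switch
        else:
            decisions.append("nominal")     # too short, or on the critical path
    return decisions

core0 = [Phase("busy", 0.050), Phase("idle", 0.020), Phase("idle", 0.001)]
print(plan_power_states(core0))   # ['nominal', 'low-power', 'nominal']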

The research at PNNL is also being applied within the new DARPA program on Power Efficiency Revolution for Embedded Computing Technologies (PERFECT). We see that the technologies being developed for high performance computing and for embedded systems are fundamentally the same. These may well converge in the future, and thus common tools and techniques can be developed that encompass both. Within PERFECT, PNNL researchers are developing a coherent framework that is able both to empirically analyze current systems and to predictively assess future technologies.

Finally, PNNL's research extends to the datacenter: this research direction is approached in an integrated fashion, where IT power consumption for applications of interest to the DOE is correlated with the power consumption of the supporting infrastructure. An integrated approach allows researchers to formulate what-if questions in an HPC setting, such as the applicability and efficacy of novel cooling solutions (e.g., spray cooling) at the heat source versus a traditional global cooling solution.

Overall, PNNL is actively participating in (and in many cases leading) several DOE and DARPA projects, as well as internal projects, that aim at understanding the impact of the power and energy walls on exascale systems and deploying power- and energy-aware solutions at all levels of the system and application design and optimization. The insights gained throughout these efforts and projects will contribute towards the design of power- and energy-efficient exascale systems.

Exascale Requires 25x Boost in Energy Efficiency, NVIDIA's Dally Says
Mon, 24 Jun 2013

While there is a universal desire in the HPC community to build the world's first exascale system, the achievement will require major breakthroughs not only in chip design and power utilization but also in programming methods, NVIDIA chief scientist Bill Dally said in a keynote address at ISC 2013 last week in Leipzig, Germany.

In last Monday’s speech, titled “Future Challenges of Large-scale Computing,” Dally outlined what needs to happen to achieve an exascale system in the next 10 years. According to Dally, who is also a senior vice president of research at NVIDIA and a professor at Stanford University, it boils down to two issues: power and programming.

Power may present the biggest dilemma to building an exascale system, which is defined as delivering 1 exaflop (or 1,000 petaflops) of floating point operations per second. The world's top-ranked supercomputer is the new Tianhe-2, which recorded 33.8 petaflops on the latest TOP500 list of the world's fastest supercomputers while consuming nearly 18 megawatts of electricity. It has a theoretical peak of nearly 55 petaflops.

Theoretically, an exascale system could be built using only x86 processors, Dally said, but it would require as much as 2 gigawatts of power. That’s equivalent to the entire output of the Hoover Dam, Dally said, according to an NVIDIA blog post on the keynote.

Using GPUs in addition to x86 processors is a better approach to exascale, but it only gets you part of the way. According to Dally, an exascale system built with NVIDIA Kepler K20 coprocessors would consume about 150 megawatts. That's nearly 10 times the amount consumed by Tianhe-2, which is composed of 32,000 Intel Ivy Bridge sockets and 48,000 Xeon Phi boards.

Instead, HPC system developers need to take an entirely new approach to get around the power crunch, Dally said. The NVIDIA chief scientist said reaching exascale will require a 25x improvement in energy efficiency. So the 2 gigaflops per watt that can be squeezed from today’s systems needs to improve to about 50 gigaflops per watt in the future exascale system.

Relying on Moore’s Law to get that 25x improvement is probably not the best approach either. According to Dally, advances in manufacturing processes will deliver about a 2.2x improvement in performance per watt. That leaves an energy efficiency gap of 12x that needs to be filled in by other means.

Dally sees a combination of better circuit design and better processor architectures closing the gap. If done correctly, these advances could deliver 3x and 4x improvements in performance per watt, respectively.
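
The arithmetic behind these figures is simple enough to sketch in a few lines; the values are the ones quoted in the talk, and the rounding matches the article.

# Arithmetic behind the 25x target described above (values as quoted in the talk).
today_gflops_per_watt  = 2      # roughly what today's systems deliver
target_gflops_per_watt = 50     # needed for ~1 exaflop within a practical power budget
required = target_gflops_per_watt / today_gflops_per_watt      # 25x overall
from_process = 2.2                                             # expected from manufacturing advances
remaining = required / from_process                            # ~11.4x, i.e. roughly 12x
circuits, architecture = 3, 4                                  # Dally's estimates for the rest
print(required, round(remaining, 1), circuits * architecture)  # 25.0 11.4 12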

According to NVIDIA's blog, Dally is overseeing several programs in the engineering department that could deliver energy improvements, including hierarchical register files, two-level scheduling, and temporal SIMT.

Improving the arithmetic capabilities of processors will only get you so far in solving the power crunch, he said. “We’ve been so fixated on counting flops that we think they matter in terms of power, but communication inside the system takes more energy than arithmetic,” Dally said. “Power goes into moving data around. Power limits all computing and communication dominates power.”

Besides addressing the power crunch, the way that supercomputers are programmed today also serves as an impediment to exascale systems.

Programmers today are overburdened and try to do too much with a limited array of tools, Dally said. A strict division of labor should be instituted among the triumvirate of programmers, tools, and the architecture to drive efficiency into HPC systems.

The best result is delivered when each group “plays their positions,” he said. Programmers ought to spend their time writing better algorithms and implementing parallelism instead of worrying about optimization or mapping, which are better off handled by programming tools. The underlying architecture should just provide the underlying compute power, and otherwise “stay out of the way,” Dally said according to the NVIDIA blog.

Dally and his team are investigating the potential for items such as collection-oriented programming methods to make programming supercomputers easier. Exascale-sized HPC systems are possible in the next decade if these limitations are addressed, he said.

Research Roundup: Toward a More Efficient Cloud
Fri, 10 May 2013

In this week's hand-picked assortment, researchers explore the path to more energy-efficient cloud datacenters, investigate new frameworks and runtime environments that are compatible with Windows Azure, and design a unified programming model for diverse data-intensive cloud computing paradigms.

Cloud and datacenter are converging terms, and as "the cloud" grows, so does the datacenter. The move to centralized pools of computing resources has fast-tracked the creation of mega-sized datacenters with the energy appetites to match. It's an issue that's becoming more and more critical as resources are limited and power requirements grow. The subject area will require significant research and development, and computer scientists are stepping up to the plate to identify innovative solutions. One of these researchers is Jing SiYuan from the School of Computer Science at Leshan Normal University in China.

In a new research paper, SiYuan observes that the most common way of saving energy is dynamic on-demand resource provisioning, which shuts off idle servers. But it's also important to maximize resource utilization by minimizing the number of servers that are required at a given time. The paper lays out a method to both minimize energy consumption and optimize VM migration. The proposed solution uses an approximation algorithm based on network flow theory.

The findings show that, compared to existing approaches, the algorithm slightly decreases energy consumption but greatly decreases the number of VM placement changes (by almost 75 percent). By limiting VM migrations and starts/stops, resource overhead is reduced and system performance is improved.
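
The paper's algorithm is based on network flow theory; the short Python sketch below does not reproduce it, but it illustrates the trade-off being optimized: keep VMs where they are when possible (fewer migrations) while packing them onto as few servers as possible (less energy). The greedy policy, names and capacities are all hypothetical.

# Naive greedy illustration of the consolidation trade-off, not the paper's algorithm.
def place(vms, current_placement, server_capacity):
    # vms: {vm: demand}; current_placement: {vm: server}; returns a new {vm: server}.
    load, placement = {}, {}
    # Pass 1: avoid migrations -- keep a VM on its current server if it still fits.
    for vm, demand in sorted(vms.items(), key=lambda kv: -kv[1]):
        server = current_placement.get(vm)
        if server is not None and load.get(server, 0) + demand <= server_capacity:
            placement[vm] = server
            load[server] = load.get(server, 0) + demand
    # Pass 2: migrate the rest onto the fullest server that can still take them,
    # opening a new server only when nothing fits (fewer active servers, less energy).
    for vm, demand in sorted(vms.items(), key=lambda kv: -kv[1]):
        if vm in placement:
            continue
        candidates = [s for s in load if load[s] + demand <= server_capacity]
        server = max(candidates, key=lambda s: load[s]) if candidates else f"server-for-{vm}"
        placement[vm] = server
        load[server] = load.get(server, 0) + demand
    return placement

print(place({"a": 4, "b": 3, "c": 2}, {"a": "s1", "b": "s2", "c": "s2"}, server_capacity=6))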

Georgia State University researcher Dinesh Agarwal is adding to the growing body of information on HPC cloud with his dissertation on the “Scientific High Performance Computing (HPC) Applications On The Azure Cloud Platform.”

The accessibility of cloud computing resources like Amazon Web Services and Windows Azure makes them an attractive option for researchers across a wide range of disciplines. Elasticity, pay-per-use and on-demand provisioning are desirable traits, but when it comes to performance, these offerings do not come with any guarantees. There's also a lack of development tools, which hampers their use for HPC purposes. Insufficient portability is another roadblock.

“Among all clouds,” writes Agarwal, “the emerging Azure cloud from Microsoft in particular remains a challenge for HPC program development both due to lack of its support for traditional parallel programming support such as Message Passing Interface (MPI) and map-reduce and due to its evolving application programming interfaces (APIs).”

In light of this, Agarwal and his team created new frameworks and runtime environments to assist HPC application developers. The idea is to provide developers with tools like the ones from traditional parallel and distributed computing environments, such as MPI, to use for scientific application development on the Azure cloud platform. Agarwal notes that creating an efficient framework for any cloud platform is a challenging problem because the services are offered as black boxes accessible only via application programming interfaces (APIs).

The main components of this PhD thesis are: “(i) creating a generic framework for bag-of-tasks HPC applications to serve as the basic building block for application development on the Azure cloud platform, (ii) creating a set of APIs for HPC application development over the Azure cloud platform, which is similar to message passing interface (MPI) from traditional parallel and distributed setting, and (iii) implementing Crayons using the proposed APIs as the first end-to-end parallel scientific application to parallelize the fundamental GIS operations.”
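
As a rough illustration of the bag-of-tasks pattern such a framework abstracts, here is a minimal in-process sketch in Python: a shared task queue feeding independent workers. In the actual Azure framework the queue and result store would be cloud services rather than the local stand-ins used here, and the "work" is a placeholder.

# Generic bag-of-tasks pattern: a shared task queue drained by independent workers.
import queue
import threading

def worker(tasks, results):
    while True:
        try:
            task_id, payload = tasks.get_nowait()
        except queue.Empty:
            return
        results.append((task_id, payload ** 2))   # stand-in for real work
        tasks.task_done()

tasks, results = queue.Queue(), []
for i in range(100):
    tasks.put((i, i))

threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(len(results), "tasks completed")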

Faced with the need to process large volumes of data, researchers have several computational paradigms to select from: batch processing, iterative, interactive, memory-based, data-flow-oriented, relational, structured, and others. These different techniques are mostly incompatible with each other, but what if there were a unified framework that could support all of these approaches? That's exactly what research duo Maneesh Varshney and Vishwa Goudar from the Computer Science Department of the University of California, Los Angeles, had in mind when they developed Blue.

The researchers lay out their findings in a new technical report, "Blue: A Unified Programming Model for Diverse Data-intensive Cloud Computing Paradigms."

They write: “The motivation for this paper is to ease the development of new cluster applications, by introducing an intermediate layer (Figure 1) between resource management and applications. This layer [serves as] a generic programming model upon which any arbitrary cluster application can be built. Not only will this significantly diminish the cost of developing applications, the users will be able to easily select the computation paradigm that best meets their needs.”

In developing the Blue framework and programming model, the researchers aimed for a solution that was neither too low-level and difficult to implement, nor too high-level and fixed. The paper includes an outline for implementation strategy, and points out the framework’s key strengths (notably efficiency and fault-tolerance for cluster programs) and limitations (while it targets data-intensive computational problems, it is not the best choice for task parallelism).

The Week in HPC Research
Thu, 25 Apr 2013

We've scoured the journals and conference proceedings to bring you the top research stories of the week. This diverse set of items includes advancements in petascale-era development environments; the challenges of energy efficiency in HPC; optimizing computer science instruction; and a possible path to extreme heterogeneity.

A Scalable Development Environment for the Petascale Era

The Juelich Supercomputing Centre (JSC) at Forschungszentrum Juelich GmbH, in Germany, has released the final scientific report detailing its efforts to develop “A scalable Development Environment for Peta-Scale Computing.” The goal of the project was to extend the Parallel Tools Platform (PTP) – an integrated development environment for parallel applications – to meet the needs of current-era petascale systems. PTP covers code analysis, performance tuning, parallel debugging and system monitoring.

The role of the Juelich Supercomputing Centre (JSC) was to provide a scalable system modeling solution for today’s supercomputers. This meant developing a new communication protocol for status data to be exchanged between the target remote system and the client running PTP. Remote support was essential as PTP provides transparent access to multiple remote systems via a unified interface.

The nature of the challenge is described thusly:

“The common requirement for all PTP components is that they have to interact with the remote supercomputer, e.g., applications are built remotely and performance tools are attached to job submissions and their output data resides on the remote system. Status data has to be collected by evaluating outputs of the remote job scheduler and the parallel debugger needs to control an application executed on the supercomputer. The challenge is to provide this functionality for peta-scale systems in real-time.”

The remainder of the paper describes the process by which JSC developed the new monitoring component and successfully integrated it into PTP. The solution is now being used on JSC's BlueGene/Q system JUQUEEN, as well as its general-purpose cluster JUROPA and its GPU cluster JUDGE. It has also been successfully applied to Jaguar, the Cray supercomputer maintained by Oak Ridge National Laboratory (now part of Titan), and to various XSEDE machines, including the Kraken and Keeneland systems at the National Institute for Computational Sciences, the Lonestar and Ranger systems at the Texas Advanced Computing Center, and Argonne National Laboratory's Blue Gene/P and Q.

Another research paper released this week demonstrates novel energy-saving strategies for parallel applications based on characterizing point-to-point communication phases.

“Although high-performance computing traditionally focuses on the efficient execution of large-scale applications, both energy and power have become critical concerns when approaching exascale,” state the four-person research team (from Iowa State University and Old Dominion University, Norfolk, Va.).

[Fig. 2 in the paper shows a state diagram for the runtime procedure that applies energy savings efficiently; transitions are labeled with letters, and a state's transition into itself indicates an ongoing state action.]

“In modern microprocessor architectures, equipped with dynamic voltage and frequency scaling (DVFS) and CPU clock modulation (throttling), the power consumption may be controlled in software. Additionally, network interconnect, such as Infiniband, may be exploited to maximize energy savings while the application performance loss and frequency switching overheads must be carefully balanced. This paper advocates for a runtime assessment of such overheads by means of characterizing point-to-point communications into phases followed by analyzing the time gaps between the communication calls.”

The tests employ NAS parallel benchmark problems and calculations performed by the quantum chemistry software package GAMESS. In the final analysis, the team achieved close to the maximum energy savings, with only a small performance loss of about 2 percent.

Their work appears in the latest edition of the Journal of Parallel and Distributed Computing.
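
A minimal sketch of the kind of runtime decision the paper describes: watch the gaps between point-to-point communication calls and lower the CPU frequency only when a communication-dominated phase is long enough to amortize the switching overhead. The thresholds, frequency levels and decision rule are assumptions for illustration, not the authors' actual policy.

# Toy runtime policy: slow the core down only during long, communication-bound phases.
SWITCH_OVERHEAD_S = 0.0005       # assumed cost of one frequency/throttling transition
COMM_GAP_THRESHOLD_S = 0.002     # gaps shorter than this imply little compute between calls

def choose_frequency(recent_call_gaps, expected_phase_length_s):
    # Pick a CPU frequency level from the observed point-to-point call pattern.
    if not recent_call_gaps:
        return "nominal"
    mean_gap = sum(recent_call_gaps) / len(recent_call_gaps)
    communication_bound = mean_gap < COMM_GAP_THRESHOLD_S
    worth_switching = expected_phase_length_s > 20 * SWITCH_OVERHEAD_S
    # Lower the frequency only when the process is mostly waiting on the network and
    # the phase is long enough to amortize switching down and back up again.
    return "reduced" if communication_bound and worth_switching else "nominal"

print(choose_frequency([0.0004, 0.0006, 0.0005], expected_phase_length_s=0.5))   # reduced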

The power wall is one of the biggest challenges facing the HPC community. While these mega-machines are essential to research and business, they are also big energy consumers. This issue, however, is getting a lot of attention, and optimizing performance-per-watt has become a key goal of the computing industry at all levels.

A team of UK researchers has written about the advances that will be needed over the coming years, observing that achieving a "pervasively energy-efficient" supercomputing architecture will require improvements in multiple fields. They believe that the LOEWE-CSC supercomputer at the University of Frankfurt, Germany, has already made a lot of headway toward meeting these goals. That system, they write, "is setting new standards in environmental compatibility as well as energy and cooling efficiency for high-performance and general-purpose computing."

The team notes that GPUs provide more compute performance per watt than standard processors, while "a balanced hardware configuration ensures that most of the compute power is available to the user when he employs optimized applications." In addition: "clever algorithms enable the user to fully exploit the computational potential and avoids to waste power when the processors idles, which is often a cause of inefficient programming."

The LOEWE-CSC supercomputer achieved 740 MFlops per watt on a Linpack benchmark run, earning it an eighth-place finish on the Green500 list of November 2010. A good result for its time, it has since been surpassed by more energy-efficient systems and has fallen to 109th position on the most recent Green500 list (November 2012).

The work appears in the proceedings of the 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (2013).

Four researchers from across the country have written a paper that is sure to resonate with anyone who’s ever taken or taught a computer science course. In “Evaluating Student Understanding of Core Concepts in Computer Architecture,” the authors begin with the assertion: “Many studies have demonstrated that students tend to learn less than instructors expect in CS1.”

The researchers wondered whether these findings would hold true for subsequent, upper-level computer science courses, and set out to test their hypothesis.

Multiple computer architecture instructors developed basic concept questions for upper-division computer architecture courses. The questions were designed to test students' minimum proficiency levels post-course, and the expectation was that every student would be able to answer them. The tests were used to assess four separate computer architecture courses (taught by four different instructors) at two institutions: a large public university and a small liberal arts college.

The results in the authors' words: "Our results show that students in these courses were indeed not learning as much as the instructors expected, performing poorly overall: the per-question average was only 56%, with many questions showing no statistically significant improvement from pre-course to post-course. While these results follow the trend from CS1 courses, they are still somewhat surprising given that the courses studied were taught using research-based pedagogy that is known to be effective across the CS curriculum."

The paper includes a discussion of the findings as well as recommendations for further study. While this may come as “bad news,” pinpointing the most difficult subject matter will help course instructors refine their lessons (see the Recommendations section for more on this topic).

There is no question these findings are significant; one wonders how surprising they will be to the HPC community or to the larger computer science community.

This paper opens the door for further discourse on this important subject.

The last decade has seen a continuing push toward heterogeneous architectures, but is there a more extreme form of heterogeneity still to come? There is, according to one group of computer scientists. The diverse research team, with affiliations that include Microsoft as well as US, Mexican, European and Asian universities, presented a paper on the subject at the International Symposium on Pervasive Systems, Algorithms and Networks (I-SPAN 2012) in San Marcos, Texas, December 13–15, 2012.

In “Introducing the Extreme Heterogeneous Architecture,” they write:

“The computer industry is moving towards two extremes: extremely high-performance high-throughput cloud computing, and low-power mobile computing. Cloud computing, while providing high performance, is very costly. Google and Microsoft Bing spend billions of dollars each year to maintain their server farms, mainly due to the high power bills. On the other hand, mobile computing is under a very tight energy budget, but yet the end users demand ever increasing performance on these devices.”

Conventional architectures have diverged to meet the needs of multiple user groups. But wouldn't it be ideal if there were a way to deliver high performance and low power consumption at the same time? The authors set out to explore a novel architecture model that addresses both of these extremes, setting the stage for the Extremely Heterogeneous Architecture (EHA) project.

“EHA is a novel architecture that incorporates both general-purpose and specialized cores on the same chip,” the authors explain. “The general-purpose cores take care of generic control and computation. On the other hand, the specialized cores, including GPU, hard accelerators (ASIC accelerators), and soft accelerators (FPGAs), are designed for accelerating frequently used or heavy weight applications. When acceleration is not needed, the specialized cores are turned off to reduce power consumption. We demonstrate that EHA is able to improve performance through acceleration, and at the same time reduce power consumption.”

As a heterogeneous architecture, EHA is capable of accelerating heterogeneous workloads on the same chip. This is useful because it is often the case that datacenters (either in-house or in “the cloud”) provide many services – media streaming, searching, indexing, scientific computations, and so on.

The EHA project has two main goals. The first is to design a chip that is suitable for many different cloud services, thereby greatly reducing both the recurring and non-recurring costs of datacenters or clouds. Second, they plan to implement a lightweight EHA for use in mobile devices, with the aim of optimizing user experience under tight power constraints.

HPC Center Meets Green Mandate
Wed, 24 Apr 2013

For HPC centers all over the world, supporting cutting-edge research by providing users with high levels of compute power is job number one, but nowadays centers are also expected to be as energy-efficient as possible. Such is the case with Netherlands-based HPC center SURFsara. When the center needed to reduce its energy consumption and improve uptime, it turned to Dell Deployment Services and Dell Configuration Services.

SURFsara relies on both government and university funding to provide university groups with the computing power they need to perform ground-breaking research. In addition to this mandate, the center must ensure that government energy-efficiency guidelines are followed.

“We help them process large volumes of data, fast,” says Jaap Dijkshoorn, Group Leader of Cluster Computing, SURFsara, The Netherlands. “[But we also] try to be as green as possible.”

When SURFsara's existing cluster was reaching the end of its lifecycle, the center sought a replacement that emphasized performance per watt. After a European tender process, it found that Dell best met its needs based on performance, power consumption and price.

The new HPC cluster, called Lisa, consists of 624 Dell blade servers, Dell Force10 switches and a Dell PowerVault direct-attached storage array. It is providing researchers with a 50 percent increase in computing power – from 20 teraflops to 30 teraflops – and around-the-clock access to computing with 99 percent uptime.

The new system has also helped SURFsara meet the green IT targets set by the Dutch government. The country is aiming to reduce carbon emissions by 20 percent by 2020 and expects businesses to do their share in making this happen. Thanks to the energy-efficient Dell blades, the amount of electricity that goes into powering SURFsara has dropped by 40 percent, from 250 kilowatts to 150 kilowatts.

Observes Dijkshoorn: “Not only does that mean that as an organisation we’re being as green as possible, but the fact that we’re significantly reducing our electricity bills should also see us make savings that we can reinvest in developing cutting-edge technology services.”

Future Challenges of Large-Scale Computing
Mon, 15 Apr 2013

Now in its 28th year, the International Supercomputing Conference, ISC'13, is fast approaching. On Monday, June 17, Bill Dally, chief scientist at NVIDIA and senior vice president of NVIDIA Research, will deliver the opening keynote, titled "Future Challenges of Large-Scale Computing."

Dally will address the multiple advances that will be necessary in order for the community to achieve the potential of HPC and data analytics going forward. The thrust of his talk will be on the challenges around power, programmability, and scalability, and most notably the role that energy-efficiency will play in determining system performance.

ISC’13 will be held from June 16-20, 2013, at the Congress Center Leipzig (CCL) in Leipzig, Germany.

In this Q&A, Mr. Dally shares his views on where HPC is headed in the context of such important topics as heterogeneous computing, the memory wall, government belt-tightening, and more…

HPCwire: Are different types of workloads, such as big data, HPC and Web 2.0, beginning to demand different types of processors? Will server processors diversify over the next five to ten years, or will they converge?

Bill Dally: HPC, Web servers, and big data all have similar requirements for processors. Within these applications there are program segments that are limited by single-thread performance and other segments that are limited by throughput. To meet this need, there will be a convergence on heterogeneous multicore processors where each "socket" will contain a small number of cores optimized for latency (like today's CPU cores) and many more cores optimized for throughput (like today's GPU cores).

HPCwire: The increase in processor performance seems to be outpacing memory technology. What can be done about the memory wall?

Dally: There are three aspects of memory relevant here: bandwidth, latency and capacity. To address the slow scaling of memory bandwidth we plan to move to memory technologies that involve placing memory dice on the same package as the processor chip and connecting them with very high-bandwidth, low-energy links. This on-package memory technology will enable us to scale memory bandwidth with processor performance holding the Byte/FLOP ratio roughly constant for the next few generations.

Memory latency is remaining roughly constant as processor performance increases. We deal with this by increasing parallelism to hide the latency. With adequate parallelism we can keep the memory pipeline full – using all of the available bandwidth.

Memory capacity is largely a matter of cost. The challenge here is that high-bandwidth memories, like on-package or stacked DRAM, cost significantly more than commodity memory. Thus, for cost-sensitive applications we are likely to see a two-tiered memory system with a moderate capacity, high-bandwidth on-package memory and a high-capacity commodity memory. A non-volatile memory technology like flash or phase-change memory could have a place in such a hierarchy as well.

HPCwire: How important will 3D stacked chip technology be to processors and memory? When do you think we’ll see the first commercial products?

Dally: Placing memory on-package will be critical to scale bandwidth. Stacking technology is important to extend the capacity of this high-bandwidth memory.

Stacked memory is shipping today. However, most of this today uses wire bonds, not through-silicon vias. At the 2013 GPU Technology Conference (GTC) this past March, we announced that we expect to introduce stacked memories with our Volta architecture-based generation of GPUs – in about 2016.

HPCwire: With government austerity measures, it looks as though there could be pressure to reduce investments in exascale computing, especially in the US. How capable is industry of driving these initiatives by itself?

Dally: Industry will continue to move forward on its own on exascale projects; however, progress will be much slower than it would be with government assistance.

It is disappointing that government priorities are such that investment in computing innovation is being scaled back. At the same time, other nations like China are investing heavily in computing. Even the EU with all of its economic problems is moving forward with their exascale program. With reduced investment, the US runs a grave risk of giving up its leadership in computing.

HPCwire: With current technology, it seems as though exascale computing would require so much energy as to render it impractical. Will we see new breakthrough technologies to sufficiently reduce power consumption to make exascale practical and affordable?

Dally: Improving energy efficiency to reach the goal of a sustained exaflops on a real application in 20MW is a significant challenge. However, I am optimistic that we can meet this challenge. There are many emerging circuit, architecture and software technologies that have the potential to dramatically improve the energy efficiency of one or more parts of the system. For example, at NVIDIA we have recently developed a new signaling technology that reduces the energy required by communication by more than an order of magnitude, and we have developed an SRAM technology that permits operation at dramatically lower voltages – and hence lower power. It won’t be a single breakthrough technology that will get us to the exascale energy goal, it will be multiple breakthroughs – at least one in each of the multiple areas that require improvement – processor, communication, memory, etc. We have a number of research projects that are targeted at these different areas. If a sufficient number of these projects have successful outcomes, we will meet the goal.

These improvements, however, depend on research, which in turn will be slowed considerably without government funding.

HPCwire: What do you see as the biggest challenges to reaching exascale?

Dally: Energy efficiency and programmability are the two biggest challenges.

For energy, we will need to improve from where we are with the NVIDIA-Kepler-based Titan machine at Oak Ridge National Laboratory in Tennessee, which is about 2 GFLOPS/watt (500 pJ/FLOP), to 50 GFLOPS/watt (20 pJ/FLOP), a 25x improvement in efficiency, while at the same time increasing scale, which tends to reduce efficiency. Of this 25x improvement we expect to get only a factor of 2x to 4x from improved semiconductor process technology.

As I described before, we are optimistic that we can meet this challenge through a number of research advances in circuits, architecture and software.

Making it easy to program a machine that requires 10 billion threads to use at full capacity is also a challenge. While a backward compatible path will be provided to allow existing MPI codes to run, MPI plus C++ or Fortran is not a productive programming environment for a machine of this scale. We need to move toward higher-level programming models where the programmer describes the algorithm with all available parallelism and locality exposed, and tools automate much of the process of efficiently mapping and tuning the program to a particular target machine.

A number of research projects are underway to develop more productive programming systems – and most importantly the tools that will permit automated mapping and tuning.

Changing a large code base, however, is a very slow process, so we need to start moving on this now. As with energy efficiency, progress will be slowed without government funding.

About Bill Dally

Bill Dally is chief scientist at NVIDIA and senior vice president of NVIDIA Research, the company’s world-class research organization, which is chartered with developing the strategic technologies that will help drive the company’s future growth and success.

Dally first joined NVIDIA in 2009 after spending 12 years at Stanford University, where he was chairman of the computer science department and the Willard R. and Inez Kerr Bell Professor of Engineering. Dally and his Stanford team developed the system architecture, network architecture, signaling, routing and synchronization technology that is found in most large parallel computers today.

Dally was previously at the Massachusetts Institute of Technology from 1986 to 1997, where he and his team built the J-Machine and M-Machine, experimental parallel computer systems that pioneered the separation of mechanism from programming models and demonstrated very low overhead synchronization and communication mechanisms. From 1983 to 1986, he was at the California Institute of Technology (Caltech), where he designed the MOSSIM Simulation Engine and the Torus Routing chip, which pioneered wormhole routing and virtual-channel flow control.

Dally is a cofounder of Velio Communications and Stream Processors. He is a member of the National Academy of Engineering, a Fellow of the American Academy of Arts & Sciences, a Fellow of the IEEE and the ACM. He received the 2010 Eckert-Mauchly Award, considered the highest prize in computer architecture, as well as the 2004 IEEE Computer Society Seymour Cray Computer Engineering Award and the 2000 ACM Maurice Wilkes Award. He has published more than 200 papers, holds more than 75 issued patents and is the author of two textbooks, “Digital Systems Engineering” and “Principles and Practices of Interconnection Networks.”

Dally received a bachelor’s degree in electrical engineering from Virginia Tech, a master’s degree in electrical engineering from Stanford University and a PhD in computer science from Caltech.

The Week in HPC Research

The top research stories of the week have been hand-selected from leading scientific centers, prominent journals and relevant conference proceedings. Here's another diverse set of items, including an evaluation of sparse matrix multiplication performance on the Xeon Phi versus four other architectures; a survey of HPC energy efficiency; performance modeling of OpenMP, MPI and hybrid scientific applications using weak scaling; an exploration of anywhere, anytime cluster monitoring; and a framework for data-intensive cloud storage.

Evaluating Sparse Matrix Multiplication Kernels on Intel Xeon Phi

The Intel Xeon Phi made a big splash at SC12, and computer scientists are eager to put the coprocessor through its paces. Such is the case with a team of researchers from the Ohio State University, who authored a recent paper describing their work evaluating sparse matrix multiplication kernels on the Intel Xeon Phi.

As the team notes, the Phi sports 61 cores, each supporting 4 hardware threads with 512-bit wide SIMD registers for a theoretical peak performance of 1 teraflops double precision.

Their paper is meant to serve as an introduction to the Phi architecture and to analyze its peak performance using sparse matrix multiplication as a test application. It's a good choice for testing the Phi's capabilities because it is representative of many large-scale applications and because it is a difficult problem for coprocessor architectures.

As the team writes: “Many scientific applications involve operations on large sparse matrices such as linear solvers, eigensolver, and graph mining algorithms. The core of most of these applications involves the multiplication of a large, sparse matrix with a dense vector (SpMV).”

They also note that “the irregularity and sparsity of SpMV-like kernels create several problems for these architectures [i.e. accelerators/coprocessors].”

The researchers compared the sparse matrix multiplication performance of the Xeon Phi with four other architectures: two dual Intel Xeon processors, the X5680 (Westmere) and the E5-2670 (Sandy Bridge), as well as two NVIDIA Tesla GPUs, the C2050 and the K20. The results of their experiment show that the Phi offered superior performance.

They write that “although the design of a Xeon Phi core is not much different than those of the cores in modern processors, its large number of cores and hyperthreading capability allow many application to saturate the available memory bandwidth, which is not the case for many cutting-edge processors. Yet, our performance studies show that it is the memory latency not the bandwidth which creates a bottleneck for SpMV on this architecture. Finally, our experiments show that Xeon Phi’s sparse kernel performance is very promising and even better than that of cutting-edge general purpose processors and GPUs.”
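
For reference, the kernel at the center of this evaluation, sparse matrix-vector multiply over a CSR matrix, is shown below in its simplest scalar form. This plain Python version only illustrates the irregular, indirect access pattern that makes SpMV latency-bound; it is not one of the vectorized kernels benchmarked in the paper.

# y = A @ x for a matrix A stored in compressed sparse row (CSR) format.
def spmv_csr(values, col_idx, row_ptr, x):
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]   # indirect, irregular access into x
        y[row] = acc
    return y

# 3x3 example matrix: [[10, 0, 2], [0, 3, 0], [1, 0, 4]]
values  = [10.0, 2.0, 3.0, 1.0, 4.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))   # [12.0, 3.0, 5.0]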

A group of researchers from the Walchand College of Engineering, in the city of Sangli, Maharashtra, India, has published a paper addressing one of the most pressing problems in high-performance computing: energy efficiency.

The team sets out by acknowledging the increased awareness of energy and costs associated with power management for high performance computing. They write that “power control is becoming a key challenge for effectively operating a modern high end computing infrastructures such as server, clusters, data centers and grids,” although the scope of the paper is primarily concerned with cluster systems.

The researchers argue that developing energy-efficient computer designs is the next major goal of high performance computing. The paper presents a survey and classification of energy-efficient techniques for cluster computing, outlining both hardware- and software-related factors and their sub-classes. An important point made in the paper is that performance does not become a secondary objective; rather, power is treated as a constraint on increasing performance.

Texas A&M University computer scientists Xingfu Wu and Valerie Taylor are exploring a performance modeling framework based on memory bandwidth contention time and a parameterized communication model. They have co-authored a paper describing their work with modeling and predicting the performance of OpenMP, MPI and hybrid scientific applications using weak scaling on large-scale multicore supercomputers.

The research team employed the STREAM memory benchmark for initial performance characterization and for validating models of MPI and OpenMP applications. They also used the hybrid large-scale scientific application Gyrokinetic Toroidal Code (GTC), used in magnetic fusion research, to validate the performance model.

The experiments used three different supercomputers: an IBM POWER4, a POWER5+ and a BlueGene/P. Study results showed an error rate of less than 7.77% when predicting the performance of the hybrid MPI/OpenMP GTC on up to 512 cores of these multicore systems.

A trio of computer scientists from Shandong University in Jinan, China, are exploring the feasibility of anywhere, anytime cluster monitoring. More specifically, they are working to design and implement a cluster monitoring system based on Android.

The team starts with the view that high performance computing (HPC) has been democratized to the point that HPC clusters have become an important resource for many scientific fields, including graphics, biology, physics and climate research. Still, depending on local funding realities, the availability of such machines is almost universally constrained. In light of this, monitoring becomes an essential task for the efficient utilization and management of limited resources. However, as the researchers observe, traditional cluster monitoring systems offer poor mobility, which hampers proper management.

The authors are seeking to improve the flexibility of monitoring systems and improve the communication between administrators. They assert that the mobile cluster monitoring system outlined in their paper “will make it possible to monitor the whole cluster anywhere and anytime to allow administrators to manage, diagnose, and troubleshoot cluster issues more accurately and promptly.”

The system they developed is based on the Android platform, the brainchild of Google, and built on the open source monitoring tools Ganglia and Nagios. The design uses a client-server model in which the server probes the data via the monitoring tools and produces a global view of the cluster. The mobile client retrieves the monitoring packages over a socket connection, and the cluster’s status is then displayed in the Android application.
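The real client is an Android app and the real data comes from Ganglia and Nagios; the Python sketch below only illustrates the client-server pattern described above, with dummy node metrics standing in for the monitoring back ends and a hypothetical port number.

```python
import json
import socket
import threading
import time

HOST, PORT = "0.0.0.0", 9900   # hypothetical monitoring port

def collect_cluster_status():
    """Stand-in for querying Ganglia/Nagios; returns a global view of the cluster."""
    return {
        "node01": {"load": 0.42, "mem_free_mb": 30210, "state": "up"},
        "node02": {"load": 3.87, "mem_free_mb": 1024,  "state": "up"},
    }

def serve():
    """Server side: send the current cluster status to each client that connects."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen()
        while True:
            conn, _ = srv.accept()
            with conn:
                conn.sendall(json.dumps(collect_cluster_status()).encode())

def fetch_status(server_host="127.0.0.1"):
    """Client side: what the mobile app does, i.e. pull the monitoring package over a socket."""
    with socket.create_connection((server_host, PORT)) as sock:
        payload = sock.makefile().read()   # server closes the connection after sending
    return json.loads(payload)

if __name__ == "__main__":
    threading.Thread(target=serve, daemon=True).start()
    time.sleep(0.3)                        # give the server a moment to start listening
    print(fetch_status())
```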

UK computer scientists Victor Chang, Robert John Walters and Gary Wills set out to explore the topic of cloud storage and bioinformatics in a private cloud deployment. They’ve written a paper about their experience to serve as a resource for other researchers with data-intensive compute needs who are interested in analyzing the benefits of a cloud model.

Among the many benefits of the cloud model are its cost-savings potential, agility, efficiency, resource consolidation, business opportunities and possible energy savings. Despite this inherent attractiveness, there are still barriers to overcome, and one of these, according to the authors, is the need for a standard or framework to manage both operations and IT services.

They write that “this framework needs to provide the structure necessary to ensure any cloud implementation meets the business needs of industry and academia and include recommendations of best practices which can be adapted for different domains and platforms.”

Their work examines service portability for a private cloud deployment. Storage, backup, data migration and data recovery are all addressed. The paper presents a detailed case study of cloud storage and bioinformatics services developed as part of the Cloud Computing Adoption Framework (CCAF). To illustrate the benefits of CCAF, the authors provide several bioinformatics examples, including tumor modeling, brain imaging, insulin molecules and simulations for medical training. They believe that their proposed solution offers cost reduction, time savings and user friendliness.

Automaker BMW is getting ready to deploy an HPC cluster to run simulations for designing its next-generation ultimate driving machines. As with any supercomputing installation, this one is bound to consume plenty of energy, which translates into high operational expenses. So the car company decided to search for an efficient and environmentally friendly plan to manage its system, and settled on locating the machine at Verne Global’s Ásbrú datacenter in Iceland.

The country has become an interesting option for datacenter users because of its perpetually cool climate and cheap energy. Electricity in the island nation costs roughly 4.3 cents per kilowatt-hour, thanks to an abundance of renewable energy sources; the country generates most of its electricity from glacier-fed rivers and geothermal vents. Given these resources, it’s no surprise that Verne Global decided to set up its large-scale computing facility at an abandoned NATO air force base in Keflavík.
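For a rough sense of scale, the calculation below compares the annual electricity bill for a continuously running 1 MW load at Iceland's roughly 4.3 cents per kilowatt-hour against an assumed 10 cents per kilowatt-hour elsewhere; both the load size and the comparison rate are illustrative assumptions, not figures from BMW or Verne Global.

```python
# Rough annual electricity bill for a 1 MW load running around the clock
hours_per_year = 24 * 365            # 8,760 h
iceland_rate   = 0.043               # USD per kWh (figure cited above)
other_rate     = 0.10                # assumed comparison rate, not from the article

load_kw = 1_000                      # 1 MW, an illustrative load
iceland_cost = load_kw * hours_per_year * iceland_rate
other_cost   = load_kw * hours_per_year * other_rate
print(f"Iceland: ${iceland_cost:,.0f}/yr   elsewhere: ${other_cost:,.0f}/yr   "
      f"difference: ${other_cost - iceland_cost:,.0f}/yr")
# Iceland: $376,680/yr   elsewhere: $876,000/yr   difference: $499,320/yr
```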

Data Center Knowledge reported that Mario Mueller, BMW’s vice president of IT infrastructure and chair of the Open Data Center Alliance (ODCA), brought up the company’s plans at this year’s Intel Developer Forum. The car company will be Verne Global’s fifth customer, after CCP Games, Datapipe, Opin Kerfi and GreenQloud, and will follow ODCA usage models to guide the cluster’s build.

This is certainly not the first time a company or organization has considered alternative approaches to powering and cooling a large computing installation. Apple is using solar panels and methane gas from a local landfill to generate electricity for its iCloud datacenter, the Texas Advanced Computing Center (TACC) deployed a top 10 cluster in an oil-submersion cooling system, and Facebook built one of the world’s most efficient datacenters in Prineville, Oregon, using designs from the Open Compute Project.

HP, Intel Score Petaflop Supercomputer at DOE Lab

The US Department of Energy’s National Renewable Energy Laboratory (NREL) has ordered a $10 million HP supercomputer equipped with the latest Intel Xeon CPUs and Xeon Phi coprocessors. When completed in 2013, the system will deliver one petaflop of performance and will take up residence in one of the most energy-efficient datacenters in the world.

The supercomputer will be built in phases, with the initial rack of servers scheduled for deployment this November. The first phase will use HP’s ProLiant SL230s and SL250s servers, equipped with the current “Sandy Bridge” Xeons, specifically the new E5-2670 CPUs (8-core, 2.6 GHz, 115W). At least some of the SL250s boxes will also host the upcoming “Knights Corner” coprocessor, the first commercial chip in Intel’s new manycore Xeon Phi line, which is due out before the end of 2012.

The second phase of the HP system will incorporate next year’s “Ivy Bridge” Xeons, built on Intel’s latest 22nm technology. When completed in the summer of 2013, the HP cluster will house about 600 Xeon Phi coprocessors and 3,200 Xeons. Although that’s not a particularly high ratio of accelerators to CPUs, it’s likely that the vector-heavy Xeon Phi silicon will deliver more than half of the total flops for the machine.
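A rough back-of-the-envelope check of that claim, assuming 8 double-precision flops per core per cycle for the Sandy Bridge parts and roughly one teraflop per Xeon Phi card (neither figure appears in the article):

```python
# Rough peak-flops breakdown for the completed system (assumed per-part numbers)
xeon_count      = 3_200
cores_per_xeon  = 8
clock_ghz       = 2.6
dp_flops_cycle  = 8          # assumed AVX double-precision flops per core per cycle

phi_count       = 600
phi_tflops_each = 1.0        # assumed ~1 TF double precision per Xeon Phi card

xeon_tflops = xeon_count * cores_per_xeon * clock_ghz * dp_flops_cycle / 1_000
phi_tflops  = phi_count * phi_tflops_each
total = xeon_tflops + phi_tflops
print(f"Xeons: {xeon_tflops:.0f} TF, Xeon Phis: {phi_tflops:.0f} TF, "
      f"total ~{total/1000:.2f} PF ({phi_tflops/total:.0%} from the coprocessors)")
# Xeons: 532 TF, Xeon Phis: 600 TF, total ~1.13 PF (53% from the coprocessors)
```

Under those assumptions the system lands at roughly the advertised petaflop, with the coprocessors supplying just over half of it.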

While petascale computers are still relatively rare, the more important theme here is energy efficiency. Both the computer and the NREL datacenter (known as the Energy Systems Integration Facility) were designed to minimize power usage. At a cost of $135 million, the new facility, which includes labs and office space, is built to take advantage of the latest warm-water-cooled servers. A big advantage of this technology is that it requires only evaporative coolers for the plumbing. No chillers or mechanical cooling apparatus are needed, reducing power requirements significantly.

According to Steve Hammond, NREL’s Computational Science director, that will make it the most energy-efficient HPC facility in the world when it’s commissioned at the end of September. “We’ve taken a chips-to-bricks approach to datacenters,” Hammond told HPCwire. “We’re managing both the bytes and the BTUs.”

Since a megawatt of electricity costs around a million dollars a year in the US — and even more in Japan and most of Europe — significant savings can be achieved if these facilities can pare down their power consumption. The NREL facility was designed with that goal in mind and is targeting a PUE (power usage effectiveness) of 1.06. So for every unit of power delivered to the computing equipment, only another 0.06 units will be needed for cooling, power supply losses, and other overhead.

For a large datacenter, that’s nearly unprecedented. According to a 2009 EPA study, the average datacenter was running at a PUE of 1.91; in those facilities, every watt of power consumed for computing required nearly an additional watt for cooling or was otherwise lost in transmission. As a result, more and more centers are turning to warm-water cooling.
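To put those two PUE figures in dollar terms, the quick calculation below applies the article's rough rate of a million dollars per megawatt-year to an assumed 1 MW IT load; the load is illustrative, not an NREL figure.

```python
# Annual facility cost for a 1 MW IT load at two PUE values,
# using the article's rough figure of ~$1M per megawatt-year.
cost_per_mw_year = 1_000_000   # USD
it_load_mw = 1.0               # assumed IT load for illustration

for pue in (1.91, 1.06):
    total_mw = it_load_mw * pue
    print(f"PUE {pue}: {total_mw:.2f} MW total -> ${total_mw * cost_per_mw_year:,.0f}/yr")
# PUE 1.91: 1.91 MW total -> $1,910,000/yr
# PUE 1.06: 1.06 MW total -> $1,060,000/yr   (roughly $850k/yr less overhead)
```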

Warm is the keyword here. Intake water for the computing equipment is around room temperature — 75F or thereabouts. Water exiting the servers is approximately 95F and, at NREL, will be recycled to heat the facility. Hammond says that in the future they plan to export the server-warmed water to other buildings on the rest of the campus.

NREL will not only save a nice chunk of change as a result of the energy savings, but the project will also be a showcase for PUE-minimizing design. The power-saving theme also dovetails nicely with the DOE lab’s mission, namely to support research in renewable energy and new energy sources. The HP super will be used to run computer simulations for developing clean energy, advanced solar photovoltaics, wind energy systems, electric vehicles, and renewable fuels.

With regard to the power profile of the petaflop system, HP plans to deliver a full peak petaflop with just a single megawatt. Although that’s not in the same league as an IBM Blue Gene/Q (which delivers well over 2 peak petaflops per megawatt), it’s on par with the most efficient GPU-accelerated supercomputers deployed today.

That’s due in no small part to the Xeon Phi coprocessor, which will contribute significantly to the system’s overall energy efficiency. Although Intel has not made public the wattage and performance of the initial Knights Corner chips, they are expected to be competitive with the latest GPUs, in other words, well over a teraflop of double-precision number-crunching in under 300 watts.

To uphold the PUE rating of the NREL facility, the HP servers will be primarily warm-water cooled. Not only will that save energy, but it’s also the most practical approach for a petaflop supercomputer that, in this case, is being squeezed into just 1,000 square feet of floor space. The datacenter itself is 10 times that size, but this will give NREL plenty of room for disk and tape storage, not to mention additional HPC systems down the road.

In fact, Hammond says they plan to use the new datacenter for the next two decades, which should take them well into the exascale era. Since the facility can only tap 10MW, NREL will have to wait until those exa-systems fit into that power envelope. The first exaflop supercomputers are expected to draw at least 20MW when they first appear toward the end of this decade.

For now though, the DOE adds yet another petascale supercomputer to its growing roster of elite machines. At $10 million per petaflop, even smaller labs like NREL can now tap into computing power that was unheard of just five years ago. In 2008, the first petaflop supercomputer, Roadrunner, cost more than 10 times as much as this HP machine and took up six times as much datacenter real estate. In a few more years, these petaflop systems should be cheap enough and compact enough to be acquired by commercial users. And if these energy-saving technologies continue to be refined, such systems should be relatively inexpensive to run as well.