TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without logging in. For other talks, you will need to log in using the email address you registered with for IPDPS 2011.
Note 2: Many of the talks (those without a thumbnail next to their description below) are yet to be uploaded. Some were not recorded because of technical problems. We are working with the corresponding authors to upload self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

IPDPS 2011 Keynotes and Panels

The talk describes algorithm engineering (AE) as a methodology for algorithmic research in which design, analysis, implementation, and experimental evaluation of algorithms form a feedback cycle driving the development of efficient algorithms. Additional important components of the methodology include realistic models, algorithm libraries, and collections of realistic benchmark instances. One main example is used throughout the talk: sorting huge data sets using many multi-core processors and disks. The described system broke records for the GraySort and MinuteSort sorting benchmarks and helped set the record for the JouleSort benchmark.
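The two-phase structure underlying such sorters (sort runs concurrently, then merge them) can be sketched in miniature as follows. This is only an illustration of the pattern, not the record-setting system itself; threads are used for brevity where a real sorter would use many cores and disks.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_external_sort(data, workers=4):
    # Toy two-phase sort in the spirit of external/parallel mergesort:
    # 1) split the input into runs and sort each run concurrently,
    # 2) k-way merge the sorted runs with a heap.
    n = max(1, -(-len(data) // workers))          # ceil division: run length
    runs = [data[i:i + n] for i in range(0, len(data), n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_runs = list(pool.map(sorted, runs))
    return list(heapq.merge(*sorted_runs))
```

In the real systems discussed in the talk, phase 1 runs on separate cores (or machines) against disk-resident runs, and phase 2 streams the merge so the data never has to fit in memory at once.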

The 25th year of IPDPS gives us the opportunity to look back and attempt to assess what has gone wrong, what has gone well, and what came as a surprise in the field of parallel and distributed processing. The panel members will give a few examples of striking events that took place in their areas (covering Algorithms, Applications, Architectures, and Software). They will also give a short statement on how they would summarize the evolution of the field as a whole over the last 25 years.
Panelists:
William (Bill) Dally, Stanford & NVIDIA
Jack Dongarra, University of Tennessee & Oak Ridge National Laboratory
Satoshi Matsuoka, Tokyo Institute of Technology, Japan
Rob Schreiber, HP Labs, Palo Alto
Horst Simon, NERSC, Lawrence Berkeley National Laboratory
Uzi Vishkin, University of Maryland

In this talk we examine how high performance computing has changed over the last 10 years and look toward the future in terms of trends. These changes have had, and will continue to have, a major impact on our software. Some of the software and algorithm challenges have already been encountered, such as management of communication and memory hierarchies through a combination of compile-time and run-time techniques, but the increased scale of computation, depth of memory hierarchies, range of latencies, and increased run-time environment variability will make these problems much harder. We will look at five areas of research that will have an important impact on the development of software and algorithms. We will focus on the following themes:
Redesign of software to fit multicore and hybrid architectures,
Automatically tuned application software,
Exploiting mixed precision for performance,
The importance of fault tolerance,
Communication avoiding algorithms.
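As a small illustration of the mixed-precision theme above, the classic iterative-refinement pattern performs the expensive solve in low precision and only the cheap residual computation in high precision. The sketch below simulates single precision with the stdlib `struct` module; the `f32` helper and the 2x2 Cramer solve are illustrative choices made here, not material from the talk.

```python
import struct

def f32(x):
    # Round a Python float (binary64) to binary32 precision.
    return struct.unpack('f', struct.pack('f', x))[0]

def solve2x2_f32(a, b):
    # "Fast" solve done entirely in simulated single precision (Cramer's rule).
    det = f32(f32(a[0][0] * a[1][1]) - f32(a[0][1] * a[1][0]))
    x0 = f32(f32(f32(b[0] * a[1][1]) - f32(b[1] * a[0][1])) / det)
    x1 = f32(f32(f32(a[0][0] * b[1]) - f32(a[1][0] * b[0])) / det)
    return [x0, x1]

def refine(a, b, iters=3):
    # Mixed-precision iterative refinement: low-precision solves,
    # double-precision residuals, until the answer reaches full accuracy.
    x = solve2x2_f32(a, b)
    for _ in range(iters):
        # Residual r = b - A x computed in double precision.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(2)) for i in range(2)]
        d = solve2x2_f32(a, r)          # correction solved in single precision
        x = [x[i] + d[i] for i in range(2)]
    return x
```

The payoff on real hardware is that single-precision arithmetic is typically much faster (and moves half the data), while the refinement loop recovers double-precision accuracy for well-conditioned systems.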

Parallel computing has become ubiquitous and relates to challenging computational problems in science via business-driven computing to mobile computing. The scope has widened dramatically over the last decade. This panel will debate and speculate on how the parallel computing landscape is expected to change in the years to come. Areas of focus will include:
Computing platforms: How will we be able to maintain the performance growth of the past, and what will be the major challenges in the next 10 years and beyond? What technical barriers are anticipated, and what disruptive technologies are around the corner?
Software: How will software infrastructures evolve to meet performance requirements in the next 10 years and beyond? How will we ever be able to hide the obstacles of parallelism from the masses, and what is the road forward towards that?
Algorithms: What will be the major computational problems to tackle in the next 10 years and beyond? What are the most challenging algorithmic problems to solve?
Applications: What will be the next wave of grand challenge problems to focus on in the next 10 years and beyond? What will be the major performance driving applications in the general and mobile computing domains?
Panelists:
Doug Burger, Microsoft Research;
Wen-mei Hwu, University of Illinois, Urbana-Champaign;
Vipin Kumar, University of Minnesota;
Kunle Olukotun, Stanford University;
David Padua, University of Illinois, Urbana-Champaign;
Burton Smith, Microsoft.

Reaching an ExaScale computer by the end of the decade, and enabling the continued performance scaling of smaller systems, requires significant research breakthroughs in three key areas: power efficiency, programmability, and execution granularity. To build an ExaScale machine within a power budget of 20MW requires a 200-fold improvement in energy per instruction: from 2nJ to 10pJ. Only a 4x improvement is expected from improved technology; the remaining 50x must come from improvements in architecture and circuits. Programming a machine of this scale requires more productive parallel programming environments that make parallel programming as easy as sequential programming is today. Finally, problem size and memory size constraints prevent the continued use of weak scaling, requiring these machines to extract parallelism at very fine granularity, down to the level of a few instructions. This talk will discuss these challenges and current approaches to addressing them.
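The energy arithmetic in the abstract can be checked back-of-the-envelope. The nominal rate of one exa-instruction per second is an assumption made here for the calculation, and the gap between the 20 pJ the budget allows and the 10 pJ target presumably leaves headroom for memory, interconnect, and cooling.

```python
# Back-of-the-envelope check of the numbers in the abstract.
power_budget = 20e6          # 20 MW machine power budget (from the abstract)
rate = 1e18                  # 1 exa-instruction/s (assumed nominal rate)

energy_per_inst = power_budget / rate   # joules available per instruction
# 20e6 W / 1e18 inst/s = 2e-11 J = 20 pJ per instruction at the budget.

current = 2e-9               # today's ~2 nJ per instruction (from the abstract)
target = 10e-12              # the abstract's 10 pJ goal

improvement = current / target          # the cited 200-fold improvement
technology = 4                          # expected from process technology alone
architecture = improvement / technology # remainder from architecture/circuits
```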

SESSION: Best Papers

In this paper, we present a runtime compilation and tuning framework for parallel programs. We extend our prior work on our auto-tuner, Active Harmony, to tunable parameters that require code generation (for example, different unroll factors). For such parameters, our auto-tuner generates and compiles new code on-the-fly. Effectively, we merge traditional feedback-directed optimization and just-in-time compilation. We show that our system can leverage available parallelism in today's HPC platforms by evaluating different code variants on different nodes simultaneously. We evaluate our system on two parallel applications and show that it can improve runtime execution by up to 46% compared to the original version of the program.
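The core feedback loop (generate a code variant for a parameter value, compile it at run time, time it, keep the best) can be sketched in miniature. This toy uses Python's `exec` for on-the-fly code generation and times unroll-factor variants sequentially; it is not Active Harmony's actual mechanism, which evaluates variants on different nodes in parallel.

```python
import timeit

def make_variant(unroll):
    # Generate source for a summation loop with the body unrolled `unroll`
    # times, then compile it at runtime: toy on-the-fly code generation.
    body = "\n".join(f"        s += a[i + {k}]" for k in range(unroll))
    src = (
        "def summed(a):\n"
        "    s = 0\n"
        f"    for i in range(0, len(a) - len(a) % {unroll}, {unroll}):\n"
        f"{body}\n"
        f"    for i in range(len(a) - len(a) % {unroll}, len(a)):\n"
        "        s += a[i]\n"           # remainder loop for the tail
        "    return s\n"
    )
    ns = {}
    exec(src, ns)
    return ns["summed"]

def tune(a, factors=(1, 2, 4, 8)):
    # Feedback loop: measure each generated variant, keep the fastest.
    best, best_t = None, float("inf")
    for u in factors:
        f = make_variant(u)
        t = timeit.timeit(lambda: f(a), number=20)
        if t < best_t:
            best, best_t = f, t
    return best
```

Every variant computes the same result; only its running time differs, which is what makes it safe for the tuner to swap variants in and out during execution.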

Synchronization is of paramount importance to exploiting thread-level parallelism on many-core CMPs. In these architectures, synchronization mechanisms usually rely on shared variables to coordinate multithreaded access to shared data structures, thus avoiding data dependency conflicts. Lock synchronization is known to be a key limitation to performance and scalability. On the one hand, lock acquisition through busy waiting on shared variables generates additional coherence activity which interferes with applications. On the other hand, lock contention causes serialization, which results in performance degradation. This paper proposes and evaluates GLocks, a hardware-supported implementation for highly-contended locks in the context of many-core CMPs. GLocks use a token-based message-passing protocol over a dedicated network built on state-of-the-art technology. This approach skips the memory hierarchy to provide a non-intrusive, extremely efficient, and fair lock implementation with negligible impact on energy consumption or die area.
A comprehensive comparison against the most efficient shared-memory-based lock implementation for a set of microbenchmarks and real applications quantifies the benefits of GLocks. Performance results show an average reduction of 42% and 14% in execution time, an average reduction of 76% and 23% in network traffic, and an average reduction of 78% and 28% in the energy-delay^2 product (ED^2P) metric for the full CMP for the microbenchmarks and the real applications, respectively. In light of our performance results, we conclude that GLocks satisfy our initial working hypothesis. GLocks minimize cache-coherence network traffic due to lock synchronization, which translates into reduced power consumption and execution time.
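A software model of the token-based message-passing idea can clarify why it avoids coherence traffic: requesters send a message and block until the single token arrives, instead of spinning on a shared variable. GLocks implement this in hardware over a dedicated network; the arbiter thread and queues below are purely illustrative.

```python
import threading
import queue

class TokenLock:
    # Software model of a token-based message-passing lock. Requesters
    # enqueue a request message; a single token is handed to one waiter
    # at a time. FIFO request order makes the lock fair, and no thread
    # ever busy-waits on a shared variable.
    def __init__(self):
        self.requests = queue.Queue()     # request "network channel"
        self.token = queue.Queue()
        self.token.put(None)              # the single token starts free
        threading.Thread(target=self._arbiter, daemon=True).start()

    def _arbiter(self):
        while True:
            grant = self.requests.get()   # next requester (FIFO => fair)
            self.token.get()              # wait for the token to come back
            grant.put(None)               # pass the token to the requester

    def acquire(self):
        grant = queue.Queue()
        self.requests.put(grant)          # send a request message
        grant.get()                       # block until the token arrives

    def release(self):
        self.token.put(None)              # send the token back
```

Mutual exclusion holds because only the current holder can return the token, so at most one `acquire` is granted at a time.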

Recent advances in neuroscientific understanding make parallel computing devices modeled after the human neocortex a plausible, attractive, fault-tolerant, and energy-efficient possibility. Such attributes have once again sparked an interest in creating learning algorithms that aspire to reverse-engineer many of the abilities of the brain.
In this paper we describe a GPGPU-accelerated extension to an intelligent learning model inspired by the structural and functional properties of the mammalian neocortex. Our cortical network, like the brain, exhibits massive amounts of processing parallelism, making today's GPGPUs a highly attractive and readily-available hardware accelerator for such a model.
Furthermore, we consider two inefficiencies inherent to our initial design: multiple kernel-launch overhead and poor utilization of GPGPU resources. We propose optimizations such as a software work-queue structure and pipelining the hierarchical layers of the cortical network to mitigate such problems. Our analysis provides important insight into GPU architecture details, including the number of cores, the memory system, and the global thread scheduler. Additionally, we create a runtime profiling tool for our parallel learning algorithm which proportionally distributes the cortical network across the host CPU as well as multiple GPUs, whether homogeneous or heterogeneous, that may be available to the system. Using the profiling tool with these optimizations on Nvidia's CUDA framework, we achieve up to 60x speedup over a single-threaded CPU implementation of the model.
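The proportional-distribution idea behind such a profiling tool can be sketched as a simple partitioning step: measure each device's throughput, then assign each device a share of the work in proportion. The device names and throughput numbers below are hypothetical, and the real tool partitions a cortical network rather than a flat list.

```python
def partition(work_items, throughputs):
    # Split work across devices in proportion to profiled throughput
    # (a simplified stand-in for a runtime profiling/distribution step).
    total = sum(throughputs.values())
    shares, start = {}, 0
    devices = list(throughputs)
    for i, dev in enumerate(devices):
        if i == len(devices) - 1:
            end = len(work_items)          # last device takes the remainder
        else:
            end = start + round(len(work_items) * throughputs[dev] / total)
        shares[dev] = work_items[start:end]
        start = end
    return shares
```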

We present a novel algorithm to solve the nonnegative single-source shortest path problem on road networks and other graphs with low highway dimension. After a quick preprocessing phase, we can compute all distances from a given source in the graph with essentially a linear sweep over all vertices. Because this sweep is independent of the source, we are able to reorder vertices in advance to exploit locality. Moreover, our algorithm takes advantage of features of modern CPU architectures, such as SSE and multi-core. Compared to Dijkstra's algorithm, our method needs fewer operations, has better locality, and is better able to exploit parallelism at the multi-core and instruction levels. We gain additional speedup when implementing our algorithm on a GPU, where it is up to three orders of magnitude faster than Dijkstra's algorithm on a high-end CPU. This makes applications based on all-pairs shortest paths practical for continental-sized road networks. Several algorithms, such as computing the graph diameter, exact arc flags, or centrality measures (exact reaches or betweenness), can be greatly accelerated by our method.
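The source-independent linear sweep at the heart of the method can be illustrated on a toy graph: if the vertex order guarantees that every arc goes from an earlier to a later vertex (which the preprocessing arranges), a single pass over each vertex's incoming arcs yields exact distances. This is a simplified sketch of that one idea, not the authors' full algorithm.

```python
INF = float("inf")

def sweep_distances(n, incoming, dist):
    # One linear, source-independent pass over the vertices. `incoming[v]`
    # lists arcs (u, w) into v, with every u earlier than v in the order,
    # so dist[u] is already final when v is scanned. The scan order never
    # depends on the source, which is what enables reordering for locality
    # and SIMD/multi-core execution in the real algorithm.
    for v in range(n):
        for (u, w) in incoming[v]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist
```

Unlike Dijkstra's algorithm, there is no priority queue and no data-dependent scan order, so consecutive vertices touch consecutive memory and independent vertices can be relaxed in parallel.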

Special NSF-SEES Presentation

SEES (Science, Engineering and Education for Sustainability) describes an investment area established by NSF in 2010 to enable discoveries needed to inform actions that lead to environmental, energy, and societal sustainability. SEES is expected to be a multi-year effort that will address challenges in climate and energy research and education using a systems-based approach to understanding, predicting, and reacting to change in the linked natural, social, and built environment. In FY 2011, NSF plans to encourage interdisciplinary research and education on energy sustainability, with a particular emphasis on the socioeconomic and environmental implications. Because the SEES initiative is very much an interdisciplinary program, the purpose of this presentation is to give the IPDPS community a briefing on existing and upcoming funding opportunities and how they can start to build communities to take advantage of these opportunities.