Performance measurement of parallel, object-oriented (OO) programs requires the development of instrumentation and analysis techniques beyond those used for more traditional languages. Performance events must be redefined for the conceptual OO programming model, and those events must be instrumented and tracked in the context of OO language abstractions, compilation methods, and runtime execution dynamics. In this paper, we focus on the profiling and tracing of C++ applications that have been written using a rich parallel programming framework for high-performance, scientific computing. We address issues of class-based profiling, instrumentation of templates, runtime function identification, and polymorphic (type-based) profiling. Our solutions are implemented in the TAU portable profiling package, which also provides support for profiling groups and user-level timers. We demonstrate TAU's C++ profiling capabilities for real parallel applications, built from components of the ACTS toolkit.

The goal of producing architecture-independent parallel programs is complicated by the competing need for high performance. The ZPL programming language achieves both goals by building upon an abstract parallel machine and by providing programming constructs that allow the programmer to "see" this underlying machine. This paper describes ZPL and provides a comprehensive evaluation of the language with respect to its goals of performance, portability, and programming convenience. In particular, we describe ZPL's machine-independent performance model, describe the programming benefits of ZPL's region-based constructs, summarize the compilation benefits of the language's high-level semantics, and summarize empirical evidence that ZPL has achieved both high performance and portability on diverse machines such as the IBM SP-2, Cray T3E, and SGI Power Challenge. Index Terms: portable, efficient, parallel programming language. This research was supported by DARPA Grant F30602-97-1-0152, a grant of HPC time from the Arctic Region Supercomputing Center, NSF Grant CCR-9707056, and ONR grant N00014-99-1-0402.

With the increasing popularity of parallel programming environments such as PC clusters, more and more sequential programmers, with little knowledge about parallel architectures and parallel programming, are hoping to write parallel programs. Numerous attempts have been made to develop high-level parallel programming libraries that use abstraction to hide low-level concerns and reduce difficulties in parallel programming. Among them, libraries of parallel skeletons have emerged as a promising way in this direction. Unfortunately, these libraries are not well accepted by sequential programmers, because of incomplete elimination of lower-level details, ad-hoc selection of library functions, unsatisfactory performance, or lack of convincing application examples. This paper addresses principles of designing skeleton libraries for parallel programming and reports implementation details and practical applications of a skeleton library, SkeTo. The SkeTo library is unique in that it has a solid theoretical foundation based on the theory of Constructive Algorithmics, and it is practical enough to describe various parallel computations in a sequential manner.

...gh Skil is an epoch-making system in the research of skeletal parallel programming, it is now somewhat obsolete because most of the enhanced features can be easily achieved by the C++ language. HPC++ [38] is another system that introduces extensions (compiler directives) into the base language. HPC++ is a C++ library, developed from the viewpoint of parallelization of the standard template library. Al...

Over the past two decades tremendous progress has been made in both the design of parallel architectures and the compilers needed for exploiting parallelism on such architectures. In this paper we summarize the advances in compilation techniques for uncovering and effectively exploiting parallelism at various levels of granularity. We begin by describing the program analysis techniques through which parallelism is detected and expressed in the form of a program representation. Next, compilation techniques for scheduling instruction-level parallelism are discussed, along with the relationship between the nature of compiler support and the type of processor architecture. Compilation techniques for exploiting loop- and task-level parallelism on shared memory multiprocessors are summarized. Locality optimizations that must be used in conjunction with parallelization techniques for achieving high performance on machines with complex memory hierarchies are also discussed. Finally we provide an...

...ese libraries support parallel iteration over both regular and irregular collections of data. The Standard Template Adaptive Parallel Library (STAPL) and the Parallel Standard Template Library (PSTL) [62] are parallel implementations of much of the C++ Standard Template Library (STL). An et al. [4] present STAPL as well as a thorough comparison of many parallel data structure libraries. Other generic ...

With the advent of multi-core processors, desktop application developers must finally face parallel computing and its challenges. A large portion of the computational load in a program rests within iterative computations. In object-oriented languages these are commonly handled using iterators, which are inadequate for parallel programming. This paper presents a powerful Parallel Iterator concept to be used in object-oriented programs for the parallel traversal of a collection of elements. The Parallel Iterator may be used with any collection type (even those inherently sequential) and it supports several scheduling schemes which may even be decided dynamically at run-time. Some additional features are provided to allow early termination of parallel loops, exception handling and a solution for performing reductions. With a slight contract modification, the Parallel Iterator interface imitates that of the Java-style sequential iterator. All these features combine together to promote minimal, if any, code restructuring. Along with the ease of use, the results reveal negligible overhead and the expected inherent speedup.

A methodology for the design and development of data-parallel applications and components is presented. Data-parallelism is a well understood form of parallel computation, yet developing simple applications can involve substantial effort to express the problem in low-level notations. We describe a process of software development for data-parallel applications starting from high-level specifications, generating repeated refinements of designs to match different architectural models and performance constraints, enabling a development activity with cost-benefit analysis. Primary issues are algorithm choice, correctness and efficiency, followed by data decomposition, load balancing and message-passing coordination. Development of a data-parallel multitarget tracking application is used as a case study, showing the progression from high- to low-level refinements. We conclude by describing tool support for the process.

Abstract. We present the design and implementation of the stapl pList, a parallel container that has the properties of a sequential list, but allows for scalable concurrent access when used in a parallel program. The Standard Template Adaptive Parallel Library (stapl) is a parallel programming library that extends C++ with support for parallelism. stapl provides a collection of distributed data structures (pContainers) and parallel algorithms (pAlgorithms) and a generic methodology for extending them to provide customized functionality. stapl pContainers are thread-safe, concurrent objects, providing appropriate interfaces (e.g., views) that can be used by generic pAlgorithms. The pList provides stl equivalent methods, such as insert, erase, and splice, additional methods such as split, and efficient asynchronous (non-blocking) variants of some methods for improved parallel performance. We evaluate the performance of the stapl pList on an IBM Power 5 cluster and on a CRAY XT4 massively parallel processing system. Although lists are generally not considered good data structures for parallel processing, we show that pList methods and pAlgorithms (p_generate and p_partial_sum) operating on pLists provide good scalability on more than 10^3 processors and that pList compares favorably with other dynamic data structures such as the pVector.

...uitable for parallel programming, more dynamic data structures that allow insertion and deletion of elements have not received as much attention. The PSTL (Parallel Standard Template Library) project [14] explored the same underlying philosophy as stapl of extending the C++ stl for parallel programming. They planned to provide a distributed list, but the project is no longer active. Intel Threading Bu...

We describe ROSE, a C++ infrastructure for source-to-source translation that provides an interface for programmers to easily write their own translators for optimizing user-defined high-level abstractions.

Several large real-world applications have been developed for distributed and parallel architectures. We examine two different program development approaches. First, the use of a high-level programming paradigm, which dramatically reduces the time to create a parallel program, though sometimes at the cost of reduced performance; a source-to-source compiler has been employed to automatically compile programs written in a high-level programming paradigm into message passing codes. Second, manual program development using a low-level programming paradigm, such as message passing, which enables the programmer to fully exploit a given architecture at the cost of a time-consuming and error-prone effort. Performance tools play a central role in supporting the performance-oriented development of applications for distributed and parallel architectures. Scala -- a portable instrumentation, measurement, and post-execution performance analysis system for distributed and parallel programs -- h...

...compiling, executing, and performance analysis. Many different programming paradigms such as explicit message passing [20], High Performance Fortran (HPF) [26], OpenMP [11], Java RMI [44], and HPC++ [30] have been introduced for distributed and parallel architectures. A trade-off is implied by the programming paradigm employed. On the one hand, programming at a low level (i.e., message passing paradi...