
The scan primitives are powerful, general-purpose data-parallel primitives that are building blocks for a broad range of applications. We describe GPU implementations of these primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA API. Using the scan primitives, we show novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyze the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver.
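The segmented scan mentioned above restarts an ordinary scan at segment boundaries. A minimal sequential sketch of its semantics (illustration only, not the paper's parallel CUDA formulation; the flag convention that `True` marks a segment start is an assumption):

```python
def segmented_scan(values, flags):
    """Inclusive sum-scan that restarts at every index where flags[i] is True."""
    out = []
    running = 0
    for v, f in zip(values, flags):
        running = v if f else running + v  # restart sum at segment heads
        out.append(running)
    return out

print(segmented_scan([1, 2, 3, 4, 5, 6],
                     [True, False, False, True, False, False]))
# -> [1, 3, 6, 4, 9, 15]
```

The two segments `[1, 2, 3]` and `[4, 5, 6]` each receive an independent inclusive prefix sum, which is what lets a single scan pass process many quicksort partitions or sparse-matrix rows at once.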

A variety of models have been proposed for the study of synchronous parallel computation. These models are reviewed and some prototype problems are studied further. Two classes of models are recognized, fixed connection networks and models based on a shared memory. Routing and sorting are prototype problems for the networks; in particular, they provide the basis for simulating the more powerful shared memory models. It is shown that a simple but important class of deterministic strategies (oblivious routing) is necessarily inefficient with respect to worst case analysis. Routing can be viewed as a special case of sorting, and the existence of an O(log n) sorting algorithm for some n processor fixed connection network has only recently been established by Ajtai, Komlós, and Szemerédi (“15th ACM Sympos. on Theory of Comput.,” Boston, Mass., 1983, pp. 1–9). If the more powerful class of shared memory models is considered then it is possible to simply achieve an O(log n log log n) sort via Valiant’s parallel merging algorithm, which it is shown can be implemented on certain models. Within a spectrum of shared memory models, it is shown that log log n is asymptotically optimal for n processors to merge two sorted lists containing n elements. © 1985 Academic Press, Inc.
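An "oblivious" strategy, as discussed in the abstract, is one whose communication or comparison pattern is fixed in advance, independent of the data. Batcher's odd-even merge is a classic oblivious merging network and makes the idea concrete (this sketch illustrates obliviousness generally; it is not Valiant's merging algorithm, and the recursion follows the standard textbook formulation for power-of-two lengths):

```python
def odd_even_merge(a, lo=0, n=None, r=1):
    """Batcher's odd-even merge: merges two sorted halves of a[lo:lo+n].

    The compare-exchange pattern depends only on n, never on the values,
    which is exactly what makes the network oblivious."""
    if n is None:
        n = len(a)                          # n must be a power of two
    m = r * 2
    if m < n:
        odd_even_merge(a, lo, n, m)         # merge even-indexed subsequence
        odd_even_merge(a, lo + r, n, m)     # merge odd-indexed subsequence
        for i in range(lo + r, lo + n - r, m):
            if a[i] > a[i + r]:             # final compare-exchange sweep
                a[i], a[i + r] = a[i + r], a[i]
    elif a[lo] > a[lo + r]:
        a[lo], a[lo + r] = a[lo + r], a[lo]

xs = [2, 4, 6, 8, 1, 3, 5, 7]               # two sorted runs of length 4
odd_even_merge(xs)
print(xs)  # -> [1, 2, 3, 4, 5, 6, 7, 8]
```

Because every compare-exchange is scheduled in advance, the same network can be laid out in hardware or on a fixed connection network; the paper's lower-bound result shows this data-independence has an inherent worst-case cost.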

...te in a given step, and we usually assume that the network is reasonably sparse; as examples, consider the shuffle-exchange network (Stone [26]) and its development into the ultracomputer of Schwartz [22], the array or mesh connected processors such as the Illiac IV, the cube-connected cycles of Preparata and Vuillemin [20], or the more basic d-dimensional hypercube studied in Valiant and Brebner [29]...

In this paper we implement several basic operating system primitives by using a "replace-add" operation, which can supersede the standard "test and set", and which appears to be a universal primitive for efficiently coordinating large numbers of independently acting sequential processors. We also present a hardware implementation of replace-add that permits multiple replace-adds to be processed nearly as efficiently as loads and stores. Moreover, the crucial special case of concurrent replace-adds updating the same variable is handled particularly well: If every PE simultaneously addresses a replace-add at the same variable, all these requests are satisfied in the time required to process just one request.
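The replace-add operation atomically adds a value to a shared variable and returns the updated value. A software sketch of its semantics (the paper's contribution is a combining *hardware* implementation; the lock here merely models atomicity, and the `ReplaceAdd` class name is our own):

```python
import threading

class ReplaceAdd:
    """Models replace-add: atomically add delta to a shared variable
    and return the NEW value (unlike fetch-and-add, which returns the old)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def replace_add(self, delta):
        with self._lock:                  # stands in for hardware atomicity
            self._value += delta
            return self._value

# Coordination example: each "processor" claims a distinct ticket number,
# with no test-and-set retry loop.
counter = ReplaceAdd(0)
tickets = []
threads = [threading.Thread(target=lambda: tickets.append(counter.replace_add(1)))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(tickets))  # -> [1, 2, 3, 4, 5, 6, 7, 8]
```

Every thread receives a unique value, which is the property that lets replace-add implement queues, semaphores, and other OS primitives without serializing contending processors.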

by
James M. Ortega, Robert G. Voigt
- Proc. 1977 Army Numerical Analysis and Computers Conference, 1977

In this paper we review the present status of numerical methods for partial differential equations on vector and parallel computers. A discussion of the relevant aspects of these computers and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations as well as explicit and implicit methods for initial-boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. A brief discussion of application areas utilizing these computers is included.
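The contrast between explicit and implicit methods that the survey draws can be seen in one step of the simplest initial-boundary value problem, the 1-D heat equation u_t = u_xx. In an explicit scheme each interior point updates independently from known values, which is why such methods vectorize so naturally (the grid, parameters, and function name below are illustrative, not from the survey):

```python
def explicit_heat_step(u, nu):
    """One forward-Euler step for u_t = u_xx on a uniform grid.

    nu = dt/dx**2 must satisfy nu <= 0.5 for stability; an implicit
    scheme would instead solve a tridiagonal system each step."""
    new = u[:]                              # boundary values held fixed
    for i in range(1, len(u) - 1):          # each point is independent: vectorizable
        new[i] = u[i] + nu * (u[i-1] - 2*u[i] + u[i+1])
    return new

u = [0.0, 0.0, 1.0, 0.0, 0.0]               # initial spike, fixed ends
u = explicit_heat_step(u, 0.25)
print(u)  # -> [0.0, 0.25, 0.5, 0.25, 0.0]
```

The trade-off the survey examines is exactly this: explicit updates are embarrassingly parallel but stability-limited in the time step, while implicit updates allow large steps at the cost of a linear solve that is harder to parallelize.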

We study efficient deterministic parallel algorithms on two models: restartable fail-stop CRCW PRAMs and asynchronous PRAMs. In the first model, synchronous processors are subject to arbitrary stop failures and restarts determined by an on-line adversary and involving loss of private but not shared memory; the complexity measures are completed work (where processors are charged for completed fixed-size update cycles) and overhead ratio (completed work amortized over necessary work and failures). In the second model, the result of the computation is a serialization of the actions of the processors determined by an on-line adversary; the complexity measure is total work (number of steps taken by all processors). Despite their differences the two models share key algorithmic techniques. We present new algorithms for the Write-All problem (in which P processors write ones into an array of size N) for the two models. These algorithms can be used to implement a simulation strategy for any N ...
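To make the Write-All problem concrete, here is a toy sequentialized simulation: P processors must set every cell of an N-cell array to one while an adversary stops processors at arbitrary steps. The naive rescan strategy below only states the problem and its work measure; it makes no attempt at the paper's efficient algorithms, and all parameters are made up:

```python
import random

def write_all(N, P, fail_prob=0.3, rng=random.Random(0)):
    """Naive Write-All: surviving processors rescan for the first unwritten cell."""
    array = [0] * N
    work = 0                                # total steps charged to processors
    while not all(array):
        for p in range(P):
            if rng.random() < fail_prob:    # adversary stops this processor
                continue
            for i in range(N):              # linear rescan = the wasted work
                work += 1
                if array[i] == 0:
                    array[i] = 1
                    break
    return array, work

array, work = write_all(N=16, P=4)
print(all(array), work >= 16)
```

Every cell does get written, but the rescans inflate the completed work well beyond the N writes that are strictly necessary; bounding that overhead against an adversarial failure pattern is precisely what the paper's algorithms achieve.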

Four algorithms are analyzed in the shared and nonshared (distributed) memory models of parallel computation. The analysis shows that the shared memory model predicts optimality for algorithms and programming styles that cannot be realized on any physical parallel computers. Programs based on these techniques are inferior to programs written in the nonshared memory model. The “unit” cost charged for a reference to shared memory is argued to be the source of the shared memory model’s inaccuracy. The implications of these observations are discussed.

by
John H. Miller
- “Tournament Selection and the Effects of Noise”, Complex Systems 9, 1995

The organization of information processing resources is a central question in economic, organizational, and computational theory. Recent work by Radner (1992) and others has developed a simple theoretical framework and some useful formal mathematical results about the behavior of such systems. Here, we follow a complementary computational approach that allows us to pursue questions concerning the impact of coordination and various exogenous conditions facing the organization. We find that organizations demonstrate "order for free," that is, given a simple structural framework and a set of standard operating procedures, even randomly generated organizations imply well-defined patterns of behavior. Using a genetic algorithm, we also show that simple evolutionary processes allow organizations to "learn" better structures. I am grateful to Jody Lutz and, especially, Hollis Schuler for research assistance, and to Dean Behrens, Kathleen Carley, and Robyn Dawes for useful comments. I would ...

The development of High-Performance Computing (HPC) programs is crucial to progress in many fields of scientific endeavor. We have run initial studies of the productivity of HPC developers and of techniques for improving that productivity, which have not previously been the subject of significant study. Because of key differences between development for HPC and for more conventional software engineering applications, this work has required the tailoring of experimental designs and protocols. A major contribution of our work is to begin to quantify the code development process in a specialized area that has previously not been extensively studied. Specifically, we present an analysis of the domain of High-Performance Computing for the aspects that would impact experimental design; show how those aspects are reflected in experimental design for this specific area; and demonstrate how we are using such experimental designs to build up a body of knowledge specific to the domain. Results to date build confidence in our approach by showing that there are no significant differences across studies comparing subjects with similar experience tackling similar problems, while there are significant differences in performance and effort among the different parallel models applied.

...asurable decrease in execution time. Within the HPC community, metrics and even predictive models have already been developed for measuring the final code performance, under various constraints (e.g. [12, 5]). However, little empirical work has been done to date to study the human effort required in the development of those solutions. There has been for example no investigation of what percentage of the ...

We study the presence of cycles and long paths in graphs that have been proposed as interconnection networks for parallel architectures. The study surveys and complements known results.

1 Introduction

This paper is devoted to studying embeddings of the simplest possible guest graphs, the path P_N and the cycle C_N, in graphs that have been proposed as interconnection networks for parallel architectures. In addition to their intrinsic interest, in terms of the development of algorithms on parallel architectures, these two guest graphs are important because of the fact that many structurally richer graphs can be constructed from paths and cycles by various product constructions. A few of the results we present are original; several appear in the literature and are duly cited; many belong to the folklore of the field. Indeed this paper is motivated by a desire to find a single repository for this important, yet scattered material. Before proceeding further, we define formally the ...
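One classic fact in the territory this survey covers is that the d-dimensional hypercube Q_d contains a Hamiltonian cycle, given by the binary reflected Gray code: successive labels differ in exactly one bit and hence are hypercube neighbors. A small illustrative check (the function name is ours):

```python
def gray_code_cycle(d):
    """Vertices of the d-cube Q_d in an order forming a Hamiltonian cycle."""
    return [i ^ (i >> 1) for i in range(2 ** d)]   # binary reflected Gray code

cycle = gray_code_cycle(3)
print(cycle)  # -> [0, 1, 3, 2, 6, 7, 5, 4]

# Each consecutive pair, including the wrap-around, differs in exactly one bit:
ok = all(bin(cycle[i] ^ cycle[(i + 1) % len(cycle)]).count("1") == 1
         for i in range(len(cycle)))
print(ok)  # -> True
```

Embeddings like this one are exactly the path/cycle-in-host questions the paper collects: the cycle C_8 embeds in Q_3 with dilation 1.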