Sample records for parallel computational fluid

The discipline of computationalfluid dynamics is demanding ever-increasing computational power to deal with complex fluid flow problems. We investigate the performance of a finite-difference computationalfluid dynamics algorithm on a massively parallelcomputer, the Connection Machine. Of special interest is an implicit time-stepping algorithm; to obtain maximum performance from the Connection Machine, it is necessary to use a nonstandard algorithm to solve the linear systems that arise in the implicit algorithm. We find that the Connection Machine ran achieve very high computation rates on both explicit and implicit algorithms. The performance of the Connection Machine puts it in the same class as today's most powerful conventional supercomputers.

To accomplish more detailed simulations of highly complex flows, such as the transition to turbulence, fluid dynamics research requires computers much more powerful than any available today. Only parallel processing on multiple-processor computers offers hope for achieving the required effective speeds. Looking ahead to the use of these machines, the fluid dynamicist faces three issues: algorithm development for near-term parallelcomputers, architecture development for future computer power increases, and assessment of possible advantages of special purpose designs. Two projects at NASA Langley address these issues. Software development and algorithm exploration is being done on the FLEX/32 Parallel Processing Research Computer. New architecture features are being explored in the special purpose hardware design of the Navier-Stokes Computer. These projects are complementary and are producing promising results.

Distributed-memory parallelcomputers dominate today's parallelcomputing arena. These machines, such as Intel Paragon, IBM SP2, and Cray Origin2OO, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallelcomputing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate ComputationalFluid Dynamics (CFD) simulations and other engineering applications. The work plan is 1) developing highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, 3) incorporate newly developed algorithms into actual simulation packages. The work plan has well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) Adopting a mathematical geometry which has a better capacity to describe the fluid, (2) Using compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm

A finite difference code was implemented for the compressible Navier-Stokes equations on the Connection Machine, a massively parallelcomputer. The code is based on the ARC2D/ARC3D program and uses the implicit factored algorithm of Beam and Warming. The codes uses odd-even elimination to solve linear systems. Timings and computation rates are given for the code, and a comparison is made with a Cray XMP.

A study of parallel algorithms for the numerical solution of partial differential equations arising in computationalfluid dynamics is presented. The actual implementation on parallel processors of shared and nonshared memory design is discussed. The performance of these algorithms is analyzed in terms of machine efficiency, communication time, bottlenecks and software development costs. For elliptic equations, a parallel preconditioned conjugate gradient method is described, which has been used to solve pressure equations discretized with high order finite elements on irregular grids. A parallel full multigrid method and a parallel fast Poisson solver are also presented. Hyperbolic conservation laws were discretized with parallel versions of finite difference methods like the Lax-Wendroff scheme and with the Random Choice method. Techniques are developed for comparing the behavior of an algorithm on different architectures as a function of problem size and local computational effort. Effective use of these advanced architecture machines requires the use of machine dependent programming. It is shown that the portability problems can be minimized by introducing high level operations on vectors and matrices structured into program libraries

A real-time visualization system for computationalfluid dynamics in a network connecting between a parallelcomputing server and the client terminal was developed. Using the system, a user can visualize the results of a CFD (ComputationalFluid Dynamics) simulation on the parallelcomputer as a client terminal during the actual computation on a server. Using GUI (Graphical User Interface) on the client terminal, to user is also able to change parameters of the analysis and visualization during the real-time of the calculation. The system carries out both of CFD simulation and generation of a pixel image data on the parallelcomputer, and compresses the data. Therefore, the amount of data from the parallelcomputer to the client is so small in comparison with no compression that the user can enjoy the swift image appearance comfortably. Parallelization of image data generation is based on Owner Computation Rule. GUI on the client is built on Java applet. A real-time visualization is thus possible on the client PC only if Web browser is implemented on it. (author)

At the 19th Annual Conference on ParallelComputationalFluid Dynamics held in Antalya, Turkey, in May 2007, the most recent developments and implementations of large-scale and grid computing were presented. This book, comprised of the invited and selected papers of this conference, details those advances, which are of particular interest to CFD and CFD-related communities. It also offers the results related to applications of various scientific and engineering problems involving flows and flow-related topics. Intended for CFD researchers and graduate students, this book is a state-of-the-art presentation of the relevant methodology and implementation techniques of large-scale computing.

ParallelComputations focuses on parallelcomputation, with emphasis on algorithms used in a variety of numerical and physical applications and for many different types of parallelcomputers. Topics covered range from vectorization of fast Fourier transforms (FFTs) and of the incomplete Cholesky conjugate gradient (ICCG) algorithm on the Cray-1 to calculation of table lookups and piecewise functions. Single tridiagonal linear systems and vectorized computation of reactive flow are also discussed.Comprised of 13 chapters, this volume begins by classifying parallelcomputers and describing techn

In recent years significant advances have been made for parallelcomputers in both hardware and software. Now parallelcomputers have become viable tools in computational mechanics. Many application codes developed on conventional computers have been modified to benefit from parallelcomputers. Significant speedups in some areas have been achieved by parallelcomputations. For single-discipline use of both fluid dynamics and structural dynamics, computations have been made on wing-body configurations using parallelcomputers. However, only a limited amount of work has been completed in combining these two disciplines for multidisciplinary applications. The prime reason is the increased level of complication associated with a multidisciplinary approach. In this work, procedures to compute aeroelasticity on parallelcomputers using direct coupling of fluid and structural equations will be investigated for wing-body configurations. The parallelcomputer selected for computations is an Intel iPSC/860 computer which is a distributed-memory, multiple-instruction, multiple data (MIMD) computer with 128 processors. In this study, the computational efficiency issues of parallel integration of both fluid and structural equations will be investigated in detail. The fluid and structural domains will be modeled using finite-difference and finite-element approaches, respectively. Results from the parallelcomputer will be compared with those from the conventional computers using a single processor. This study will provide an efficient computational tool for the aeroelastic analysis of wing-body structures on MIMD type parallelcomputers.

Abstract Aeroelasticity which involves strong coupling of fluids, structures and controls is an important element in designing an aircraft. Computational aeroelasticity using low fidelity methods such as the linear aerodynamic flow equations coupled with the modal structural equations are well advanced. Though these low fidelity approaches are computationally less intensive, they are not adequate for the analysis of modern aircraft such as High Speed Civil Transport (HSCT) and Advanced Subsonic Transport (AST) which can experience complex flow/structure interactions. HSCT can experience vortex induced aeroelastic oscillations whereas AST can experience transonic buffet associated structural oscillations. Both aircraft may experience a dip in the flutter speed at the transonic regime. For accurate aeroelastic computations at these complex fluid/structure interaction situations, high fidelity equations such as the Navier-Stokes for fluids and the finite-elements for structures are needed. Computations using these high fidelity equations require large computational resources both in memory and speed. Current conventional super computers have reached their limitations both in memory and speed. As a result, parallelcomputers have evolved to overcome the limitations of conventional computers. This paper will address the transition that is taking place in computational aeroelasticity from conventional computers to parallelcomputers. The paper will address special techniques needed to take advantage of the architecture of new parallelcomputers. Results will be illustrated from computations made on iPSC/860 and IBM SP2 computer by using ENSAERO code that directly couples the Euler/Navier-Stokes flow equations with high resolution finite-element structural equations.

This paper presents the application of parallelcomputing techniques to large-scale modeling of fluid flow in the unsaturated zone (UZ) at Yucca Mountain, Nevada. In this study, parallelcomputing techniques, as implemented into the TOUGH2 code, are applied in large-scale numerical simulations on a distributed-memory parallelcomputer. The modeling study has been conducted using an over-one-million-cell three-dimensional numerical model, which incorporates a wide variety of field data for the highly heterogeneous fractured formation at Yucca Mountain. The objective of this study is to analyze the impact of various surface infiltration scenarios (under current and possible future climates) on flow through the UZ system, using various hydrogeological conceptual models with refined grids. The results indicate that the one-million-cell models produce better resolution results and reveal some flow patterns that cannot be obtained using coarse-grid modeling models

Many pool type reactors have forced downward flows inside the core during normal operation; there is a chance of flow inversion when transients occur. During this phase, the flow undergo transition between turbulent and laminar regions where drastic changes take place in terms of momentum and heat transfer, and the decrease in safety margin is usually observed. Additionally, for high Prandtl number fluids such as water, an effect of the velocity profile inside the channel on the temperature distribution is more pronounced over the low Prandtl number ones. This makes the checking of its pressure drop estimation accuracy less important, assuming the code verification is complete. With an advent of powerful computer hardware, engineering applications of computationalfluid dynamics (CFD) methods have become quite common these days. Especially for a fully-turbulent and single phase convective heat transfer, the predictability of the commercial codes has matured enough so that many well-known companies adopt those to accelerate a product development cycle and to realize an increased profitability. In contrast to the above, the transition models for the CFD code are still under development, and the most of the models show limited generality and prediction accuracy. Unlike the system codes, the CFD codes estimate the pressure drop from the velocity profile which is obtained by solving momentum conservation equations, and the resulting friction factor can be a representative parameter for a constant cross section channel flow. In addition, the flow inside a rectangular channel with a high span to gap ratio can be approximated by flow inside parallel plates. The computationalfluid dynamics simulation on the flow between parallel plates showed reasonable prediction capability for the laminar and the turbulent regime.

We propose an object-oriented programming paradigm for parallelization of scientific computing programs, and show that the approach can be a very useful strategy. Generally, parallelization of scientific programs tends to be complicated and unportable due to the specific requirements of each parallelcomputer or compiler. In this paper, we show that the object-oriented programming design, which separates the parallel processing parts from the solver of the applications, can achieve the large improvement in the maintenance of the codes, as well as the high portability. We design the program for the two-dimensional Euler equations according to the paradigm, and evaluate the parallel performance on IBM SP2. (author)

An efficient and accurate solver is developed to simulate the non-linear fluid-structural interactions in turbomachinery flutter flows. A new low diffusion E-CUSP scheme, Zha CUSP scheme, is developed to improve the efficiency and accuracy of the inviscid flux computation. The 3D unsteady Navier-Stokes equations with the Baldwin-Lomax turbulence model are solved using the finite volume method with the dual-time stepping scheme. The linearized equations are solved with Gauss-Seidel line iterations. The parallelcomputation is implemented using MPI protocol. The solver is validated with 2D cases for its turbulence modeling, parallelcomputation and unsteady calculation. The Zha CUSP scheme is validated with 2D cases, including a supersonic flat plate boundary layer, a transonic converging-diverging nozzle and a transonic inlet diffuser. The Zha CUSP2 scheme is tested with 3D cases, including a circular-to-rectangular nozzle, a subsonic compressor cascade and a transonic channel. The Zha CUSP schemes are proved to be accurate, robust and efficient in these tests. The steady and unsteady separation flows in a 3D stationary cascade under high incidence and three inlet Mach numbers are calculated to study the steady state separation flow patterns and their unsteady oscillation characteristics. The leading edge vortex shedding is the mechanism behind the unsteady characteristics of the high incidence separated flows. The separation flow characteristics is affected by the inlet Mach number. The blade aeroelasticity of a linear cascade with forced oscillating blades is studied using parallelcomputation. A simplified two-passage cascade with periodic boundary condition is first calculated under a medium frequency and a low incidence. The full scale cascade with 9 blades and two end walls is then studied more extensively under three oscillation frequencies and two incidence angles. The end wall influence and the blade stability are studied and compared under different

The work in the field of parallel processing has developed as research activities using several numerical Monte Carlo simulations related to basic or applied current problems of nuclear and particle physics. For the applications utilizing the GEANT code development or improvement works were done on parts simulating low energy physical phenomena like radiation, transport and interaction. The problem of actinide burning by means of accelerators was approached using a simulation with the GEANT code. A program of neutron tracking in the range of low energies up to the thermal region has been developed. It is coupled to the GEANT code and permits in a single pass the simulation of a hybrid reactor core receiving a proton burst. Other works in this field refers to simulations for nuclear medicine applications like, for instance, development of biological probes, evaluation and characterization of the gamma cameras (collimators, crystal thickness) as well as the method for dosimetric calculations. Particularly, these calculations are suited for a geometrical parallelization approach especially adapted to parallel machines of the TN310 type. Other works mentioned in the same field refer to simulation of the electron channelling in crystals and simulation of the beam-beam interaction effect in colliders. The GEANT code was also used to simulate the operation of germanium detectors designed for natural and artificial radioactivity monitoring of environment

For computationalfluid dynamics, the governing equations are solved on a discretized domain of nodes, faces, and cells. The quality of the grid or mesh can be a driving source for error in the results. While refinement studies can help guide the creation of a mesh, grid quality is largely determined by user expertise and understanding of the flow physics. Adaptive mesh refinement is a technique for enriching the mesh during a simulation based on metrics for error, impact on important parameters, or location of important flow features. This can offload from the user some of the difficult and ambiguous decisions necessary when discretizing the domain. This work explores the implementation of adaptive mesh refinement in an implicit, unstructured, finite-volume solver. Consideration is made for applying modern computational techniques in the presence of hanging nodes and refined cells. The approach is developed to be independent of the flow solver in order to provide a path for augmenting existing codes. It is designed to be applicable for unsteady simulations and refinement and coarsening of the grid does not impact the conservatism of the underlying numerics. The effect on high-order numerical fluxes of fourth- and sixth-order are explored. Provided the criteria for refinement is appropriately selected, solutions obtained using adapted meshes have no additional error when compared to results obtained on traditional, unadapted meshes. In order to leverage large-scale computational resources common today, the methods are parallelized using MPI. Parallel performance is considered for several test problems in order to assess scalability of both adapted and unadapted grids. Dynamic repartitioning of the mesh during refinement is crucial for load balancing an evolving grid. Development of the methods outlined here depend on a dual-memory approach that is described in detail. Validation of the solver developed here against a number of motivating problems shows favorable

Computation of three-dimensional (3-D) Rayleigh--Taylor instability in compressible fluids is performed on a MIMD computer. A second-order TVD scheme is applied with a fully parallelized algorithm to the 3-D Euler equations. The computational program is implemented for a 3-D study of bubble evolution in the Rayleigh--Taylor instability with varying bubble aspect ratio and for large-scale simulation of a 3-D random fluid interface. The numerical solution is compared with the experimental results by Taylor

In the thermo-fluid analysis code named CUPID, the linear system of pressure equations must be solved in each iteration step. The time for repeatedly solving the linear system can be quite significant because large sparse matrices of Rank more than 50,000 are involved and the diagonal dominance of the system is hardly hold. Therefore parallelization of the linear system solver is essential to reduce the computing time. Meanwhile, Graphics Processing Units (GPU) have been developed as highly parallel, multi-core processors for the global demand of high quality 3D graphics. If a suitable interface is provided, parallelization using GPU can be available to engineering computing. NVIDIA provides a Software Development Kit(SDK) named CUDA(Compute Unified Device Architecture) to code developers so that they can manage GPUs for parallelization using the C language. In this research, we implement parallel routines for the linear system solver using CUDA, and examine the performance of the parallelization. In the next section, we will describe the method of CUDA parallelization for the CUPID code, and then the performance of the CUDA parallelization will be discussed

This report summarizes the results of the project entitled, ''Piecewise-Parabolic Methods for ParallelComputation with Applications to Unstable Fluid Flow in 2 and 3 Dimensions'' This project covers a span of many years, beginning in early 1987. It has provided over that considerable period the core funding to my research activities in scientific computation at the University of Minnesota. It has supported numerical algorithm development, application of those algorithms to fundamental fluid dynamics problems in order to demonstrate their effectiveness, and the development of scientific visualization software and systems to extract scientific understanding from those applications.

The impact of parallel processing on computational science and, in particular, on computationalfluid dynamics is growing rapidly. In this paper, particular emphasis is given to developments which have occurred within the past two years. Parallel processing is defined and the reasons for its importance in high-performance computing are reviewed. Parallelcomputer architectures are classified according to the number and power of their processing units, their memory, and the nature of their connection scheme. Architectures which show promise for fluid dynamics applications are emphasized. Fluid dynamics problems are examined for parallelism inherent at the physical level. CFD algorithms and their mappings onto parallel architectures are discussed. Several example are presented to document the performance of fluid dynamics applications on present-generation parallel processing devices

Computationalfluid dynamics (CFD) solution approximations for complex fluid flow problems have become a common and powerful engineering analysis technique. These tools, though qualitatively useful, remain limited in practice by their underlying inverse relationship between simulation accuracy and overall computational expense. While a great volume of research has focused on remedying these issues inherent to CFD, one traditionally overlooked area of resource reduction for engineering analysis concerns the basic definition and determination of functional relationships for the studied fluid flow variables. This artificial relationship-building technique, called meta-modeling or surrogate/offline approximation, uses design of experiments (DOE) theory to efficiently approximate non-physical coupling between the variables of interest in a fluid flow analysis problem. By mathematically approximating these variables, DOE methods can effectively reduce the required quantity of CFD simulations, freeing computational resources for other analytical focuses. An idealized interpretation of a fluid flow problem can also be employed to create suitably accurate approximations of fluid flow variables for the purposes of engineering analysis. When used in parallel with a meta-modeling approximation, a closed-form approximation can provide useful feedback concerning proper construction, suitability, or even necessity of an offline approximation tool. It also provides a short-circuit pathway for further reducing the overall computational demands of a fluid flow analysis, again freeing resources for otherwise unsuitable resource expenditures. To validate these inferences, a design optimization problem was presented requiring the inexpensive estimation of aerodynamic forces applied to a valve operating on a simulated piston-cylinder heat engine. The determination of these forces was to be found using parallel surrogate and exact approximation methods, thus evidencing the comparative

A clear illustration of how parallelcomputers can be successfully appliedto large-scale scientific computations. This book demonstrates how avariety of applications in physics, biology, mathematics and other scienceswere implemented on real parallelcomputers to produce new scientificresults. It investigates issues of fine-grained parallelism relevant forfuture supercomputers with particular emphasis on hypercube architecture. The authors describe how they used an experimental approach to configuredifferent massively parallel machines, design and implement basic systemsoftware, and develop

An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallelcomputers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallelcomputers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallelcomputers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.

For a wide class of problems in scientific computing, in particular for partial differential equations, the multigrid principle has proved to yield highly efficient numerical methods. However, the principle has to be applied carefully: if the multigrid components are not chosen adequately with respect to the given problem, the efficiency may be much smaller than possible. This has been demonstrated for many practical problems. Unfortunately, the general theories on multigrid convergence do not give much help in constructing really efficient multigrid algorithms. Although some progress has been made in bridging the gap between theory and practice during the last few years, there are still several theoretical approaches which are misleading rather than helpful with respect to the objective of real efficiency. The research in finding highly efficient algorithms for non-model applications therefore is still a sophisticated mixture of theoretical considerations, a transfer of experiences from model to real life problems and systematical experimental work. The emphasis of the practical research activity today lies - among others - in the following fields: - finding efficient multigrid components for really complex problems, - combining the multigrid approach with advanced discretizative techniques: - constructing highly parallel multigrid algorithms. In this paper, we want to deal mainly with the last topic

This book is primarily intended as a research monograph that could also be used in graduate courses for the design of parallel algorithms in matrix computations. It assumes general but not extensive knowledge of numerical linear algebra, parallel architectures, and parallel programming paradigms. The book consists of four parts: (I) Basics; (II) Dense and Special Matrix Computations; (III) Sparse Matrix Computations; and (IV) Matrix functions and characteristics. Part I deals with parallel programming paradigms and fundamental kernels, including reordering schemes for sparse matrices. Part II is devoted to dense matrix computations such as parallel algorithms for solving linear systems, linear least squares, the symmetric algebraic eigenvalue problem, and the singular-value decomposition. It also deals with the development of parallel algorithms for special linear systems such as banded ,Vandermonde ,Toeplitz ,and block Toeplitz systems. Part III addresses sparse matrix computations: (a) the development of pa...

Practical ParallelComputing provides information pertinent to the fundamental aspects of high-performance parallel processing. This book discusses the development of parallel applications on a variety of equipment.Organized into three parts encompassing 12 chapters, this book begins with an overview of the technology trends that converge to favor massively parallel hardware over traditional mainframes and vector machines. This text then gives a tutorial introduction to parallel hardware architectures. Other chapters provide worked-out examples of programs using several parallel languages. Thi

One approach to improving the real-time efficiency of plasma turbulence calculations is to use a parallel algorithm. A parallel algorithm for plasma turbulence calculations was tested on the Intel iPSC/860 hypercube and the Touchtone Delta machine. Using the 128 processors of the Intel iPSC/860 hypercube, a factor of 5 improvement over a single-processor CRAY-2 is obtained. For the Touchtone Delta machine, the corresponding improvement factor is 16. For plasma edge turbulence calculations, an extrapolation of the present results to the Intel σ machine gives an improvement factor close to 64 over the single-processor CRAY-2

One approach to improving the real-time efficiency of plasma turbulence calculations is to use a parallel algorithm. A parallel algorithm for plasma turbulence calculations was tested on the Intel iPSC/860 hypercube and the Touchtone Delta machine. Using the 128 processors of the Intel iPSC/860 hypercube, a factor of 5 improvement over a single-processor CRAY-2 is obtained. For the Touchtone Delta machine, the corresponding improvement factor is 16. For plasma edge turbulence calculations, an extrapolation of the present results to the Intel (sigma) machine gives an improvement factor close to 64 over the single-processor CRAY-2. 12 refs

One approach to improving the real-time efficiency of plasma turbulence calculations is to use a parallel algorithm. A parallel algorithm for plasma turbulence calculations was tested on the Intel iPSC/860 hypercube and the Touchtone Delta machine. Using the 128 processors of the Intel iPSC/860 hypercube, a factor of 5 improvement over a single-processor CRAY-2 is obtained. For the Touchtone Delta machine, the corresponding improvement factor is 16. For plasma edge turbulence calculations, an extrapolation of the present results to the Intel (sigma) machine gives an improvement factor close to 64 over the single-processor CRAY-2.

Algorithmically Specialized ParallelComputers focuses on the concept and characteristics of an algorithmically specialized computer.This book discusses the algorithmically specialized computers, algorithmic specialization using VLSI, and innovative architectures. The architectures and algorithms for digital signal, speech, and image processing and specialized architectures for numerical computations are also elaborated. Other topics include the model for analyzing generalized inter-processor, pipelined architecture for search tree maintenance, and specialized computer organization for raster

The adaptation of a reservoir simulator for parallelcomputations is described. The simulator was originally designed for vector processors. It performs approximately 99% of its calculations in vector/parallel mode and relative to scalar calculations it achieves speedups of 65 and 81 for black oil and EOS simulations, respectively on the CRAY C-90

The study of plasma turbulence and transport is a complex problem of critical importance for fusion-relevant plasmas. To this day, the fluid treatment of plasma dynamics is the best approach to realistic physics at the high resolution required for certain experimentally relevant calculations. Core and edge turbulence in a magnetic fusion device have been modeled using state-of-the-art, nonlinear, three-dimensional, initial-value fluid and gyrofluid codes. Parallel implementation of these models on diverse platforms--vector parallel (National Energy Research Supercomputer Center's CRAY Y-MP C90), massively parallel (Intel Paragon XP/S 35), and serial parallel (clusters of high-performance workstations using the Parallel Virtual Machine protocol)--offers a variety of paths to high resolution and significant improvements in real-time efficiency, each with its own advantages. The largest and most efficient calculations have been performed at the 200 Mword memory limit on the C90 in dedicated mode, where an overlap of 12 to 13 out of a maximum of 16 processors has been achieved with a gyrofluid model of core fluctuations. The richness of the physics captured by these calculations is commensurate with the increased resolution and efficiency and is limited only by the ingenuity brought to the analysis of the massive amounts of data generated

The June 1992 to May 1993 grant NCC-2-677 provided for the continued demonstration of ComputationalFluid Dynamics (CFD) as applied to the Stratospheric Observatory for Infrared Astronomy (SOFIA). While earlier grant years allowed validation of CFD through comparison against experiments, this year a new design proposal was evaluated. The new configuration would place the cavity aft of the wing, as opposed to the earlier baseline which was located immediately aft of the cockpit. This aft cavity placement allows for simplified structural and aircraft modification requirements, thus lowering the program cost of this national astronomy resource. Three appendices concerning this subject are presented.

Until relatively recently almost all the algorithms for use on computers had been designed on the (usually unstated) assumption that they were to be run on single processor, serial machines. With the introduction of vector processors, array processors and interconnected systems of mainframes, minis and micros, however, various forms of parallelism have become available. The advantage of parallelism is that it offers increased overall processing speed but it also raises some fundamental questions, including: (i) which, if any, of the existing 'serial' algorithms can be adapted for use in the parallel mode. (ii) How close to optimal can such adapted algorithms be and, where relevant, what are the convergence criteria. (iii) How can we design new algorithms specifically for parallel systems. (iv) For multi-processor systems how can we handle the software aspects of the interprocessor communications. Aspects of these questions illustrated by examples are considered in these lectures. (orig.)

The SCALE computational architecture has remained basically the same since its inception 30 years ago, although constituent modules and capabilities have changed significantly. This SCALE concept was intended to provide a framework whereby independent codes can be linked to provide a more comprehensive capability than possible with the individual programs - allowing flexibility to address a wide variety of applications. However, the current system was designed originally for mainframe computers with a single CPU and with significantly less memory than today's personal computers. It has been recognized that the present SCALE computation system could be restructured to take advantage of modern hardware and software capabilities, while retaining many of the modular features of the present system. Preliminary work is being done to define specifications and capabilities for a more advanced computational architecture. This paper describes the state of current SCALE development activities and plans for future development. With the release of SCALE 6.1 in 2010, a new phase of evolutionary development will be available to SCALE users within the TRITON and NEWT modules. The SCALE (Standardized Computer Analyses for Licensing Evaluation) code system developed by Oak Ridge National Laboratory (ORNL) provides a comprehensive and integrated package of codes and nuclear data for a wide range of applications in criticality safety, reactor physics, shielding, isotopic depletion and decay, and sensitivity/uncertainty (S/U) analysis. Over the last three years, since the release of version 5.1 in 2006, several important new codes have been introduced within SCALE, and significant advances applied to existing codes. Many of these new features became available with the release of SCALE 6.0 in early 2009. However, beginning with SCALE 6.1, a first generation of parallelcomputing is being introduced. In addition to near-term improvements, a plan for longer term SCALE enhancement

Proceedings and the Third International Workshop on Applied ParallelComputing in Industrial Problems and Optimization (PARA96)......Proceedings and the Third International Workshop on Applied ParallelComputing in Industrial Problems and Optimization (PARA96)...

Full Text Available The paper considers the monitoring of parallelcomputations for detection of abnormal events. It is assumed that computations are organized according to an event model, and monitoring is based on specific test sequences

This book deals with computationalfluid dynamics with basic and history of numerical fluid dynamics, introduction of finite volume method using one-dimensional heat conduction equation, solution of two-dimensional heat conduction equation, solution of Navier-Stokes equation, fluid with heat transport, turbulent flow and turbulent model, Navier-Stokes solution by generalized coordinate system such as coordinate conversion, conversion of basic equation, program and example of calculation, application of abnormal problem and high speed solution of numerical fluid dynamics.

We describe portable software to simulate universal quantum computers on massive parallelComputers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as a IBM BlueGene/L, a IBM Regatta p690+, a Hitachi SR11000/J1, a Cray

Computer program calculates the steady state fluid distribution, temperature rise, and pressure drop of a coolant, the material temperature distribution of a heat generating solid, and the heat flux distributions at the fluid-solid interfaces. It performs the necessary iterations automatically within the computer, in one machine run.

This book presents major advances in high performance computing as well as major advances due to high performance computing. It contains a collection of papers in which results achieved in the collaboration of scientists from computer science, mathematics, physics, and mechanical engineering are presented. From the science problems to the mathematical algorithms and on to the effective implementation of these algorithms on massively parallel and cluster computers we present state-of-the-art methods and technology as well as exemplary results in these fields. This book shows that problems which seem superficially distinct become intimately connected on a computational level.

The use of parallelism in the solution of wakefield problems is illustrated for two different computer architectures (SIMD and MIMD). Results are given for finite difference codes which have been implemented on a Connection Machine and an Alliant FX/8 and which are used to compute wakefields in dielectric loaded structures. Benchmarks on code performance are presented for both cases. 4 refs., 3 figs., 2 tabs

For almost thirty years, sequential R-matrix computation has been used by atomic physics research groups, from around the world, to model collision phenomena involving the scattering of electrons or positrons with atomic or molecular targets. As considerable progress has been made in the understanding of fundamental scattering processes, new data, obtained from more complex calculations, is of current interest to experimentalists. Performing such calculations, however, places considerable demands on the computational resources to be provided by the target machine, in terms of both processor speed and memory requirement. Indeed, in some instances the computational requirements are so great that the proposed R-matrix calculations are intractable, even when utilising contemporary classic supercomputers. Historically, increases in the computational requirements of R-matrix computation were accommodated by porting the problem codes to a more powerful classic supercomputer. Although this approach has been successful in the past, it is no longer considered to be a satisfactory solution due to the limitations of current (and future) Von Neumann machines. As a consequence, there has been considerable interest in the high performance multicomputers, that have emerged over the last decade which appear to offer the computational resources required by contemporary R-matrix research. Unfortunately, developing codes for these machines is not as simple a task as it was to develop codes for successive classic supercomputers. The difficulty arises from the considerable differences in the computing models that exist between the two types of machine and results in the programming of multicomputers to be widely acknowledged as a difficult, time consuming and error-prone task. Nevertheless, unless parallel R-matrix computation is realised, important theoretical and experimental atomic physics research will continue to be hindered. This thesis describes work that was undertaken in

... and/or distributed systems. The contributions to this book are focused on topics most concerned in the trends of today's parallelcomputing. These range from parallel algorithmics, programming, tools, network computing to future parallelcomputing. Particular attention is paid to parallel numerics: linear algebra, differential equations, numerica...

The parallelcomputing of photon transport is investigated, the parallel algorithm and the parallelization of programs on parallelcomputers both with shared memory and with distributed memory are discussed. By analyzing the inherent law of the mathematics and physics model of photon transport according to the structure feature of parallelcomputers, using the strategy of 'to divide and conquer', adjusting the algorithm structure of the program, dissolving the data relationship, finding parallel liable ingredients and creating large grain parallel subtasks, the sequential computing of photon transport into is efficiently transformed into parallel and vector computing. The program was run on various HP parallelcomputers such as the HY-1 (PVP), the Challenge (SMP) and the YH-3 (MPP) and very good parallel speedup has been gotten

This paper deals with the simulation of 3‐D rotating flows based on the velocity‐vorticity formulation of the Navier‐Stokes equations in cylindrical coordinates. The governing equations are discretized by a finite difference method. The solution is advanced to a new time level by a two‐step process...... is that of solving a singular, large, sparse, over‐determined linear system of equations, and the iterative method CGLS is applied for this purpose. We discuss some of the mathematical and numerical aspects of this procedure and report on the performance of our software on a wide range of parallelcomputers. Darbe...

This report presents the results of our efforts to apply high-performance computing to entity-based simulations with a multi-use plugin for parallelcomputing. We use the term 'Entity-based simulation' to describe a class of simulation which includes both discrete event simulation and agent based simulation. What simulations of this class share, and what differs from more traditional models, is that the result sought is emergent from a large number of contributing entities. Logistic, economic and social simulations are members of this class where things or people are organized or self-organize to produce a solution. Entity-based problems never have an a priori ergodic principle that will greatly simplify calculations. Because the results of entity-based simulations can only be realized at scale, scalable computing is de rigueur for large problems. Having said that, the absence of a spatial organizing principal makes the decomposition of the problem onto processors problematic. In addition, practitioners in this domain commonly use the Java programming language which presents its own problems in a high-performance setting. The plugin we have developed, called the Parallel Particle Data Model, overcomes both of these obstacles and is now being used by two Sandia frameworks: the Decision Analysis Center, and the Seldon social simulation facility. While the ability to engage U.S.-sized problems is now available to the Decision Analysis Center, this plugin is central to the success of Seldon. Because Seldon relies on computationally intensive cognitive sub-models, this work is necessary to achieve the scale necessary for realistic results. With the recent upheavals in the financial markets, and the inscrutability of terrorist activity, this simulation domain will likely need a capability with ever greater fidelity. High-performance computing will play an important part in enabling that greater fidelity.

How are they programmed? This article provides an introduction. A parallelcomputer is a network of processors built for ... and have been used to solve problems much faster than a single ... in parallelcomputer design is to select an organization which ..... The most ambitious approach to parallelcomputing is to develop.

The technique of randomization has been employed to solve numerous prob­ lems of computing both sequentially and in parallel. Examples of randomized algorithms that are asymptotically better than their deterministic counterparts in solving various fundamental problems abound. Randomized algorithms have the advantages of simplicity and better performance both in theory and often in practice. This book is a collection of articles written by renowned experts in the area of randomized parallelcomputing. A brief introduction to randomized algorithms In the aflalysis of algorithms, at least three different measures of performance can be used: the best case, the worst case, and the average case. Often, the average case run time of an algorithm is much smaller than the worst case. 2 For instance, the worst case run time of Hoare's quicksort is O(n ), whereas its average case run time is only O( n log n). The average case analysis is conducted with an assumption on the input space. The assumption made to arrive at t...

This book serves as a complete and self-contained introduction to the principles of ComputationalFluid Dynamic (CFD) analysis. It is deliberately short (at approximately 300 pages) and can be used as a text for the first part of the course of applied CFD followed by a software tutorial. The main objectives of this non-traditional format are: 1) To introduce and explain, using simple examples where possible, the principles and methods of CFD analysis and to demystify the `black box’ of a CFD software tool, and 2) To provide a basic understanding of how CFD problems are set and

The increasing availability of asynchronous parallel processors has provided opportunities for original and useful work in scientific computing. However, the field of parallelcomputing is still in a highly volatile state, and researchers display a wide range of opinion about many fundamental questions such as models of parallelism, approaches for detecting and analyzing parallelism of algorithms, and tools that allow software developers and users to make effective use of diverse forms of complex hardware. This volume collects the work of researchers specializing in different aspects of parallelcomputing, who met to discuss the framework and the mechanics of numerical computing. The far-reaching impact of high-performance asynchronous systems is reflected in the wide variety of topics, which include scientific applications (e.g. linear algebra, lattice gauge simulation, ordinary and partial differential equations), models of parallelism, parallel language features, task scheduling, automatic parallelization techniques, tools for algorithm development in parallel environments, and system design issues

In its 3rd revised and extended edition the book offers an overview of the techniques used to solve problems in fluid mechanics on computers and describes in detail those most often used in practice. Included are advanced methods in computationalfluid dynamics, like direct and large-eddy simulation of turbulence, multigrid methods, parallelcomputing, moving grids, structured, block-structured and unstructured boundary-fitted grids, free surface flows. The 3rd edition contains a new section dealing with grid quality and an extended description of discretization methods. The book shows common roots and basic principles for many different methods. The book also contains a great deal of practical advice for code developers and users, it is designed to be equally useful to beginners and experts. The issues of numerical accuracy, estimation and reduction of numerical errors are dealt with in detail, with many examples. A full-feature user-friendly demo-version of a commercial CFD software has been added, which ca...

Describes parallelcomputing and presents inexpensive ways to implement a virtual parallelcomputer with multiple Web servers. Highlights include performance measurement of parallel systems; models for using Java and intranet technology including single server, multiple clients and multiple servers, single client; and a comparison of CGI (common…

Methods, systems, and products are disclosed for broadcasting a message in a parallelcomputer. The parallelcomputer includes a plurality of compute nodes connected together using a data communications network. The data communications network optimized for point to point data communications and is characterized by at least two dimensions. The compute nodes are organized into at least one operational group of compute nodes for collective parallel operations of the parallelcomputer. One compute node of the operational group assigned to be a logical root. Broadcasting a message in a parallelcomputer includes: establishing a Hamiltonian path along all of the compute nodes in at least one plane of the data communications network and in the operational group; and broadcasting, by the logical root to the remaining compute nodes, the logical root's message along the established Hamiltonian path.

The rapid advancement of computational capability including speed and memory size has prompted the wide use of computationalfluid dynamics (CFD) codes to simulate complex flow systems. CFD simulations are used to study the operating problems encountered in system, to evaluate the impacts of operation/design parameters on the performance of a system, and to investigate novel design concepts. CFD codes are generally developed based on the conservation laws of mass, momentum, and energy that govern the characteristics of a flow. The governing equations are simplified and discretized for a selected computational grid system. Numerical methods are selected to simplify and calculate approximate flow properties. For turbulent, reacting, and multiphase flow systems the complex processes relating to these aspects of the flow, i.e., turbulent diffusion, combustion kinetics, interfacial drag and heat and mass transfer, etc., are described in mathematical models, based on a combination of fundamental physics and empirical data, that are incorporated into the code. CFD simulation has been applied to a large variety of practical and industrial scale flow systems.

The MHD, or single fluid, model for a plasma has long been known to provide a surprisingly good description of much of the observed nonlinear dynamics of confined plasmas, considering its simple nature compared to the complexity of the real system. On the other hand, some of the supposed agreement arises from the lack of the detailed measurements that are needed to distinguish MHD from more sophisticated models that incorporate slower time scale processes. At present, a number of factors combine to make models beyond MHD of practical interest. Computational considerations still favor fluid rather than particle models for description of the full plasma, and suggest an approach that starts from a set of fluid-like equations that extends MHD to slower time scales and more accurate parallel dynamics. This paper summarizes a set of two-fluid equations for toroidal (tokamak) geometry that has been developed and tested as the MH3D-T code [1] and some results from the model. The electrons and ions are described as separate fluids. The code and its original MHD version, MH3D [2], are the first numerical, initial value models in toroidal geometry that include the full 3D (fluid) compressibility and electromagnetic effects. Previous nonlinear MHD codes for toroidal geometry have, in practice, neglected the plasma density evolution, on the grounds that MHD plasmas are only weakly compressible and that the background density variation is weaker than the temperature variation. Analytically, the common use of toroidal plasma models based on aspect ratio expansion, such as reduced MHD, has reinforced this impression, since this ordering reduces plasma compressibility effects. For two-fluid plasmas, the density evolution cannot be neglected in principle, since it provides the basic driving energy for the diamagnetic drifts of the electrons and ions perpendicular to the magnetic field. It also strongly influences the parallel dynamics, in combination with the parallel thermal

A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel variation of the rsync protocol, and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in the storage and that was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallelcomputer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size.

Full Text Available There is a need for compiler technology that, given the source program, will generate efficient parallel codes for different architectures with minimal user involvement. Parallelcomputation is becoming indispensable in solving large-scale problems in science and engineering. Yet, the use of parallelcomputation is limited by the high costs of developing the needed software. To overcome this difficulty we advocate a comprehensive approach to the development of scalable architecture-independent software for scientific computation based on our experience with equational programming language (EPL. Our approach is based on a program decomposition, parallel code synthesis, and run-time support for parallel scientific computation. The program decomposition is guided by the source program annotations provided by the user. The synthesis of parallel code is based on configurations that describe the overall computation as a set of interacting components. Run-time support is provided by the compiler-generated code that redistributes computation and data during object program execution. The generated parallel code is optimized using techniques of data alignment, operator placement, wavefront determination, and memory optimization. In this article we discuss annotations, configurations, parallel code generation, and run-time support suitable for parallel programs written in the functional parallel programming language EPL and in Fortran.

A simulation code for hydrodynamical phenomena which is based on the liquid-gas model of lattice-gas fluid is parallelized by using MPI (Message Passing Interface) library. The parallelized code can be applied to the larger size of the simulations than the non-parallelized code. The calculation times of the parallelized code on VPP500 (Vector-Parallel super computer with dispersed memory units), AP3000 (Scalar-parallel server with dispersed memory units), and a workstation cluster decreased in inverse proportion to the number of processors. (author)

Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallelcomputation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallelcomputations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm sup 3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm sup 2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.

A brief summary of the computer environment used for calculating three dimensional unsteady ComputationalFluid Dynamic (CFD) results is presented. This environment requires a super computer as well as massively parallel processors (MPP's) and clusters of workstations acting as a single MPP (by concurrently working on the same task) provide the required computational bandwidth for CFD calculations of transient problems. The cluster of reduced instruction set computers (RISC) is a recent advent based on the low cost and high performance that workstation vendors provide. The cluster, with the proper software can act as a multiple instruction/multiple data (MIMD) machine. A new set of software tools is being designed specifically to address visualizing 3D unsteady CFD results in these environments. Three user's manuals for the parallel version of Visual3, pV3, revision 1.00 make up the bulk of this report.

Evolutionary algorithms (EAs) are metaheuristics that learn from natural collective behavior and are applied to solve optimization problems in domains such as scheduling, engineering, bioinformatics, and finance. Such applications demand acceptable solutions with high-speed execution using finite computational resources. Therefore, there have been many attempts to develop platforms for running parallel EAs using multicore machines, massively parallel cluster machines, or grid computing environments. Recent advances in general-purpose computing on graphics processing units (GPGPU) have opened u

Phase 1 is complete for the development of a computationalfluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallelcomputing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.

Collectively loading an application in a parallelcomputer, the parallelcomputer comprising a plurality of compute nodes, including: identifying, by a parallelcomputer control system, a subset of compute nodes in the parallelcomputer to execute a job; selecting, by the parallelcomputer control system, one of the subset of compute nodes in the parallelcomputer as a job leader compute node; retrieving, by the job leader compute node from computer memory, an application for executing the job; and broadcasting, by the job leader to the subset of compute nodes in the parallelcomputer, the application for executing the job.

Parallelcomputing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.

Animating fluids like water, smoke, and fire using physics-based simulation is increasingly important in visual effects, in particular in movies, like The Day After Tomorrow, and in computer games. This book provides a practical introduction to fluid simulation for graphics. The focus is on animating fully three-dimensional incompressible flow, from understanding the math and the algorithms to the actual implementation.

The computing power available to scientists and engineers has increased dramatically in the past decade, due in part to progress in making massively parallelcomputing practical and available. The expectation for these machines has been great. The reality is that progress has been slower than expected. Nevertheless, massively parallelcomputing is beginning to realize its potential for enabling significant break-throughs in science and engineering. This paper provides a perspective on the state of the field, colored by the authors' experiences using large scale parallel machines at Sandia National Laboratories. We address trends in hardware, system software and algorithms, and we also offer our view of the forces shaping the parallelcomputing industry.

In this paper, we show how to efficiently solve large linear systems on parallelcomputers. These linear systems arise from discretization of scientific computing problems described by systems of partial differential equations. We show how to get a discrete finite dimensional system from the continuous problem and the chosen conjugate gradient iterative algorithm is briefly described. Then, the different kinds of parallel architectures are reviewed and their advantages and deficiencies are emphasized. We sketch the problems found in programming the conjugate gradient method on parallelcomputers. For this algorithm to be efficient on parallel machines, domain decomposition techniques are introduced. We give results of numerical experiments showing that these techniques allow a good rate of convergence for the conjugate gradient algorithm as well as computational speeds in excess of a billion of floating point operations per second. (author). 5 refs., 11 figs., 2 tabs., 1 inset

We propose a parallel quantum computing mode for ensemble quantum computer. In this mode, some qubits are in pure states while other qubits are in mixed states. It enables a single ensemble quantum computer to perform 'single-instruction-multidata' type of parallelcomputation. Parallel quantum computing can provide additional speedup in Grover's algorithm and Shor's algorithm. In addition, it also makes a fuller use of qubit resources in an ensemble quantum computer. As a result, some qubits discarded in the preparation of an effective pure state in the Schulman-Varizani and the Cleve-DiVincenzo algorithms can be reutilized

Programming is now parallel programming. Much as structured programming revolutionized traditional serial programming decades ago, a new kind of structured programming, based on patterns, is relevant to parallel programming today. Parallelcomputing experts and industry insiders Michael McCool, Arch Robison, and James Reinders describe how to design and implement maintainable and efficient parallel algorithms using a pattern-based approach. They present both theory and practice, and give detailed concrete examples using multiple programming models. Examples are primarily given using two of th

Full Text Available The authors describe the development of a 3D parallel Fluid–Structure–Interaction (FSI) solver and its application to benchmark problems. Fluid and solid domains are discretised using and edge-based finite-volume scheme for efficient parallel...

Our goal is to develop software libraries and applications for astrophysical fluid dynamics simulations in multidimensions that will enable us to resolve the large spatial and temporal variations that inevitably arise due to gravity, fronts and microphysical phenomena. The software must run efficiently on parallelcomputers and be general enough to allow the incorporation of a wide variety of physics. Cosmological structure formation with realistic gas physics is the primary application driver in this work. Accurate simulations of e.g. galaxy formation require a spatial dynamic range (i.e., ratio of system scale to smallest resolved feature) of 104 or more in three dimensions in arbitrary topologies. We take this as our technical requirement. We have achieved, and in fact, surpassed these goals.

This paper deals with the simulation of 3‐D rotating flows based on the velocity‐vorticity formulation of the Navier‐Stokes equations in cylindrical coordinates. The governing equations are discretized by a finite difference method. The solution is advanced to a new time level by a two‐step process....... In the first step, the vorticity at the new time level is computed using the velocity at the previous time level. In the second step, the velocity at the new time level is computed using the new vorticity. We discuss here the second part which is by far the most time‐consuming. The numerical problem...

The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.

This paper reports that Fermilab's Advanced Computer Program (ACP) has been developing cost effective, yet practical, parallelcomputers for high energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor, has been developed for theoretical physics. Each $4000 node is a FORTRAN or C programmable pipelined 20 Mflops (peak), 10 MByte single board computer. These are plugged into a 16 port crossbar switch crate which handles both inter and intra crate communication. The crates are connected in a hypercube. Site oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256 node, 5 GFlop, system is under construction

In this paper parallelcomputational method in nuclear group constant calculation using collision probability method will be discuss. The main focus is on the calculation of collision matrix which need large amount of computational time. The geometry treated here is concentric cylinder. The calculation of collision probability matrix is carried out using semi analytic method using Beckley Naylor Function. To accelerate computation speed some computerparallel used to solve the problem. We used LINUX based parallelization using PVM software with C or fortran language. While in windows based we used socket programming using DELPHI or C builder. The calculation results shows the important of optimal weight for each processor in case there area many type of processor speed

Locating hardware faults in a parallelcomputer, including defining within a tree network of the parallelcomputer two or more sets of non-overlapping test levels of compute nodes of the network that together include all the data communications links of the network, each non-overlapping test level comprising two or more adjacent tiers of the tree; defining test cells within each non-overlapping test level, each test cell comprising a subtree of the tree including a subtree root compute node and all descendant compute nodes of the subtree root compute node within a non-overlapping test level; performing, separately on each set of non-overlapping test levels, an uplink test on all test cells in a set of non-overlapping test levels; and performing, separately from the uplink tests and separately on each set of non-overlapping test levels, a downlink test on all test cells in a set of non-overlapping test levels.

Advanced mathematical techniques and computer simulation play a major role in evaluating and enhancing the design of beverage cans, industrial, and transportation containers for improved performance. Numerical models are used to evaluate the impact requirements of containers used by the Department of Energy (DOE) for transporting radioactive materials. Many of these models are highly compute-intensive. An analysis may require several hours of computational time on current supercomputers despite the simplicity of the models being studied. As computer simulations and materials databases grow in complexity, massively parallelcomputers have become important tools. Massively parallelcomputational research at the Oak Ridge National Laboratory (ORNL) and its application to the impact analysis of shipping containers is briefly described in this paper

Internode data communications in a parallelcomputer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.

Methods, apparatus, and products are disclosed for link failure detection in a parallelcomputer including compute nodes connected in a rectangular mesh network, each pair of adjacent compute nodes in the rectangular mesh network connected together using a pair of links, that includes: assigning each compute node to either a first group or a second group such that adjacent compute nodes in the rectangular mesh network are assigned to different groups; sending, by each of the compute nodes assigned to the first group, a first test message to each adjacent compute node assigned to the second group; determining, by each of the compute nodes assigned to the second group, whether the first test message was received from each adjacent compute node assigned to the first group; and notifying a user, by each of the compute nodes assigned to the second group, whether the first test message was received.

The BLAST sequence comparison programs have been ported to a variety of parallelcomputers-the shared memory machine Cray Y-MP 8/864 and the distributed memory architectures Intel iPSC/860 and nCUBE. Additionally, the programs were ported to run on workstation clusters. We explain the parallelization techniques and consider the pros and cons of these methods. The BLAST programs are very well suited for parallelization for a moderate number of processors. We illustrate our results using the program blastp as an example. As input data for blastp, a 799 residue protein query sequence and the protein database PIR were used.

Changes are needed in the way that visualization is performed, if we expect the analysis of scientific data to be effective at the petascale and beyond. By using similar techniques as those used to parallelize simulations, such as parallel I/O, load balancing, and effective use of interprocess communication, the supercomputers that compute these datasets can also serve as analysis and visualization engines for them. Our team is assessing the feasibility of performing parallel scientific visualization on some of the most powerful computational resources of the U.S. Department of Energy's National Laboratories in order to pave the way for analyzing the next generation of computational results. This paper highlights some of the conclusions of that research.

Changes are needed in the way that visualization is performed, if we expect the analysis of scientific data to be effective at the petascale and beyond. By using similar techniques as those used to parallelize simulations, such as parallel I/O, load balancing, and effective use of interprocess communication, the supercomputers that compute these datasets can also serve as analysis and visualization engines for them. Our team is assessing the feasibility of performing parallel scientific visualization on some of the most powerful computational resources of the U.S. Department of Energy's National Laboratories in order to pave the way for analyzing the next generation of computational results. This paper highlights some of the conclusions of that research.

The past few years has seen a sea change in computer architecture that will impact every facet of our society as every electronic device from cell phone to supercomputer will need to confront parallelism of unprecedented scale. Whereas the conventional multicore approach (2, 4, and even 8 cores) adopted by the computing industry will eventually hit a performance plateau, the highest performance per watt and per chip area is achieved using manycore technology (hundreds or even thousands of cores). However, fully unleashing the potential of the manycore approach to ensure future advances in sustained computational performance will require fundamental advances in computer architecture and programming models that are nothing short of reinventing computing. In this paper we examine the reasons behind the movement to exponentially increasing parallelism, and its ramifications for system design, applications and programming models.

The past few years has seen a sea change in computer architecture that will impact every facet of our society as every electronic device from cell phone to supercomputer will need to confront parallelism of unprecedented scale. Whereas the conventional multicore approach (2, 4, and even 8 cores) adopted by the computing industry will eventually hit a performance plateau, the highest performance per watt and per chip area is achieved using manycore technology (hundreds or even thousands of cores). However, fully unleashing the potential of the manycore approach to ensure future advances in sustained computational performance will require fundamental advances in computer architecture and programming models that are nothing short of reinventing computing. In this paper we examine the reasons behind the movement to exponentially increasing parallelism, and its ramifications for system design, applications and programming models

Two papers are included in this progress report. In the first, the compressible Navier-Stokes equations have been used to compute leading edge receptivity of boundary layers over parabolic cylinders. Natural receptivity at the leading edge was simulated and Tollmien-Schlichting waves were observed to develop in response to an acoustic disturbance, applied through the farfield boundary conditions. To facilitate comparison with previous work, all computations were carried out at a free stream Mach number of 0.3. The spatial and temporal behavior of the flowfields are calculated through the use of finite volume algorithms and Runge-Kutta integration. The results are dominated by strong decay of the Tollmien-Schlichting wave due to the presence of the mean flow favorable pressure gradient. The effects of numerical dissipation, forcing frequency, and nose radius are studied. The Strouhal number is shown to have the greatest effect on the unsteady results. In the second paper, a transition model for low-speed flows, previously developed by Young et al., which incorporates first-mode (Tollmien-Schlichting) disturbance information from linear stability theory has been extended to high-speed flow by incorporating the effects of second mode disturbances. The transition model is incorporated into a Reynolds-averaged Navier-Stokes solver with a one-equation turbulence model. Results using a variable turbulent Prandtl number approach demonstrate that the current model accurately reproduces available experimental data for first and second-mode dominated transitional flows. The performance of the present model shows significant improvement over previous transition modeling attempts.

Temporal fringe pattern analysis is invaluable in transient phenomena studies but necessitates long processing times. Here we describe a parallelcomputing strategy based on the single-program multiple-data model and hyperthreading processor technology to reduce the execution time. In a two-node cluster workstation configuration we found that execution periods were reduced by 1.6 times when four virtual processors were used. To allow even lower execution times with an increasing number of processors, the time allocated for data transfer, data read, and waiting should be minimized. Parallelcomputing is found here to present a feasible approach to reduce execution times in temporal fringe pattern analysis

Endpoint-based parallel data processing in a parallel active messaging interface (`PAMI`) of a parallelcomputer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.

The need for high-performance computing together with the increasing trend from single processor to parallelcomputer architectures has leveraged the adoption of parallelcomputing. To benefit from parallelcomputing power, usually parallel algorithms are defined that can be mapped and executed

The modelling of the greatest part of environmental or industrial flow problems gives very similar types of equations. The considerable increase in computing capacity over the last ten years consequently allowed numerical models of growing complexity to be processed. The varied group of computer codes presented are now a complementary tool of experimental facilities to achieve studies in the field of fluid mechanics. Several codes applied in the nuclear field (reactors, cooling towers, exchangers, plumes...) are presented among others [fr

This work examines parallel VLSI implementations of nondeterministic algorithms. It is demonstrated that conventional pseudorandom number generators are unsuitable for highly parallel applications. Efficient parallel pseudorandom sequence generation can be accomplished using certain classes of elementary one-dimensional cellular automata. The pseudorandom numbers appear in parallel on each clock cycle. Extensive study of the properties of these new pseudorandom number generators is made using standard empirical random number tests, cycle length tests, and implementation considerations. Furthermore, it is shown these particular cellular automata can form the basis of efficient VLSI architectures for computations involved in the Monte Carlo simulation of both the percolation and Ising models from statistical mechanics. Finally, a variation on a Built-In Self-Test technique based upon cellular automata is presented. These Cellular Automata-Logic-Block-Observation (CALBO) circuits improve upon conventional design for testability circuitry.

ComputationalFluid Dynamics in Ventilation Design is a new title in the is a new title in the REHVA guidebook series. The guidebook is written for people who need to use and discuss results based on CFD predictions, and it gives insight into the subject for those who are not used to work with CFD...

Practical applications using massively parallelcomputer hardware first appeared during the 1980s. Their development was motivated by the need for computing power orders of magnitude beyond that available today for tasks such as numerical simulation of complex physical and biological processes, generation of interactive visual displays, satellite image analysis, and knowledge based systems. Representative of the first generation of this new class of computers is the Massively Parallel Processor (MPP). A team of scientists was provided the opportunity to test and implement their algorithms on the MPP. The first results are presented. The research spans a broad variety of applications including Earth sciences, physics, signal and image processing, computer science, and graphics. The performance of the MPP was very good. Results obtained using the Connection Machine and the Distributed Array Processor (DAP) are presented

Parallelcomputing can reduce the running time of the code MCNP effectively. With the MPI message transmitting software, MCNP5 can achieve its parallelcomputing on PC cluster with Windows operating system. Parallelcomputing performance of MCNP is influenced by factors such as the type, the complexity level and the parameter configuration of the computing problem. This paper analyzes the parallelcomputing performance of MCNP regarding with these factors and gives measures to improve the MCNP parallelcomputing performance. (authors)

These proceedings contain the articles presented at the named conference. These concern hardware and software for vector and parallel processors, numerical methods and algorithms for the computation on such processors, as well as applications of such methods to different fields of physics and related sciences. See hints under the relevant topics. (HSI)

rotator, even without the need for isotropic sections. To meet the need for computational power to perform image restoration of virtual tissue sections, parallel programming on GPUs has also been part of the project. This has lead to a significant change in paradigm for a previously developed surgical...

The emergence of high speed wide area networks makes grid computing a reality. However grid applications that need reliable data transfer still have difficulties to achieve optimal TCP performance due to network tuning of TCP window size to improve the bandwidth and to reduce latency on a high speed wide area network. The authors present a pure Java package called JPARSS (Java Parallel Secure Stream) that divides data into partitions that are sent over several parallel Java streams simultaneously and allows Java or Web applications to achieve optimal TCP performance in a gird environment without the necessity of tuning the TCP window size. Several experimental results are provided to show that using parallel stream is more effective than tuning TCP window size. In addition X.509 certificate based single sign-on mechanism and SSL based connection establishment are integrated into this package. Finally a few applications using this package will be discussed

Intranode data communications in a parallelcomputer that includes compute nodes configured to execute processes, where the data communications include: allocating, upon initialization of a first process of a computer node, a region of shared memory; establishing, by the first process, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; sending, to a second process on the same compute node, a data communications message without determining whether the second process has been initialized, including storing the data communications message in the message buffer of the second process; and upon initialization of the second process: retrieving, by the second process, a pointer to the second process's message buffer; and retrieving, by the second process from the second process's message buffer in dependence upon the pointer, the data communications message sent by the first process.

Intranode data communications in a parallelcomputer that includes compute nodes configured to execute processes, where the data communications include: allocating, upon initialization of a first process of a compute node, a region of shared memory; establishing, by the first process, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; sending, to a second process on the same compute node, a data communications message without determining whether the second process has been initialized, including storing the data communications message in the message buffer of the second process; and upon initialization of the second process: retrieving, by the second process, a pointer to the second process's message buffer; and retrieving, by the second process from the second process's message buffer in dependence upon the pointer, the data communications message sent by the first process.

The authors have continued to enhance their ability to use new massively parallel processing computers to solve time-domain electromagnetic problems. New vectorization techniques have improved the performance of their code DSI3D by factors of 5 to 15, depending on the computer used. New radiation boundary conditions and far-field transformations now allow the computation of radar cross-section values for complex objects. A new parallel-data extraction code has been developed that allows the extraction of data subsets from large problems, which have been run on parallelcomputers, for subsequent post-processing on workstations with enhanced graphics capabilities. A new charged-particle-pushing version of DSI3D is under development. Finally, DSI3D has become a focal point for several new Cooperative Research and Development Agreement activities with industrial companies such as Lockheed Advanced Development Company, Varian, Hughes Electron Dynamics Division, General Atomic, and Cray

This paper introduces probabilistic choice to synchronous parallel machine models; in particular parallel RAMs. The power of probabilistic choice in parallelcomputations is illustrate by parallelizing some known probabilistic sequential algorithms. The authors characterize the computational complexity of time, space, and processor bounded probabilistic parallel RAMs in terms of the computational complexity of probabilistic sequential RAMs. They show that parallelism uniformly speeds up time bounded probabilistic sequential RAM computations by nearly a quadratic factor. They also show that probabilistic choice can be eliminated from parallelcomputations by introducing nonuniformity

A C software module has been developed that creates lightweight processes (LWPs) dynamically to achieve parallelcomputing performance in a variety of engineering simulation and analysis applications to support NASA and DoD project tasks. The required interface between the module and the application it supports is simple, minimal and almost completely transparent to the user applications, and it can achieve nearly ideal computing speed-up on multi-CPU engineering workstations of all operating system platforms. The module can be integrated into an existing application (C, C++, Fortran and others) either as part of a compiled module or as a dynamically linked library (DLL).

The book is aimed at graduate students, researchers, engineers and physicists involved in flow computations. An up-to-date account is given of the present state-of-the-art of numerical methods employed in computationalfluid dynamics. The underlying numerical principles are treated with a fair amount of detail, using elementary mathematical analysis. Attention is given to difficulties arising from geometric complexity of the flow domain and of nonuniform structured boundary-fitted grids. Uniform accuracy and efficiency for singular perturbation problems is studied, pointing the way to accurate computation of flows at high Reynolds number. Much attention is given to stability analysis, and useful stability conditions are provided, some of them new, for many numerical schemes used in practice. Unified methods for compressible and incompressible flows are discussed. Numerical analysis of the shallow-water equations is included. The theory of hyperbolic conservation laws is treated. Godunov's order barrier and ho...

This book analyzes finite element theory as applied to computationalfluid mechanics. It includes a chapter on using the heat conduction equation to expose the essence of finite element theory, including higher-order accuracy and convergence in a common knowledge framework. Another chapter generalizes the algorithm to extend application to the nonlinearity of the Navier-Stokes equations. Other chapters are concerned with the analysis of a specific fluids mechanics problem class, including theory and applications. Some of the topics covered include finite element theory for linear mechanics; potential flow; weighted residuals/galerkin finite element theory; inviscid and convection dominated flows; boundary layers; parabolic three-dimensional flows; and viscous and rotational flows

A fundamental issue which directly impacts the scalability of current theoretical neural network models to massively parallel embodiments, in both software as well as hardware, is the inherent and unavoidable concurrent asynchronicity of emerging fine-grained computational ensembles and the possible emergence of chaotic manifestations. Previous analyses attributed dynamical instability to the topology of the interconnection matrix, to parasitic components or to propagation delays. However, researchers have observed the existence of emergent computational chaos in a concurrently asynchronous framework, independent of the network topology. Researcher present a methodology enabling the effective asynchronous operation of large-scale neural networks. Necessary and sufficient conditions guaranteeing concurrent asynchronous convergence are established in terms of contracting operators. Lyapunov exponents are computed formally to characterize the underlying nonlinear dynamics. Simulation results are presented to illustrate network convergence to the correct results, even in the presence of large delays.

The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallelcomputing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well. (paper)

The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallelcomputing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well.

As part of the Numerical Tokamak Project, the author has developed a (nearly) portable, one dimensional version of the GCPIC algorithm for particle-in-cell codes on parallelcomputers. This algorithm uses a spatial domain decomposition for the fields, and passes particles from one domain to another as the particles move spatially. With only minor changes, the code has been run in parallel on the Intel Delta, the Cray C-90, the IBM ES/9000 and a cluster of workstations. After a line by line translation into cmfortran, the code was also run on the CM-200. Impressive speeds have been achieved, both on the Intel Delta and the Cray C-90, around 30 nanoseconds per particle per time step. In addition, the author was able to isolate the data management modules, so that the physics modules were not changed much from their sequential version, and the data management modules can be used as open-quotes black boxes.close quotes

The problem of early vision is to transform one or more retinal illuminance images-pixel arrays-to image representations built out of such primitive visual features such as edges, regions, disparities, and clusters. These transformed representations form the input to later vision stages that perform higher level vision tasks including matching and recognition. Researchers developed algorithms for: (1) edge finding in the scale space formulation; (2) correlation methods for computing matches between pairs of images; and (3) clustering of data by neural networks. These algorithms are formulated for parallel implementation of SIMD machines, such as the Massively Parallel Processor, a 128 x 128 array processor with 1024 bits of local memory per processor. For some cases, researchers can show speedups of three orders of magnitude over serial implementations.

The modification of unsteady three-dimensional Navier-Stokes codes for application on massively parallel and distributed computing environments is investigated. The Euler/Navier-Stokes code TURNS (Transonic Unsteady Rotor Navier-Stokes) was chosen as a test bed because of its wide use by universities and industry. For the efficient implementation of TURNS on parallelcomputing systems, two algorithmic changes are developed. First, main modifications to the implicit operator, Lower-Upper Symmetric Gauss Seidel (LU-SGS) originally used in TURNS, is performed. Second, application of an inexact Newton method, coupled with a Krylov subspace iterative method (Newton-Krylov method) is carried out. Both techniques have been tried previously for the Euler equations mode of the code. In this work, we have extended the methods to the Navier-Stokes mode. Several new implicit operators were tried because of convergence problems of traditional operators with the high cell aspect ratio (CAR) grids needed for viscous calculations on structured grids. Promising results for both Euler and Navier-Stokes cases are presented for these operators. For the efficient implementation of Newton-Krylov methods to the Navier-Stokes mode of TURNS, efficient preconditioners must be used. The parallel implicit operators used in the previous step are employed as preconditioners and the results are compared. The Message Passing Interface (MPI) protocol has been used because of its portability to various parallel architectures. It should be noted that the proposed methodology is general and can be applied to several other CFD codes (e.g. OVERFLOW).

In the probabilistic approach for history matching, the information from the dynamic data is merged with the prior geologic information in order to generate permeability models consistent with the observed dynamic data as well as the prior geology. The relationship between dynamic response data and reservoir attributes may vary in different regions of the reservoir due to spatial variations in reservoir attributes, fluid properties, well configuration, flow constrains on wells etc. This implies probabilistic approach should then update different regions of the reservoir in different ways. This necessitates delineation of multiple reservoir domains in order to increase the accuracy of the approach. The research focuses on a probabilistic approach to integrate dynamic data that ensures consistency between reservoir models developed from one stage to the next. The algorithm relies on efficient parameterization of the dynamic data integration problem and permits rapid assessment of the updated reservoir model at each stage. The report also outlines various domain decomposition schemes from the perspective of increasing the accuracy of probabilistic approach of history matching. Research progress in three important areas of the project are discussed: {lg_bullet}Validation and testing the probabilistic approach to incorporating production data in reservoir models. {lg_bullet}Development of a robust scheme for identifying reservoir regions that will result in a more robust parameterization of the history matching process. {lg_bullet}Testing commercial simulators for parallel capability and development of a parallel algorithm for history matching.

First results got on massively parallelcomputers (Multiple Instruction Multiple Data and Simple Instruction Multiple Data) allow to consider building of coupled models with high resolutions. This would make possible simulation of thermoaline circulation and other interaction phenomena between atmosphere and ocean. The increasing of computers powers, and then the improvement of resolution will go us to revise our approximations. Then hydrostatic approximation (in ocean circulation) will not be valid when the grid mesh will be of a dimension lower than a few kilometers: We shall have to find other models. The expert appraisement got in numerical analysis at the Center of Limeil-Valenton (CEL-V) will be used again to imagine global models taking in account atmosphere, ocean, ice floe and biosphere, allowing climate simulation until a regional scale

Simulation of viscous three-dimensional fluid flow typically involves a large number of unknowns. When free surfaces are included, the number of unknowns increases dramatically. Consequently, this class of problem is an obvious application of parallel high performance computing. We describe parallelcomputation of viscous, incompressible, free surface, Newtonian fluid flow problems that include dynamic contact fines. The Galerkin finite element method was used to discretize the fully-coupled governing conservation equations and a ''pseudo-solid'' mesh mapping approach was used to determine the shape of the free surface. In this approach, the finite element mesh is allowed to deform to satisfy quasi-static solid mechanics equations subject to geometric or kinematic constraints on the boundaries. As a result, nodal displacements must be included in the set of unknowns. Other issues discussed are the proper constraints appearing along the dynamic contact line in three dimensions. Issues affecting efficient parallel simulations include problem decomposition to equally distribute computational work among a SPMD computer and determination of robust, scalable preconditioners for the distributed matrix systems that must be solved. Solution continuation strategies important for serial simulations have an enhanced relevance in a parallel coquting environment due to the difficulty of solving large scale systems. Parallelcomputations will be demonstrated on an example taken from the coating flow industry: flow in the vicinity of a slot coater edge. This is a three dimensional free surface problem possessing a contact line that advances at the web speed in one region but transitions to static behavior in another region. As such, a significant fraction of the computational time is devoted to processing boundary data. Discussion focuses on parallel speed ups for fixed problem size, a class of problems of immediate practical importance

The authors present a new method for computing solutions of conservation laws based on the use of cellular automata with the method of characteristics. The method exploits the high degree of parallelism available with cellular automata and retains important features of the method of characteristics. It yields high numerical accuracy and extends naturally to adaptive meshes and domain decomposition methods for perturbed conservation laws. They describe the method and its implementation for a Dirichlet problem with a single conservation law for the one-dimensional case. Numerical results for the one-dimensional law with the classical Burgers nonlinearity or the Buckley-Leverett equation show good numerical accuracy outside the neighborhood of the shocks. The error in the area of the shocks is of the order of the mesh size. The algorithm is well suited for execution on both massively parallelcomputers and vector machines. They present timing results for an Alliant FX/8, Connection Machine Model 2, and CRAY X-MP.

A cluster for parallelcomputation with MATLAB software the COCGT - Cluster for Optimizing Computing in Gamma ray Transmission methods, is implemented. The implementation correspond to creation of a local net of computers, facilities and configurations of software, as well as the accomplishment of cluster tests for determine and optimizing of performance in the data processing. The COCGT implementation was required by data computation from gamma transmission measurements applied to fluid dynamic and tomography reconstruction in a FCC-Fluid Catalytic Cracking cold pilot unity, and simulation data as well. As an initial test the determination of SVD - Singular Values Decomposition - of random matrix with dimension (n , n), n=1000, using the Girco's law modified, revealed that COCGT was faster in comparison to the literature [1] cluster, which is similar and operates at the same conditions. Solution of a system of linear equations provided a new test for the COCGT performance by processing a square matrix with n=10000, computing time was 27 s and for square matrix with n=12000, computation time was 45 s. For determination of the cluster behavior in relation to 'parfor' (parallel for-loop) and 'spmd' (single program multiple data), two codes were used containing those two commands and the same problem: determination of SVD of a square matrix with n= 1000. The execution of codes by means of COCGT proved: 1) for the code with 'parfor', the performance improved with the labs number from 1 to 8 labs; 2) for the code 'spmd', just 1 lab (core) was enough to process and give results in less than 1 s. In similar situation, with the difference that now the SVD will be determined from square matrix with n1500, for code with 'parfor', and n=7000, for code with 'spmd'. That results take to conclusions: 1) for the code with 'parfor', the behavior was the same already described above; 2) for code with 'spmd', the same besides having produced a larger performance, it supports a

The Illiac IV is a parallel-structure computer with computing power an order of magnitude greater than that of conventional computers. It can be used for experimental tasks in fluid dynamics which can be simulated more economically, for simulating flows that cannot be studied by experiment, and for combining computer and experimental simulations. The architecture of Illiac IV is described, and the use of its parallel operation is demonstrated on the example of its solution of the one-dimensional wave equation. For fluid dynamics problems, a special FORTRAN-like vector programming language was devised, called CFD language. Two applications are described in detail: (1) the determination of the flowfield around the space shuttle, and (2) the computation of transonic turbulent separated flow past a thick biconvex airfoil.

The book is aimed at graduate students, researchers, engineers and physicists involved in flow computations. An up-to-date account is given of the present state- of-the-art of numerical methods employed in computationalfluid dynamics. The underlying numerical principles are treated with a fair amount of detail, using elementary mathematical analysis. Attention is given to difficulties arising from geometric complexity of the flow domain and of nonuniform structured boundary-fitted grids. Uniform accuracy and efficiency for singular perturbation problems is studied, pointing the way to accurate computation of flows at high Reynolds number. Much attention is given to stability analysis, and useful stability conditions are provided, some of them new, for many numerical schemes used in practice. Unified methods for compressible and incompressible flows are discussed. Numerical analysis of the shallow-water equations is included. The theory of hyperbolic conservation laws is treated. Godunov's order barrier and how to overcome it by means of slope-limited schemes is discussed. An introduction is given to efficient iterative solution methods, using Krylov subspace and multigrid acceleration. Many pointers are given to recent literature, to help the reader to quickly reach the current research frontier. (orig.)

Advances in computationalfluid dynamics (CFD), turbulence simulation, and parallelcomputing have made feasible the development of three-dimensional (3-D) single-phase and two-phase flow CFD codes that can simulate fluid flow and heat transfer in realistic reactor geometries with significantly reduced reliance, especially in single phase, on empirical correlations. The objective of this work was to assess the predictive power and computational efficiency of a CFD code in the analysis of a challenging single-phase light water reactor problem, as well as to identify areas where further improvements are needed

Evaporation is ubiquitous in nature, but very few attempts have been made in the past to couple the effects of evaporation with fluid flow behavior. In this theoretical paper we have discussed the effects of evaporation on the dynamics of steady state thermocapillary convection in a two-dimensional rectangular container. The liquid is heated by differentially heated sidewalls and mass loss from the interface due to evaporation is compensated by the liquid entering into the container through a lower inlet, thus keeping the thickness of the liquid layer constant. We show that for an evaporating liquid one can obtain a plane parallel base state profile which depends on the evaporative mass flux.

This article describes methods of external sorting in which the entire main computer memory is used for the internal sorting of entries, forming out of them sorted segments of the greatest possible size, and outputting them to external memories. The obtained segments are merged into larger segments until all entries form one ordered segment. The described methods are suitable for sequential files stored on magnetic tape. The needs of the sorting algorithm can be met by using the relatively slow peripheral storage devices (e.g., tapes, disks, drums). The efficiency of the external sorting methods is determined by calculating the total sorting time as a function of the number of entries to be sorted and the number of parallel processors participating in the sorting process

Contact-impact algorithms on parallelcomputers are discussed within the context of explicit finite element analysis. The algorithms concerned include a contact searching algorithm and an algorithm for contact force calculations. The contact searching algorithm is based on the territory concept of the general HITA algorithm. However, no distinction is made between different contact bodies, or between different contact surfaces. All contact segments from contact boundaries are taken as a single set. Hierarchy territories and contact territories are expanded. A three-dimensional bucket sort algorithm is used to sort contact nodes. The defence node algorithm is used in the calculation of contact forces. Both the contact searching algorithm and the defence node algorithm are implemented on the connection machine CM-200. The performance of the algorithms is examined under different circumstances, and numerical results are presented. ((orig.))

Fresh groundwater reserves in coastal aquifers are threatened by sea-level rise, extreme weather conditions, increasing urbanization and associated groundwater extraction rates. To counteract these threats, accurate high-resolution numerical models are required to optimize the management of these precious reserves. The major model drawbacks are long run times and large memory requirements, limiting the predictive power of these models. Distributed memory parallelcomputing is an efficient technique for reducing run times and memory requirements, where the problem is divided over multiple processor cores. A new Parallel Krylov Solver (PKS) for SEAWAT is presented. PKS has recently been applied to MODFLOW and includes Conjugate Gradient (CG) and Biconjugate Gradient Stabilized (BiCGSTAB) linear accelerators. Both accelerators are preconditioned by an overlapping additive Schwarz preconditioner in a way that: a) subdomains are partitioned using Recursive Coordinate Bisection (RCB) load balancing, b) each subdomain uses local memory only and communicates with other subdomains by Message Passing Interface (MPI) within the linear accelerator, c) it is fully integrated in SEAWAT. Within SEAWAT, the PKS-CG solver replaces the Preconditioned Conjugate Gradient (PCG) solver for solving the variable-density groundwater flow equation and the PKS-BiCGSTAB solver replaces the Generalized Conjugate Gradient (GCG) solver for solving the advection-diffusion equation. PKS supports the third-order Total Variation Diminishing (TVD) scheme for computing advection. Benchmarks were performed on the Dutch national supercomputer (https://userinfo.surfsara.nl/systems/cartesius) using up to 128 cores, for a synthetic 3D Henry model (100 million cells) and the real-life Sand Engine model ( 10 million cells). The Sand Engine model was used to investigate the potential effect of the long-term morphological evolution of a large sand replenishment and climate change on fresh groundwater resources

This paper describes a new algorithm for the computation of two-dimensional resistive magnetohydrodynamic (MHD) and two-fluid studies of magnetic reconnection in plasmas. It has been implemented on several parallel platforms and shows good scalability up to 32 CPUs for reasonable problem sizes. A fixed, nonuniform rectangular mesh is used to resolve the different spatial scales in the reconnection problem. The resistive MHD version of the code uses an implicit/explicit hybrid method, while the two-fluid version uses an alternating-direction implicit (ADI) method. The technique has proven useful for comparing several different theories of collisional and collisionless reconnection

Full Text Available Performance growth of single-core processors has come to a halt in the past decade, but was re-enabled by the introduction of parallelism in processors. Multicore frameworks along with Graphical Processing Units empowered to enhance parallelism broadly. Couples of compilers are updated to developing challenges forsynchronization and threading issues. Appropriate program and algorithm classifications will have advantage to a great extent to the group of software engineers to get opportunities for effective parallelization. In present work we investigated current species for classification of algorithms, in that related work on classification is discussed along with the comparison of issues that challenges the classification. The set of algorithms are chosen which matches the structure with different issues and perform given task. We have tested these algorithms utilizing existing automatic species extraction toolsalong with Bones compiler. We have added functionalities to existing tool, providing a more detailed characterization. The contributions of our work include support for pointer arithmetic, conditional and incremental statements, user defined types, constants and mathematical functions. With this, we can retain significant data which is not captured by original speciesof algorithms. We executed new theories into the device, empowering automatic characterization of program code.

Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallelcomputer. The parallelcomputer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallelcomputer. Broadcasting collective operation contributions throughout a parallelcomputer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.

Parallel programming is applied to multiple processors connected in Ethernet. Data exchanges between tasks located in each processing element are realized by two ways. One is socket which is standard library on recent UNIX operating systems. Another is a network connecting software, named as Parallel Virtual Machine (PVM) which is a free software developed by ORNL, to use many workstations connected to network as a parallelcomputer. This paper discusses the availability of parallelcomputing using network and UNIX workstations and comparison between specialized parallel systems (Transputer and iPSC/860) in a Monte Carlo simulation which generally shows high parallelization ratio. (author)

This paper describes the trend of parallelcomputers used in geophysical exploration. Around 1993 was the early days when the parallelcomputers began to be used for geophysical exploration. Classification of these computers those days was mainly MIMD (multiple instruction stream, multiple data stream), SIMD (single instruction stream, multiple data stream) and the like. Parallelcomputers were publicized in the 1994 meeting of the Geophysical Exploration Society as a `high precision imaging technology`. Concerning the library of parallelcomputers, there was a shift to PVM (parallel virtual machine) in 1993 and to MPI (message passing interface) in 1995. In addition, the compiler of FORTRAN90 was released with support implemented for data parallel and vector computers. In 1993, networks used were Ethernet, FDDI, CDDI and HIPPI. In 1995, the OC-3 products under ATM began to propagate. However, ATM remains to be an interoffice high speed network because the ATM service has not spread yet for the public network. 1 ref.

Some reactive scattering codes have been ported on different innovative computer architectures ranging from massively parallel machines to clustered workstations. The porting has required a drastic restructuring of the codes to single out computationally decoupled cpu intensive subsections. The suitability of different theoretical approaches for parallel and distributed computing restructuring is discussed and the efficiency of related algorithms evaluated

Mapping parallel algorithms to parallelcomputing platforms requires several activities such as the analysis of the parallel algorithm, the definition of the logical configuration of the platform, the mapping of the algorithm to the logical configuration platform and the implementation of the

Three approaches for using computers to improve basic fluids engineering education are presented. The use of computationalfluid dynamics solutions to fundamental flow problems is discussed. The use of interactive, highly graphical software which operates on either a modern workstation or personal computer is highlighted. And finally, the development of 'textbooks' and teaching aids which are used and distributed on the World Wide Web is described. Arguments for and against this technology as applied to undergraduate education are also discussed.

Fluid Dynamics Theory, Computation, and Numerical Simulation is the only available book that extends the classical field of fluid dynamics into the realm of scientific computing in a way that is both comprehensive and accessible to the beginner The theory of fluid dynamics, and the implementation of solution procedures into numerical algorithms, are discussed hand-in-hand and with reference to computer programming This book is an accessible introduction to theoretical and computationalfluid dynamics (CFD), written from a modern perspective that unifies theory and numerical practice There are several additions and subject expansions in the Second Edition of Fluid Dynamics, including new Matlab and FORTRAN codes Two distinguishing features of the discourse are solution procedures and algorithms are developed immediately after problem formulations are presented, and numerical methods are introduced on a need-to-know basis and in increasing order of difficulty Matlab codes are presented and discussed for a broad...

Fluid Dynamics: Theory, Computation, and Numerical Simulation is the only available book that extends the classical field of fluid dynamics into the realm of scientific computing in a way that is both comprehensive and accessible to the beginner. The theory of fluid dynamics, and the implementation of solution procedures into numerical algorithms, are discussed hand-in-hand and with reference to computer programming. This book is an accessible introduction to theoretical and computationalfluid dynamics (CFD), written from a modern perspective that unifies theory and numerical practice. There are several additions and subject expansions in the Second Edition of Fluid Dynamics, including new Matlab and FORTRAN codes. Two distinguishing features of the discourse are: solution procedures and algorithms are developed immediately after problem formulations are presented, and numerical methods are introduced on a need-to-know basis and in increasing order of difficulty. Matlab codes are presented and discussed for ...

CFD is the shortname for ComputationalFluid Dynamics and is a numerical method by means of which we can analyze systems containing fluids. For instance systems dealing with heat flow or smoke control systems acting when a fire occur in a building.......CFD is the shortname for ComputationalFluid Dynamics and is a numerical method by means of which we can analyze systems containing fluids. For instance systems dealing with heat flow or smoke control systems acting when a fire occur in a building....

Data communications in a parallel active messaging interface (`PAMI`) of a parallelcomputer composed of compute nodes that execute a parallel application, each compute node including application processors that execute the parallel application and at least one management processor dedicated to gathering information regarding data communications. The PAMI is composed of data communications endpoints, each endpoint composed of a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources. Embodiments function by gathering call site statistics describing data communications resulting from execution of data communications instructions and identifying in dependence upon the call cite statistics a data communications algorithm for use in executing a data communications instruction at a call site in the parallel application.

Terrestrial ecosystems are a primary component of research on global environmental change. Observational and modeling research on terrestrial ecosystems at the global scale, however, has lagged behind their counterparts for oceanic and atmospheric systems, largely because the unique challenges associated with the tremendous diversity and complexity of terrestrial ecosystems. There are 8 major types of terrestrial ecosystem: tropical rain forest, savannas, deserts, temperate grassland, deciduous forest, coniferous forest, tundra, and chaparral. The carbon cycle is an important mechanism in the coupling of terrestrial ecosystems with climate through biological fluxes of CO 2 . The influence of terrestrial ecosystems on atmospheric CO 2 can be modeled via several means at different timescales. Important processes include plant dynamics, change in land use, as well as ecosystem biogeography. Over the past several decades, many terrestrial ecosystem models (see the 'Model developments' section) have been developed to understand the interactions between terrestrial carbon storage and CO 2 concentration in the atmosphere, as well as the consequences of these interactions. Early TECMs generally adapted simple box-flow exchange models, in which photosynthetic CO 2 uptake and respiratory CO 2 release are simulated in an empirical manner with a small number of vegetation and soil carbon pools. Demands on kinds and amount of information required from global TECMs have grown. Recently, along with the rapid development of parallelcomputing, spatially explicit TECMs with detailed process based representations of carbon dynamics become attractive, because those models can readily incorporate a variety of additional ecosystem processes (such as dispersal, establishment, growth, mortality etc.) and environmental factors (such as landscape position, pest populations, disturbances, resource manipulations, etc.), and provide information to frame policy options for climate change

It is shown that any multivariate polynomial that can be computed sequentially in C steps and has degree d can be computed in parallel in 0((log d) (log C + log d)) steps using only (Cd)0(1) processors....

...] to support the design, prototyping, and simulation of parallel, distributed computations. In particular, VISA is meant to guide the choice of partitioning and communication strategies for such computations, based on their performance...

boundary value problem for each velocity component, are solved by the conjugate gradient method with a preconditioning based on the algebraic multi‐level iteration (AMLI) technique. The velocity is found from the computed pressure. The method is optimal in the sense that the computational work...... is proportional to the number of unknowns. Further, it is designed to exploit a massively parallelcomputer with distributed memory architecture. Numerical experiments on a Cray T3E computer illustrate the parallel performance of the method....

Full Text Available Numerical simulations of 1D Burgers equation and 2D sloshing problem were carried out to study numerical discrepancy between serial and parallelcomputations. The numerical domain was decomposed into 2 and 4 subdomains for parallelcomputations with message passing interface. The numerical solution of Burgers equation disclosed that fully explicit boundary conditions used on subdomains of parallelcomputation was responsible for the numerical discrepancy of transient solution between serial and parallelcomputations. Two dimensional sloshing problems in a rectangular domain were solved using OpenFOAM. After a lapse of initial transient time sloshing patterns of water were significantly different in serial and parallelcomputations although the same numerical conditions were given. Based on the histograms of pressure measured at two points near the wall the statistical characteristics of numerical solution was not affected by the number of subdomains as much as the transient solution was dependent on the number of subdomains.

Recently, applications of parallelcomputation for scientific calculations have increased from the need of the high speed calculation of large scale programs. At the JAERI computing center, an array processor FACOM 230-75 APU has installed to study the applicability of parallelcomputation for nuclear codes. We made some numerical experiments by using the APU on the methods of solution of tridiagonal linear equation which is an important problem in scientific calculations. Referring to the recent papers with parallel methods, we investigate eight ones. These are Gauss elimination method, Parallel Gauss method, Accelerated parallel Gauss method, Jacobi method, Recursive doubling method, Cyclic reduction method, Chebyshev iteration method, and Conjugate gradient method. The computing time and accuracy were compared among the methods on the basis of the numerical experiments. As the result, it is found that the Cyclic reduction method is best both in computing time and accuracy and the Gauss elimination method is the second one. (author)

During the design of the NERVA turbopump, numerous computer programs were developed for the analyses of fluid dynamic problems within the machine. Program descriptions, example cases, users instructions, and listings for the majority of these programs are presented.

Cloud computing is the development of parallelcomputing, distributed computing and grid computing. The development of cloud computing makes parallelcomputing come into people’s lives. Firstly, this paper expounds the concept of cloud computing and introduces two several traditional parallel programming model. Secondly, it analyzes and studies the principles, advantages and disadvantages of OpenMP, MPI and Map Reduce respectively. Finally, it takes MPI, OpenMP models compared to Map Reduce from the angle of cloud computing. The results of this paper are intended to provide a reference for the development of parallelcomputing.

Data communications in a parallel active messaging interface (`PAMI`) of a parallelcomputer, the parallelcomputer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a data communications instruction, the instruction characterized by an instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance with the instruction type, the transfer data from the origin endpoint to the target endpoint.

This book provides an accessible introduction to the basic theory of fluid mechanics and computationalfluid dynamics (CFD) from a modern perspective that unifies theory and numerical computation. Methods of scientific computing are introduced alongside with theoretical analysis and MATLAB® codes are presented and discussed for a broad range of topics: from interfacial shapes in hydrostatics, to vortex dynamics, to viscous flow, to turbulent flow, to panel methods for flow past airfoils. The third edition includes new topics, additional examples, solved and unsolved problems, and revised images. It adds more computational algorithms and MATLAB programs. It also incorporates discussion of the latest version of the fluid dynamics software library FDLIB, which is freely available online. FDLIB offers an extensive range of computer codes that demonstrate the implementation of elementary and advanced algorithms and provide an invaluable resource for research, teaching, classroom instruction, and self-study. This ...

Simulating fast transient phenomena involving fluids and structures in interaction for safety purposes requires both accurate and robust algorithms, and parallelcomputing to reduce the calculation time for industrial models. Managing kinematic constraints linking fluid and structural entities is thus a key issue and this contribution promotes a dual approach over the classical penalty approach, introducing arbitrary coefficients in the solution. This choice however severely increases the complexity of the problem, mainly due to non-permanent kinematic constraints. An innovative parallel strategy is therefore described, whose performances are demonstrated on significant examples exhibiting the full complexity of the target industrial simulations. (authors)

Past and current exploitation of parallelcomputing in High Energy Physics is summarized and a list of R & D projects in this area is presented. The applicability of new parallel hardware and software to physics problems is investigated, in the light of the requirements for computing power of LHC experiments and the current trends in the computer industry. Four main themes are discussed (possibilities for a finer grain of parallelism; fine-grain communication mechanism; usable parallel programming environment; different programming models and architectures, using standard commercial products). Parallelcomputing technology is potentially of interest for offline and vital for real time applications in LHC. A substantial investment in applications development and evaluation of state of the art hardware and software products is needed. A solid development environment is required at an early stage, before mainline LHC program development begins.

Calculation of the matrix-vector multiplication in the real-world problems often involves large matrix with arbitrary size. Therefore, parallelization is needed to speed up the calculation process that usually takes a long time. Graph partitioning techniques that have been discussed in the previous studies cannot be used to complete the parallelized calculation of matrix-vector multiplication with arbitrary size. This is due to the assumption of graph partitioning techniques that can only solve the square and symmetric matrix. Hypergraph partitioning techniques will overcome the shortcomings of the graph partitioning technique. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallelcomputing. CUDA (compute unified device architecture) is a parallelcomputing platform and programming model that was created by NVIDIA and implemented by the GPU (graphics processing unit).

Partitions of entries (linguistic structures) are studied that are intended for parallel data processing. The representations of formal languages with the aid of such structures is examined, and the relationships are considered between partitions of entries and abstract families of languages and automata. 18 references.

Parallel sorting techniques have become of practical interest with the advent of new multiprocessor architectures. The decreasing cost of these processors will probably in the future, make the solutions that are derived thereof to be more appealing. Efficient algorithms for sorting scheme that are encountered in a number of ...

This is a final report on work performed under NASA grant NAG-1-1112-FOP during the period March, 1990 through February 1995. Four major topics are covered: (1) solution of nonlinear poisson-type equations; (2) parallel reduced system conjugate gradient method; (3) orderings for conjugate gradient preconditioners, and (4) SOR as a preconditioner.

A task of random size T is split into M subtasks of lengths T1, . . . , TM, each of which is sent to one out of M parallel processors. Each processor may fail at a random time before completing its allocated task, and then has to restart it from the beginning. If X1, . . . ,XM are the total task ...

General-purpose Monte Carlo codes MVP/GMVP are well-vectorized and thus enable us to perform high-speed Monte Carlo calculations. In order to achieve more speedups, we parallelized the codes on the different types of parallelcomputing platforms or by using a standard parallelization library MPI. The platforms used for benchmark calculations are a distributed-memory vector-parallelcomputer Fujitsu VPP500, a distributed-memory massively parallelcomputer Intel paragon and a distributed-memory scalar-parallelcomputer Hitachi SR2201, IBM SP2. As mentioned generally, linear speedup could be obtained for large-scale problems but parallelization efficiency decreased as the batch size per a processing element(PE) was smaller. It was also found that the statistical uncertainty for assembly powers was less than 0.1% by the PWR full-core calculation with more than 10 million histories and it took about 1.5 hours by massively parallelcomputing. (author)

Parallelcomputation of manipulator forward dynamics is investigated. Considering three classes of algorithms for the solution of the problem, that is, the O(n), the O(n exp 2), and the O(n exp 3) algorithms, parallelism in the problem is analyzed. It is shown that the problem belongs to the class of NC and that the time and processors bounds are of O(log2/2n) and O(n exp 4), respectively. However, the fastest stable parallel algorithms achieve the computation time of O(n) and can be derived by parallelization of the O(n exp 3) serial algorithms. Parallelcomputation of the O(n exp 3) algorithms requires the development of parallel algorithms for a set of fundamentally different problems, that is, the Newton-Euler formulation, the computation of the inertia matrix, decomposition of the symmetric, positive definite matrix, and the solution of triangular systems. Parallel algorithms for this set of problems are developed which can be efficiently implemented on a unique architecture, a triangular array of n(n+2)/2 processors with a simple nearest-neighbor interconnection. This architecture is particularly suitable for VLSI and WSI implementations. The developed parallel algorithm, compared to the best serial O(n) algorithm, achieves an asymptotic speedup of more than two orders-of-magnitude in the computation the forward dynamics.

Computing requirement of large scale scientific computing has always been ahead of what state of the art hardware could supply in the form of supercomputers of the day. And for any single processor system the limit to increase in the computing power was realized a few years back itself. Now with the advent of parallelcomputing systems the availability of machines with the required computing power seems a reality. In this paper the author tries to visualize the future large scale scientific computing in the penultimate decade of the present century. The author summarized trends in parallelcomputers and emphasize the need for a better programming environment and software tools for optimal performance. The author concludes this paper with critique on parallel architectures, software tools and algorithms. (author). 10 refs., 2 tabs

Full Text Available Kary Ocaña,1 Daniel de Oliveira2 1National Laboratory of Scientific Computing, Petrópolis, Rio de Janeiro, 2Institute of Computing, Fluminense Federal University, Niterói, Brazil Abstract: Today's genomic experiments have to process the so-called "biological big data" that is now reaching the size of Terabytes and Petabytes. To process this huge amount of data, scientists may require weeks or months if they use their own workstations. Parallelism techniques and high-performance computing (HPC environments can be applied for reducing the total processing time and to ease the management, treatment, and analyses of this data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing unit requires the expertise from scientists to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists for processing their genomic experiments using HPC capabilities and parallelism techniques. This article brings a systematic review of literature that surveys the most recently published research involving genomics and parallelcomputing. Our objective is to gather the main characteristics, benefits, and challenges that can be considered by scientists when running their genomic experiments to benefit from parallelism techniques and HPC capabilities. Keywords: high-performance computing, genomic research, cloud computing, grid computing, cluster computing, parallelcomputing

In this paper we describe the theory and application of a parallel mesh optimisation procedure to obtain self-adapting finite element solutions on unstructured tetrahedral grids. The optimisation procedure adapts the tetrahedral mesh to the solution of a radiation transport or fluid flow problem without sacrificing the integrity of the boundary (geometry), or internal boundaries (regions) of the domain. The objective is to obtain a mesh which has both a uniform interpolation error in any direction and the element shapes are of good quality. This is accomplished with use of a non-Euclidean (anisotropic) metric which is related to the Hessian of the solution field. Appropriate scaling of the metric enables the resolution of multi-scale phenomena as encountered in transient incompressible fluids and multigroup transport calculations. The resulting metric is used to calculate element size and shape quality. The mesh optimisation method is based on a series of mesh connectivity and node position searches of the landscape defining mesh quality which is gauged by a functional. The mesh modification thus fits the solution field(s) in an optimal manner. The parallel mesh optimisation/adaptivity procedure presented in this paper is of general applicability. We illustrate this by applying it to a transient CFD (computationalfluid dynamics) problem. Incompressible flow past a cylinder at moderate Reynolds numbers is modelled to demonstrate that the mesh can follow transient flow features. (authors)

A Vlasov-Poisson-system is used for studying the time evolution of the charge-separation at a spatial one- as well as a two-dimensional plasma-edge. Ions are advanced in time using the Vlasov-equation. The whole three-dimensional velocity-space is considered leading to very time-consuming four-resp. five-dimensional fully kinetic simulations. In the 1D simulations electrons are assumed to behave adiabatic, i.e. they are Boltzmann-distributed, leading to a nonlinear Poisson-equation. In the 2D simulations a gyro-kinetic approximation is used for the electrons. The plasma is assumed to be initially neutral. The simulations are performed at an equidistant grid. A constant time-step is used for advancing the density-distribution function in time. The time-evolution of the distribution function is performed using a splitting scheme. Each dimension (x, y, υ x , υ y , υ z ) of the phase-space is advanced in time separately. The value of the distribution function for the next time is calculated from the value of an - in general - interstitial point at the present time (fractional shift). One-dimensional cubic-spline interpolation is used for calculating the interstitial function values. After the fractional shifts are performed for each dimension of the phase-space, a whole time-step for advancing the distribution function is finished. Afterwards the charge density is calculated, the Poisson-equation is solved and the electric field is calculated before the next time-step is performed. The fractional shift method sketched above was parallelized for p processors as follows. Considering first the shifts in y-direction, a proper parallelization strategy is to split the grid into p disjoint υ z -slices, which are sub-grids, each containing a different 1/p-th part of the υ z range but the whole range of all other dimensions. Each processor is responsible for performing the y-shifts on a different slice, which can be done in parallel without any communication between

To compare the performance and use of three massively parallel SIMD computers, we implemented a large air pollution model on the computers. Using a realistic large-scale model, we gain detailed insight about the performance of the three computers when used to solve large-scale scientific problems...

The focus of the research was on developing parallelcomputing algorithm for solving Eigen-values of the Boltzmam Neutron Transport Equation (BNTE) in a slab geometry using multi-grid approach. In response to the problem of slow execution of serial computing when solving large problems, such as BNTE, the study was focused on the design of parallelcomputing systems which was an evolution of serial computing that used multiple processing elements simultaneously to solve complex physical and mathematical problems. Finite element method (FEM) was used for the spatial discretization scheme, while angular discretization was accomplished by expanding the angular dependence in terms of Legendre polynomials. The eigenvalues representing the multiplication factors in the BNTE were determined by the power method. MATLAB Compiler Version 4.1 (R2009a) was used to compile the MATLAB codes of BNTE. The implemented parallel algorithms were enabled with matlabpool, a ParallelComputing Toolbox function. The option UseParallel was set to 'always' and the default value of the option was 'never'. When those conditions held, the solvers computed estimated gradients in parallel. The parallelcomputing system was used to handle all the bottlenecks in the matrix generated from the finite element scheme and each domain of the power method generated. The parallel algorithm was implemented on a Symmetric Multi Processor (SMP) cluster machine, which had Intel 32 bit quad-core x 86 processors. Convergence rates and timings for the algorithm on the SMP cluster machine were obtained. Numerical experiments indicated the designed parallel algorithm could reach perfect speedup and had good stability and scalability. (au)

Colour is used in computationalfluid dynamic (CFD) simulations in two key ways. First it is used to visualise the geometry and allow the engineers to be confident that the model constructed is a good representation of the engineering situation. Once an analysis has been completed, colour is used in post-processing the data from the simulations to illustrate the complex fluid mechanic phenomena under investigation. This paper describes these two uses of colour and provides some examples to il...

Ocean modeling plays an important role in both understanding the current climatic conditions and predicting future climate change. However, modeling the ocean circulation at various spatial and temporal scales is a very challenging computational task.

on the mass fraction transport equation. The importance of ?false? or numerical diffusion is also addressed in connection with the simple description of a supply opening. The different aspects of boundary conditions in the indoor environment as e.g. the simulation of Air Terminal Devices and the simulation......Nielsen, P.V. ComputationalFluid Dynamics and Room Air Movement. Indoor Air, International Journal of Indoor Environment and Health, Vol. 14, Supplement 7, pp. 134-143, 2004. ABSTRACT ComputationalFluid Dynamics (CFD) and new developments of CFD in the indoor environment as well as quality...... considerations are important elements in the study of energy consumption, thermal comfort and indoor air quality in buildings. The paper discusses the quality level of ComputationalFluid Dynamics and the involved schemes (first, second and third order schemes) by the use of the Smith and Hutton problem...

Parallelcomputing has been recognized as a solution to large computing problems. In High Energy Physics offline event reconstruction of detector data is a very large computing problem that has been solved with parallelcomputing techniques. A review of the parallel programming package CPS (Cooperative Processes Software) developed and used at Fermilab for offline reconstruction of Terabytes of data requiring the delivery of hundreds of Vax-Years per experiment is given. The Fermilab UNIX farms, consisting of 180 Silicon Graphics workstations and 144 IBM RS6000 workstations, are used to provide the computing power for the experiments. Fermilab has had a long history of providing production parallelcomputing starting with the ACP (Advanced Computer Project) Farms in 1986. The Fermilab UNIX Farms have been in production for over 2 years with 24 hour/day service to experimental user groups. Additional tools for management, control and monitoring these large systems will be described. Possible future directions for parallelcomputing in High Energy Physics will be given

The implementation and performance of a finite-difference algorithm for the compressible Navier-Stokes equations in two or three dimensions on the Connection Machine are described. This machine is a single-instruction multiple-data machine with up to 65536 physical processors. The implicit portion of the algorithm is of particular interest. Running times and megadrop rates are given for two- and three-dimensional problems. Included are comparisons with the standard codes on a Cray X-MP/48. 15 refs

In some fields such as meteorology, theoretical physics, quantum chemistry and hydrodynamics there are problems which involve so much computation that computers of the power of a thousand times a Cray 2 could be fully utilised if they were available. Since it is unlikely that uniprocessors of such power will be available, such large scale problems could be solved by using systems of computers running in parallel. This approach, of course, requires to find appropriate algorithms for the solution of such problems which can efficiently make use of a large number of computers working in parallel. 11 refs, 10 figs, 1 tab

This paper examines some activities in High Energy Physics that utilise parallelcomputing. The topic includes all computing from the proposed SIMD front end detectors, the farming applications, high-powered RISC processors and the large machines in the computer centers. We start by looking at the motivation behind using parallelism for general purpose computing. The developments around farming are then described from its simplest form to the more complex system in Fermilab. Finally, there is a list of some developments that are happening close to the experiments.

""I have always considered this book the best gift from one generation to the next in computationalfluid dynamics. I earnestly recommend this book to graduate students and practicing engineers for the pleasure of learning and a handy reference. The description of the basic concepts and fundamentals is thorough and is crystal clear for understanding. And since 1984, two newer editions have kept abreast to the new, relevant, and fully verified advancements in CFD.""-Joseph J.S. Shang, Wright State University""ComputationalFluid Mechanics and Heat Transfer is very well written to be used as a t

The ray-tracing sweep in discrete-ordinates, spatially discrete numerical approximation methods applied to the linear, steady-state, plane-parallel, mono-energetic, azimuthally symmetric, neutral-particle transport equation can be reduced to a parallel prefix computation. In so doing, the often severe penalty in convergence rate of the source iteration, suffered by most current parallel algorithms using spatial domain decomposition, can be avoided while attaining parallelism in the spatial domain to whatever extent desired. In addition, the reduction implies parallel algorithm complexity limits for the ray-tracing sweep. The reduction applies to all closed, linear, one-cell functional (CLOF) spatial approximation methods, which encompasses most in current popular use. Scalability test results of an implementation of the algorithm on a 64-node nCube-2S hypercube-connected, message-passing, multi-computer are described. (author)

Numerical simulation allows the theorists to convince themselves about the validity of the models they use. Particularly by simulating the spin lattices one can judge about the validity of a conjecture. Simulating a system defined by a large number of degrees of freedom requires highly sophisticated machines. This study deals with modelling the magnetic interactions between the ions of a crystal. Many exact results have been found for spin 1/2 systems but not for systems of other spins for which many simulation have been carried out. The interest for simulations has been renewed by the Haldane's conjecture stipulating the existence of a energy gap between the ground state and the first excited states of a spin 1 lattice. The existence of this gap has been experimentally demonstrated. This report contains the following four chapters: 1. Spin systems; 2. Calculation of eigenvalues; 3. Programming; 4. Parallel calculation

If we think of our experiences as being recorded continuously on film, then human memory can be compared to a film library that is indexed by the contents of the film strips stored in it. Moreover, approximate retrieval cues suffice to retrieve information stored in this library: We recognize a familiar person in a fuzzy photograph or a familiar tune played on a strange instrument. This paper is about how to construct a computer memory that would allow a computer to recognize patterns and to recall sequences the way humans do. Such a memory is remarkably similar in structure to a conventional computer memory and also to the neural circuits in the cortex of the cerebellum of the human brain. The paper concludes that the frame problem of artificial intelligence could be solved by the use of such a memory if we were able to encode information about the world properly.

In a previous humorous note entitled 'Twelve Ways to Fool the Masses,' I outlined twelve common ways in which performance figures for technical computer systems can be distorted. In this paper and accompanying conference talk, I give a reprise of these twelve 'methods' and give some actual examples that have appeared in peer-reviewed literature in years past. I then propose guidelines for reporting performance, the adoption of which would raise the level of professionalism and reduce the level of confusion, not only in the world of device simulation but also in the larger arena of technical computing.

Full Text Available A fast technological progress is providing seismic tomographers with computers
of rapidly increasing speed and RAM, that are not always properly taken
advantage of. Large computers with both shared-memory and distributedmemory
architectures have made it possible to approach the tomographic
inverse problem more accurately. For example, resolution can be quantified
from the resolution matrix rather than checkerboard tests; the covariance
matrix can be calculated to evaluate the propagation of errors from data to
model parameters; the L-curve method can be applied to determine a range
of acceptable regularization schemes. We show how these exercises can be
implemented efficiently on different hardware architectures.

Fluids adsorbed near surfaces, macromolecules, and in porous materials are inhomogeneous, inhibiting spatially varying density distributions. This inhomogeneity in the fluid plays an important role in controlling a wide variety of complex physical phenomena including wetting, self-assembly, corrosion, and molecular recognition. One of the key methods for studying the properties of inhomogeneous fluids in simple geometries has been density functional theory (DFT). However, there has been a conspicuous lack of calculations in complex 2D and 3D geometries. The computational difficulty arises from the need to perform nested integrals that are due to nonlocal terms in the free energy functional These integral equations are expensive both in evaluation time and in memory requirements; however, the expense can be mitigated by intelligent algorithms and the use of parallelcomputers. This paper details our efforts to develop efficient numerical algorithms so that no local DFT calculations in complex geometries that require two or three dimensions can be performed. The success of this implementation will enable the study of solvation effects at heterogeneous surfaces, in zeolites, in solvated (bio)polymers, and in colloidal suspensions.

Fluids adsorbed near surfaces, near macromolecules, and in porous materials are inhomogeneous, exhibiting spatially varying density distributions. This inhomogeneity in the fluid plays an important role in controlling a wide variety of complex physical phenomena including wetting, self-assembly, corrosion, and molecular recognition. One of the key methods for studying the properties of inhomogeneous fluids in simple geometries has been density functional theory (DFT). However, there has been a conspicuous lack of calculations in complex two- and three-dimensional geometries. The computational difficulty arises from the need to perform nested integrals that are due to nonlocal terms in the free energy functional. These integral equations are expensive both in evaluation time and in memory requirements; however, the expense can be mitigated by intelligent algorithms and the use of parallelcomputers. This paper details the efforts to develop efficient numerical algorithms so that nonlocal DFT calculations in complex geometries that require two or three dimensions can be performed. The success of this implementation will enable the study of solvation effects at heterogeneous surfaces, in zeolites, in solvated (bio)polymers, and in colloidal suspensions

As the number and variety of computers within Los Alamos Scientific Laboratory's Central Computer Facility grows, the need for a standard, high-speed intercomputer interface has become more apparent. This report details the development of a High-Speed Parallel Interface from conceptual through implementation stages to meet current and future needs for large-scle network computing within the Integrated Computer Network. 4 figures

The determination of the fundamental (lowest) natural vibration frequencies and associated mode shapes is a key step used to uncover and correct potential failures or problem areas in most complex structures. However, the computation time taken by finite element codes to evaluate these natural frequencies is significant, often the most computationally intensive part of structural analysis calculations. There is continuing need to reduce this computation time. This study addresses this need by developing methods for parallelcomputation.

Computationalfluid dynamics (CFD) is one discipline falling under the broad heading of computer-aided engineering (CAE). CAE, together with computer-aided design (CAD) and computer-aided manufacturing (CAM), comprise a mathematical-based approach to engineering product and process design, analysis and fabrication. In this overview of CFD for the design engineer, our purposes are three-fold: (1) to define the scope of CFD and motivate its utility for engineering, (2) to provide a basic technical foundation for CFD, and (3) to convey how CFD is incorporated into engineering product and process design.

Finite element field analysis algorithms lend themselves to parallelization and this fact is exploited in this paper to implement a finite element analysis program for electromagnetic field computation on the Sequent Symmetry 81 parallelcomputer with three processors. In terms of waiting time, the maximum gains are to be made in matrix solution and therefore this paper concentrates on the gains in parallelizing the solution part of finite element analysis. An outline of how parallelization could be exploited in most finite element operations is given in this paper although the actual implemention of parallelism on the Sequent Symmetry 81 parallelcomputer was in sparsity computation, matrix assembly and the matrix solution areas. In all cases, the algorithms were modified suit the parallel programming application rather than allowing the compiler to parallelize on existing algorithms

There are no comprehensive simulation-based tools for engineering the flows of viscoelastic fluid-particle suspensions in fully three-dimensional geometries. On the other hand, the need for such a tool in engineering applications is immense. Suspensions of rigid particles in viscoelastic fluids play key roles in many energy applications. For example, in oil drilling the ``drilling mud'' is a very viscous, viscoelastic fluid designed to shear-thin during drilling, but thicken at stoppage so that the ``cuttings'' can remain suspended. In a related application known as hydraulic fracturing suspensions of solids called ``proppant'' are used to prop open the fracture by pumping them into the well. It is well-known that particle flow and settling in a viscoelastic fluid can be quite different from that which is observed in Newtonian fluids. First, it is now well known that the ``fluid particle split'' at bifurcation cracks is controlled by fluid rheology in a manner that is not understood. Second, in Newtonian fluids, the presence of an imposed shear flow in the direction perpendicular to gravity (which we term a cross or orthogonal shear flow) has no effect on the settling of a spherical particle in Stokes flow (i.e. at vanishingly small Reynolds number). By contrast, in a non-Newtonian liquid, the complex rheological properties induce a nonlinear coupling between the sedimentation and shear flow. Recent experimental data have shown both the shear thinning and the elasticity of the suspending polymeric solutions significantly affects the fluid-particle split at bifurcations, as well as the settling rate of the solids. In the present work, we use the Immersed Boundary Method to develop computer simulations of viscoelastic flow in suspensions of spheres to study these problems. These simulations allow us to understand the detailed physical mechanisms for the remarkable physical behavior seen in practice, and actually suggest design rules for creating new fluid recipes.

We describe a scheduler based on the microeconomic paradigm for scheduling on-line a set of parallel jobs in a multiprocessor system. In addition to the classical objectives of increasing the system throughput and reducing the response time, we consider fairness in allocating system resources among the users, and providing the user with control over the relative performances of his jobs. We associate with every user a savings account in which he receives money at a constant rate. When a user wants to run a job, he creates an expense account for that job to which he transfers money from his savings account. The job uses the funds in its expense account to obtain the system resources it needs for execution. The share of the system resources allocated to the user is directly related to the rate at which the user receives money; the rate at which the user transfers money into a job expense account controls the job's performance. We prove that starvation is not possible in our model. Simulation results show that our scheduler improves both system and user performances in comparison with two different variable partitioning policies. It is also shown to be effective in guaranteeing fairness and providing control over the performance of jobs.

A novel methodology for delineating multiple reservoir domains for the purpose of history matching in a distributed computing environment has been proposed. A fully probabilistic approach to perturb permeability within the delineated zones is implemented. The combination of robust schemes for identifying reservoir zones and distributed computing significantly increase the accuracy and efficiency of the probabilistic approach. The information pertaining to the permeability variations in the reservoir that is contained in dynamic data is calibrated in terms of a deformation parameter rD. This information is merged with the prior geologic information in order to generate permeability models consistent with the observed dynamic data as well as the prior geology. The relationship between dynamic response data and reservoir attributes may vary in different regions of the reservoir due to spatial variations in reservoir attributes, well configuration, flow constrains etc. The probabilistic approach then has to account for multiple r{sub D} values in different regions of the reservoir. In order to delineate reservoir domains that can be characterized with different rD parameters, principal component analysis (PCA) of the Hessian matrix has been done. The Hessian matrix summarizes the sensitivity of the objective function at a given step of the history matching to model parameters. It also measures the interaction of the parameters in affecting the objective function. The basic premise of PC analysis is to isolate the most sensitive and least correlated regions. The eigenvectors obtained during the PCA are suitably scaled and appropriate grid block volume cut-offs are defined such that the resultant domains are neither too large (which increases interactions between domains) nor too small (implying ineffective history matching). The delineation of domains requires calculation of Hessian, which could be computationally costly and as well as restricts the current approach to

This paper presents the application of distributed memory parallelcomputes to field scale reservoir simulations using a parallel version of UTCHEM, The University of Texas Chemical Flooding Simulator. The model is a general purpose highly vectorized chemical compositional simulator that can simulate a wide range of displacement processes at both field and laboratory scales. The original simulator was modified to run on both distributed memory parallel machines (Intel iPSC/960 and Delta, Connection Machine 5, Kendall Square 1 and 2, and CRAY T3D) and a cluster of workstations. A domain decomposition approach has been taken towards parallelization of the code. A portion of the discrete reservoir model is assigned to each processor by a set-up routine that attempts a data layout as even as possible from the load-balance standpoint. Each of these subdomains is extended so that data can be shared between adjacent processors for stencil computation. The added routines that make parallel execution possible are written in a modular fashion that makes the porting to new parallel platforms straight forward. Results of the distributed memory computing performance of Parallel simulator are presented for field scale applications such as tracer flood and polymer flood. A comparison of the wall-clock times for same problems on a vector supercomputer is also presented

In this paper,the state-of-the-art parallelcomputational model research is reviewed.We will introduce various models that were developed during the past decades.According to their targeting architecture features,especially memory organization,we classify these parallelcomputational models into three generations.These models and their characteristics are discussed based on three generations classification.We believe that with the ever increasing speed gap between the CPU and memory systems,incorporating non-uniform memory hierarchy into computational models will become unavoidable.With the emergence of multi-core CPUs,the parallelism hierarchy of current computing platforms becomes more and more complicated.Describing this complicated parallelism hierarchy in future computational models becomes more and more important.A semi-automatic toolkit that can extract model parameters and their values on real computers can reduce the model analysis complexity,thus allowing more complicated models with more parameters to be adopted.Hierarchical memory and hierarchical parallelism will be two very important features that should be considered in future model design and research.

This volume presents the results of ComputationalFluid Dynamics (CFD) analysis that can be used for conceptual studies of product design, detail product development, process troubleshooting. It demonstrates the benefit of CFD modeling as a cost saving, timely, safe and easy to scale-up methodology.

Methods, parallelcomputers, and products are provided for identifying failure in a tree network of a parallelcomputer. The parallelcomputer includes one or more processing sets including an I/O node and a plurality of compute nodes. For each processing set embodiments include selecting a set of test compute nodes, the test compute nodes being a subset of the compute nodes of the processing set; measuring the performance of the I/O node of the processing set; measuring the performance of the selected set of test compute nodes; calculating a current test value in dependence upon the measured performance of the I/O node of the processing set, the measured performance of the set of test compute nodes, and a predetermined value for I/O node performance; and comparing the current test value with a predetermined tree performance threshold. If the current test value is below the predetermined tree performance threshold, embodiments include selecting another set of test compute nodes. If the current test value is not below the predetermined tree performance threshold, embodiments include selecting from the test compute nodes one or more potential problem nodes and testing individually potential problem nodes and links to potential problem nodes.

It is shown that any multivariate polynomial of degree $d$ that can be computed sequentially in $C$ steps can be computed in parallel in $O((\\log d)(\\log C + \\log d))$ steps using only $(Cd)^{O(1)} $ processors....

Full Text Available The process of high gradient magnetic separation (HGMS using a microferromagnetic wire for capturing weakly magnetic nanoparticles in the irrotational flow of inviscid fluid is simulated by using parallel algorithm developed based on openMP. The two-dimensional problem of particle transport under the influences of magnetic force and fluid flow is considered in an annular domain surrounding the wire with inner radius equal to that of the wire and outer radius equal to various multiples of wire radius. The differential equations governing particle transport are solved numerically as an initial and boundary values problem by using the finite-difference method. Concentration distribution of the particles around the wire is investigated and compared with some previously reported results and shows the good agreement between them. The results show the feasibility of accumulating weakly magnetic nanoparticles in specific regions on the wire surface which is useful for applications in biomedical and environmental works. The speedup of parallel simulation ranges from 1.8 to 21 depending on the number of threads and the domain problem size as well as the number of iterations. With the nature of computing in the application and current multicore technology, it is observed that 4–8 threads are sufficient to obtain the optimized speedup.

Parallel programming covers task-parallelism and data-parallelism. Many problems need both parallelisms. Multi-SIMD computers allow hierarchical approach of these parallelisms. The T++ language, based on C++, is dedicated to exploit Multi-SIMD computers using a programming paradigm which is an extension of array-programming to tasks managing. Our language introduced array of independent tasks to achieve separately (MIMD), on subsets of processors of identical behaviour (SIMD), in order to translate the hierarchical inclusion of data-parallelism in task-parallelism. To manipulate in a symmetrical way tasks and data we propose meta-operations which have the same behaviour on tasks arrays and on data arrays. We explain how to implement this language on our parallelcomputer SYMPHONIE in order to profit by the locally-shared memory, by the hardware virtualization, and by the multiplicity of communications networks. We analyse simultaneously a typical application of such architecture. Finite elements scheme for Fluid mechanic needs powerful parallelcomputers and requires large floating points abilities. Lattice gases is an alternative to such simulations. Boolean lattice bases are simple, stable, modular, need to floating point computation, but include numerical noise. Boltzmann lattice gases present large precision of computation, but needs floating points and are only locally stable. We propose a new scheme, called multi-bit, who keeps the advantages of each boolean model to which it is applied, with large numerical precision and reduced noise. Experiments on viscosity, physical behaviour, noise reduction and spurious invariants are shown and implementation techniques for parallel Multi-SIMD computers detailed. (author) [fr

ComputationalFluid Dynamics: A Practical Approach, Third Edition, is an introduction to CFD fundamentals and commercial CFD software to solve engineering problems. The book is designed for a wide variety of engineering students new to CFD, and for practicing engineers learning CFD for the first time. Combining an appropriate level of mathematical background, worked examples, computer screen shots, and step-by-step processes, this book walks the reader through modeling and computing, as well as interpreting CFD results. This new edition has been updated throughout, with new content and improved figures, examples and problems.

The fast adaptive composite grid method (FAC) is an algorithm that uses various levels of uniform grids to provide adaptive resolution and fast solution of PDEs. An asynchronous version of FAC, called AFAC, that completely eliminates the bottleneck to parallelism is presented. This paper describes the advantage that this algorithm has in adaptive refinement for moving singularities on multiprocessor computers. This work is applicable to the parallel solution of two- and three-dimensional shock tracking problems. 6 refs

This paper describes a file system design for massively parallelcomputers which makes very efficient use of a few disks per processor. This overcomes the traditional I/O bottleneck of massively parallel machines by storing the data on disks within the high-speed interconnection network. In addition, the file system, called RAMA, requires little inter-node synchronization, removing another common bottleneck in parallel processor file systems. Support for a large tertiary storage system can easily be integrated in lo the file system; in fact, RAMA runs most efficiently when tertiary storage is used.

High-order resolution Discontinuous Galerkin finite element methods (DGFEM) has been known as a good method for solving Euler equations and Navier-Stokes equations on unstructured grid, but it costs too much computational resources. An efficient parallel algorithm was presented for solving the compressible Euler equations. Moreover, the multigrid strategy based on three-stage three-order TVD Runge-Kutta scheme was used in order to improve the computational efficiency of DGFEM and accelerate the convergence of the solution of unsteady compressible Euler equations. In order to make each processor maintain load balancing, the domain decomposition method was employed. Numerical experiment performed for the inviscid transonic flow fluid problems around NACA0012 airfoil and M6 wing. The results indicated that our parallel algorithm can improve acceleration and efficiency significantly, which is suitable for calculating the complex flow fluid.

A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levELs of parallelism on multiple instruction multiple data machines are indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations bAsed on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.

The objective of this project is to purchase a state-of-the-art graphics supercomputer to improve the ComputationalFluid Dynamics (CFD) research capability at Alabama A & M University (AAMU) and to support the Air Force research projects. A cutting-edge graphics supercomputer system, Onyx VTX, from Silicon Graphics Computer Systems (SGI), was purchased and installed. Other equipment including a desktop personal computer, PC-486 DX2 with a built-in 10-BaseT Ethernet card, a 10-BaseT hub, an Apple Laser Printer Select 360, and a notebook computer from Zenith were also purchased. A reading room has been converted to a research computer lab by adding some furniture and an air conditioning unit in order to provide an appropriate working environments for researchers and the purchase equipment. All the purchased equipment were successfully installed and are fully functional. Several research projects, including two existing Air Force projects, are being performed using these facilities.

Recent advances in developing numerical algorithms for solving fluid flow problems, and the continuing improvement in the speed and storage of large scale computers have made it feasible to compute the flow field about complex and realistic configurations. Current solution methods involve the use of a hierarchy of mathematical models ranging from the linearized potential equation to the Navier Stokes equations. Because of the increasing complexity of both the geometries and flowfields encountered in practical fluid flow simulation, there is a growing emphasis in computationalfluid dynamics on the use of zonal methods. A zonal method is one that subdivides the total flow region into interconnected smaller regions or zones. The flow solutions in these zones are then patched together to establish the global flow field solution. Zonal methods are primarily used either to limit the complexity of the governing flow equations to a localized region or to alleviate the grid generation problems about geometrically complex and multicomponent configurations. This paper surveys the application of zonal methods for solving the flow field about two and three-dimensional configurations. Various factors affecting their accuracy and ease of implementation are also discussed. From the presented review it is concluded that zonal methods promise to be very effective for computing complex flowfields and configurations. Currently there are increasing efforts to improve their efficiency, versatility, and accuracy

1 - Description of program or function: CUBESIM is a set of subroutine libraries and programs for the simulation of message-passing parallelcomputers and shared-memory parallelcomputers. Subroutines are supplied to simulate the Intel hypercube and the Denelcor HEP parallelcomputers. The system permits a user to develop and test parallel programs written in C or FORTRAN on a single processor. The user may alter such hypercube parameters as message startup times, packet size, and the computation-to-communication ratio. The simulation generates a trace file that can be used for debugging, performance analysis, or graphical display. 2 - Method of solution: The CUBESIM simulator is linked with the user's parallel application routines to run as a single UNIX process. The simulator library provides a small operating system to perform process and message management. 3 - Restrictions on the complexity of the problem: Up to 128 processors can be simulated with a virtual memory limit of 6 million bytes. Up to 1000 processes can be simulated

Processing data communications events in a parallel active messaging interface (`PAMI`) of a parallelcomputer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for the context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context.

Scientific computing is an inherently exploratory activity that requires constantly cycling between code, data and results, each time adjusting the computations as new insights and questions arise. To support such a workflow, good interactive environments are critical. The IPython project (http://ipython.org) provides a rich architecture for interactive computing with: 1. Terminal-based and graphical interactive consoles. 2. A web-based Notebook system with support for code, text, mathematical expressions, inline plots and other rich media. 3. Easy to use, high performance tools for parallelcomputing. Despite its roots in Python, the IPython architecture is designed in a language-agnostic way to facilitate interactive computing in any language. This allows users to mix Python with Julia, R, Octave, Ruby, Perl, Bash and more, as well as to develop native clients in other languages that reuse the IPython clients. In this talk, I will show how IPython supports all stages in the lifecycle of a scientific idea: 1. Individual exploration. 2. Collaborative development. 3. Production runs with parallel resources. 4. Publication. 5. Education. In particular, the IPython Notebook provides an environment for "literate computing" with a tight integration of narrative and computation (including parallelcomputing). These Notebooks are stored in a JSON-based document format that provides an "executable paper": notebooks can be version controlled, exported to HTML or PDF for publication, and used for teaching.

Techniques are provided for small file aggregation in a parallelcomputing system. An exemplary method for storing a plurality of files generated by a plurality of processes in a parallelcomputing system comprises aggregating the plurality of files into a single aggregated file; and generating metadata for the single aggregated file. The metadata comprises an offset and a length of each of the plurality of files in the single aggregated file. The metadata can be used to unpack one or more of the files from the single aggregated file.

In this paper, we present a fast numerical algorithm for solving total variation and l(1) (TVL1) based image reconstruction with application in partially parallel magnetic resonance imaging. Our algorithm uses variable splitting method to reduce computational cost. Moreover, the Barzilai-Borwein step size selection method is adopted in our algorithm for much faster convergence. Experimental results on clinical partially parallel imaging data demonstrate that the proposed algorithm requires much fewer iterations and/or less computational cost than recently developed operator splitting and Bregman operator splitting methods, which can deal with a general sensing matrix in reconstruction framework, to get similar or even better quality of reconstructed images.

This paper present the analysis, parallelization and optimization approach for the large sparse matrix solver CNSPACK for modern multicore microprocessors. CNSPACK is an advanced solver successfully used for coupled solution of stiff problems arising in multiphysics applications such as CFD, semiconductor transport, kinetic and quantum problems. It employs iterative IDR algorithm with ILU preconditioning (user chosen ILU preconditioning order). CNSPACK has been successfully used during last decade for solving problems in several application areas, including fluid dynamics and semiconductor device simulation. However, there was a dramatic change in processor architectures and computer system organization in recent years. Due to this, performance criteria and methods have been revisited, together with involving the parallelization of the solver and preconditioner using Open MP environment. Results of the successful implementation for efficient parallelization are presented for the most advances computer system (Intel Core i7-9xx or two-processor Xeon 55xx/56xx).

A design of a parallel aeroelastic code for aircraft integrated simulations is conducted. The method for integrating aerodynamics and structural dynamics software on parallelcomputers is devised by using the Euler/Navier-Stokes equations coupled with wing-box finite element structures. A synthesis of modern aircraft requires the optimizations of aerodynamics, structures, controls, operabilities, or other design disciplines, and the R and D efforts to implement Multidisciplinary Design Optimization environments using high performance computers are made especially among the U.S. aerospace industries. This report describes a Multiple Program Multiple Data (MPMD) parallelization of aerodynamics and structural dynamics codes with a dynamic deformation grid. A three-dimensional computation of a flowfield with dynamic deformation caused by a structural deformation is performed, and a pressure data calculated is used for a computation of the structural deformation which is input again to a fluid dynamics code. This process is repeated exchanging the computed data of pressures and deformations between flowfield grids and structural elements. It enables to simulate the structure movements which take into account of the interaction of fluid and structure. The conceptual design for achieving the aforementioned various functions is reported. Also the future extensions to incorporate control systems, which enable to simulate a realistic aircraft configuration to be a major tool for Aircraft Integrated Simulation, are investigated. (author)

CFD-calculations have been rapidly developed to a powerful tool for the analysis of air pollution distribution in various spaces. However, the user of CFD-calculation should be aware of the basic principles of calculations and specifically the boundary conditions. ComputationalFluid Dynamics (CFD) – in Ventilation Design models is written by a working group of highly qualified international experts representing research, consulting and design.

The multiview phase-shifting method shows its powerful capability in achieving high resolution three-dimensional (3-D) shape measurement. Unfortunately, this ability results in very high computation costs and 3-D computations have to be processed offline. To realize real-time 3-D shape measurement, a hybrid parallelcomputing architecture is proposed for multiview phase shifting. In this architecture, the central processing unit can co-operate with the graphic processing unit (GPU) to achieve hybrid parallelcomputing. The high computation cost procedures, including lens distortion rectification, phase computation, correspondence, and 3-D reconstruction, are implemented in GPU, and a three-layer kernel function model is designed to simultaneously realize coarse-grained and fine-grained parallelingcomputing. Experimental results verify that the developed system can perform 50 fps (frame per second) real-time 3-D measurement with 260 K 3-D points per frame. A speedup of up to 180 times is obtained for the performance of the proposed technique using a NVIDIA GT560Ti graphics card rather than a sequential C in a 3.4 GHZ Inter Core i7 3770.

Parallelcomputing meets the ever-increasing requirements for neutronic computer code speed and accuracy. In this work, two different approaches have been considered. We first parallelized the sequential algorithm used by the neutronics code CRONOS developed at the French Atomic Energy Commission. The algorithm computes the dominant eigenvalue associated with PN simplified transport equations by a mixed finite element method. Several parallel algorithms have been developed on distributed memory machines. The performances of the parallel algorithms have been studied experimentally by implementation on a T3D Cray and theoretically by complexity models. A comparison of various parallel algorithms has confirmed the chosen implementations. We next applied a domain sub-division technique to the two-group diffusion Eigen problem. In the modal synthesis-based method, the global spectrum is determined from the partial spectra associated with sub-domains. Then the Eigen problem is expanded on a family composed, on the one hand, from eigenfunctions associated with the sub-domains and, on the other hand, from functions corresponding to the contribution from the interface between the sub-domains. For a 2-D homogeneous core, this modal method has been validated and its accuracy has been measured. (author)

Often very fundamental biochemical and biophysical problems defy simulations because of limitations in today's computers. We present and discuss a distributed system composed of two IBM 4341 s and/or an IBM 4381 as front-end processors and ten FPS-164 attached array processors. This parallel system - called LCAP - has presently a peak performance of about 110 Mflops; extensions to higher performance are discussed. Presently, the system applications use a modified version of VM/SP as the operating system: description of the modifications is given. Three applications programs have been migrated from sequential to parallel: a molecular quantum mechanical, a Metropolis-Monte Carlo and a molecular dynamics program. Descriptions of the parallel codes are briefly outlined. Use of these parallel codes has already opened up new capabilities for our research. The very positive performance comparisons with today's supercomputers allow us to conclude that parallelcomputers and programming, of the type we have considered, represent a pragmatic answer to many computationally intensive problems. (orig.)

Full Text Available The edit distance between two sequences is the minimum number of weighted transformation-operations that are required to transform one string into the other. The weighted transformation-operations are insert, remove, and substitute. Dynamic programming solution to find edit distance exists but it becomes computationally intensive when the lengths of strings become very large. This work presents a novel parallel algorithm to solve edit distance problem of string matching. The algorithm is based on resolving dependencies in the dynamic programming solution of the problem and it is able to compute each row of edit distance table in parallel. In this way, it becomes possible to compute the complete table in min(m,n iterations for strings of size m and n whereas state-of-the-art parallel algorithm solves the problem in max(m,n iterations. The proposed algorithm also increases the amount of parallelism in each of its iteration. The algorithm is also capable of exploiting spatial locality while its implementation. Additionally, the algorithm works in a load balanced way that further improves its performance. The algorithm is implemented for multicore systems having shared memory. Implementation of the algorithm in OpenMP shows linear speedup and better execution time as compared to state-of-the-art parallel approach. Efficiency of the algorithm is also proven better in comparison to its competitor.

We study the potential performance of multigrid algorithms running on massively parallelcomputers with the intent of discovering whether presently envisioned machines will provide an efficient platform for such algorithms. We consider the domain parallel version of the standard V cycle algorithm on model problems, discretized using finite difference techniques in two and three dimensions on block structured grids of size 10(exp 6) and 10(exp 9), respectively. Our models of parallelcomputation were developed to reflect the computing characteristics of the current generation of massively parallel multicomputers. These models are based on an interconnection network of 256 to 16,384 message passing, 'workstation size' processors executing in an SPMD mode. The first model accomplishes interprocessor communications through a multistage permutation network. The communication cost is a logarithmic function which is similar to the costs in a variety of different topologies. The second model allows single stage communication costs only. Both models were designed with information provided by machine developers and utilize implementation derived parameters. With the medium grain parallelism of the current generation and the high fixed cost of an interprocessor communication, our analysis suggests an efficient implementation requires the machine to support the efficient transmission of long messages, (up to 1000 words) or the high initiation cost of a communication must be significantly reduced through an alternative optimization technique. Furthermore, with variable length message capability, our analysis suggests the low diameter multistage networks provide little or no advantage over a simple single stage communications network.

With current technologies, it seems to be very difficult to implement quantum computers with many qubits. It is therefore of importance to simulate quantum algorithms and circuits on the existing computers. However, for a large-size problem, the simulation often requires more computational power than is available from sequential processing. Therefore, simulation methods for parallel processors are required. We have developed a general-purpose simulator for quantum algorithms/circuits on the parallelcomputer (Sun Enterprise4500). It can simulate algorithms/circuits with up to 30 qubits. In order to test efficiency of our proposed methods, we have simulated Shor's factorization algorithm and Grover's database search, and we have analyzed robustness of the corresponding quantum circuits in the presence of both decoherence and operational errors. The corresponding results, statistics, and analyses are presented in this paper

The development of an O(log2N) parallel algorithm for the manipulator inertia matrix is presented. It is based on the most efficient serial algorithm which uses the composite rigid body method. Recursive doubling is used to reformulate the linear recurrence equations which are required to compute the diagonal elements of the matrix. It results in O(log2N) levels of computation. Computation of the off-diagonal elements involves N linear recurrences of varying-size and a new method, which avoids redundant computation of position and orientation transforms for the manipulator, is developed. The O(log2N) algorithm is presented in both equation and graphic forms which clearly show the parallelism inherent in the algorithm.

Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallelcomputer, the PAMI including data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task; the compute nodes coupled for data communications through the PAMI and through data communications resources including at least one segment of shared random access memory; including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers through a segment of shared memory; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.

A family of preconditioners for the solution of finite element equations are presented, which are element-topology independent and thus can be applicable to element order-free parallelcomputations. A key feature of the present preconditioners is the repeated use of element connectivity matrices and their left and right inverses. The properties and performance of the present preconditioners are demonstrated via beam and two-dimensional finite element matrices for implicit time integration computations.

Aimed to evaluate the image processing capabilities of the massively parallelcomputer Quadrics Q1, a convolution algorithm that has been implemented is described in this report. At first the discrete convolution mathematical definition is recalled together with the main Q1 h/w and s/w features. Then the different codification forms of the algorythm are described and the Q1 performances are compared with those obtained by different computers. Finally, the conclusions report on main results and suggestions

Full Text Available Data mining is a technology that can extract useful information from large amounts of data. However, mining a database often requires a high computational power. To resolve this problem, this paper presents a tool (Weka-G, which runs in parallel algorithms used in the mining process data. As the environment for doing so, we use a computational grid by adding several features within a WAN.

Method and system for hardware packet pacing using a direct memory access controller in a parallelcomputer which, in one aspect, keeps track of a total number of bytes put on the network as a result of a remote get operation, using a hardware token counter.

In this article, we discuss a parallel implementation of efficient algorithms for computation of Legendre polynomial transforms and other orthogonal polynomial transforms. We develop an approach to the Driscoll-Healy algorithm using polynomial arithmetic and present experimental results on the

In this article we discuss a parallel implementation of efficient algorithms for computation of Legendre polynomial transforms and other orthogonal polynomial transforms. We develop an approach to the Driscoll-Healy algorithm using polynomial arithmetic and present experimental results on the

The Transient Reactor Analysis Code (TRAC), which features a two- fluid treatment of thermal-hydraulics, is designed to model transients in water reactors and related facilities. One of the major computational costs associated with TRAC and similar codes is calculating constitutive coefficients. Although the formulations for these coefficients are local the costs are flow-regime- or data-dependent; i.e., the computations needed for a given spatial node often vary widely as a function of time. Consequently, poor load balancing will degrade efficiency on either vector or data parallel architectures when the data are organized according to spatial location. Unfortunately, a general automatic solution to the load-balancing problem associated with data-dependent computations is not yet available for massively parallel architectures. This document discusses why developers algorithms, such as a neural net representation, that do not exhibit algorithms, such as a neural net representation, that do not exhibit load-balancing problems

Applications for computers which improve instruction in fluid dynamics are examined. Computers can be used to illustrate three-dimensional flow fields and simple fluid dynamics mechanisms, to solve fluid dynamics problems, and for electronic sketching. The usefulness of computer applications is limited by computer speed, memory, and software and the clarity and field of view of the projected display. Proposed advances in personal computers which will address these limitations are discussed. Long range applications for computers in education are considered.

The domain of nanoscale optical science and technology is a combination of the classical world of electromagnetics and the quantum mechanical regime of atoms and molecules. Recent advancements in fabrication technology allows the optical structures to be scaled down to nanoscale size or even to the atomic level, which are far smaller than the wavelength they are designed for. These nanostructures can have unique, controllable, and tunable optical properties and their interactions with quantum materials can have important near-field and far-field optical response. Undoubtedly, these optical properties can have many important applications, ranging from the efficient and tunable light sources, detectors, filters, modulators, high-speed all-optical switches; to the next-generation classical and quantum computation, and biophotonic medical sensors. This emerging research of nanoscience, known as nanophotonics, is a highly interdisciplinary field requiring expertise in materials science, physics, electrical engineering, and scientific computing, modeling and simulation. It has also become an important research field for investigating the science and engineering of light-matter interactions that take place on wavelength and subwavelength scales where the nature of the nanostructured matter controls the interactions. In addition, the fast advancements in the computing capabilities, such as parallelcomputing, also become as a critical element for investigating advanced nanophotonic devices. This role has taken on even greater urgency with the scale-down of device dimensions, and the design for these devices require extensive memory and extremely long core hours. Thus distributed computing platforms associated with parallelcomputing are required for faster designs processes. Scientific parallelcomputing constructs mathematical models and quantitative analysis techniques, and uses the computing machines to analyze and solve otherwise intractable scientific challenges. In

In parallel calculations of combustion processes with realistic chemistry, the serial in situ adaptive tabulation (ISAT) algorithm [S.B. Pope, Computationally efficient implementation of combustion chemistry using in situ adaptive tabulation, Combustion Theory and Modelling, 1 (1997) 41-63; L. Lu, S.B. Pope, An improved algorithm for in situ adaptive tabulation, Journal of Computational Physics 228 (2009) 361-386] substantially speeds up the chemistry calculations on each processor. To improve the parallel efficiency of large ensembles of such calculations in parallelcomputations, in this work, the ISAT algorithm is extended to the multi-processor environment, with the aim of minimizing the wall clock time required for the whole ensemble. Parallel ISAT strategies are developed by combining the existing serial ISAT algorithm with different distribution strategies, namely purely local processing (PLP), uniformly random distribution (URAN), and preferential distribution (PREF). The distribution strategies enable the queued load redistribution of chemistry calculations among processors using message passing. They are implemented in the software x2f m pi, which is a Fortran 95 library for facilitating many parallel evaluations of a general vector function. The relative performance of the parallel ISAT strategies is investigated in different computational regimes via the PDF calculations of multiple partially stirred reactors burning methane/air mixtures. The results show that the performance of ISAT with a fixed distribution strategy strongly depends on certain computational regimes, based on how much memory is available and how much overlap exists between tabulated information on different processors. No one fixed strategy consistently achieves good performance in all the regimes. Therefore, an adaptive distribution strategy, which blends PLP, URAN and PREF, is devised and implemented. It yields consistently good performance in all regimes. In the adaptive parallel

The general purpose Monte Carlo code MCNP4 has been implemented on the Fujitsu AP1000 distributed memory highly parallelcomputer. Parallelization techniques developed and studied are reported. A shielding analysis function of the MCNP4 code is parallelized in this study. A technique to map a history to each processor dynamically and to map control process to a certain processor was applied. The efficiency of parallelized code is up to 80% for a typical practical problem with 512 processors. These results demonstrate the advantages of a highly parallelcomputer to the conventional computers in the field of shielding analysis by Monte Carlo method. (orig.)

A general-purpose 3-D, incompressible Navier-Stokes algorithm is implemented on a network of concurrently operating workstations using parallel virtual machine (PVM) and compared with its performance on a CRAY Y-MP and on an Intel iPSC/860. The problem is relatively computationally intensive, and has a communication structure based primarily on nearest-neighbor communication, making it ideally suited to message passing. Such problems are frequently encountered in computationalfluid dynamics (CDF), and their solution is increasingly in demand. The communication structure is explicitly coded in the implementation to fully exploit the regularity in message passing in order to produce a near-optimal solution. Results are presented for various grid sizes using up to eight processors.

As data sets grow to exascale, automated data analysis and visualisation are increasingly important, to intermediate human understanding and to reduce demands on disk storage via in situ analysis. Trends in architecture of high performance computing systems necessitate analysis algorithms to make effective use of combinations of massively multicore and distributed systems. One of the principal analytic tools is the contour tree, which analyses relationships between contours to identify features of more than local importance. Unfortunately, the predominant algorithms for computing the contour tree are explicitly serial, and founded on serial metaphors, which has limited the scalability of this form of analysis. While there is some work on distributed contour tree computation, and separately on hybrid GPU-CPU computation, there is no efficient algorithm with strong formal guarantees on performance allied with fast practical performance. Here in this paper, we report the first shared SMP algorithm for fully parallel contour tree computation, withfor-mal guarantees of O(lgnlgt) parallel steps and O(n lgn) work, and implementations with up to 10x parallel speed up in OpenMP and up to 50x speed up in NVIDIA Thrust.

of the large number of times this calculation needs to be performed, this is computationally very expensive even on supercomputers. The classical approach is based on recurrence formulas which cannot be efficiently parallelized. This practically prevents the solution of large problems with hundreds...... of thousands of atoms. We propose new recurrences for a general class of sparse matrices to calculate Green's and lesser Green's function matrices which extend formulas derived by Takahashi and others. We show that these recurrences may lead to a dramatically reduced computational cost because they only...... require computing a small number of entries of the inverse matrix. Then. we propose a parallelization strategy for block tridiagonal matrices which involves a combination of Schur complement calculations and cyclic reduction. It achieves good scalability even on problems of modest size....

Ordering clones from a genomic library into physical maps of whole chromosomes presents a central computational problem in genetics. Chromosome reconstruction via clone ordering is usually isomorphic to the NP-complete Optimal Linear Arrangement problem. Parallel SIMD and MIMD algorithms for simulated annealing based on Markov chain distribution are proposed and applied to the problem of chromosome reconstruction via clone ordering. Perturbation methods and problem-specific annealing heuristics are proposed and described. The SIMD algorithms are implemented on a 2048 processor MasPar MP-2 system which is an SIMD 2-D toroidal mesh architecture whereas the MIMD algorithms are implemented on an 8 processor Intel iPSC/860 which is an MIMD hypercube architecture. A comparative analysis of the various SIMD and MIMD algorithms is presented in which the convergence, speedup, and scalability characteristics of the various algorithms are analyzed and discussed. On a fine-grained, massively parallel SIMD architecture with a low synchronization overhead such as the MasPar MP-2, a parallel simulated annealing algorithm based on multiple periodically interacting searches performs the best. For a coarse-grained MIMD architecture with high synchronization overhead such as the Intel iPSC/860, a parallel simulated annealing algorithm based on multiple independent searches yields the best results. In either case, distribution of clonal data across multiple processors is shown to exacerbate the tendency of the parallel simulated annealing algorithm to get trapped in a local optimum.

Today's genomic experiments have to process the so-called "biological big data" that is now reaching the size of Terabytes and Petabytes. To process this huge amount of data, scientists may require weeks or months if they use their own workstations. Parallelism techniques and high-performance computing (HPC) environments can be applied for reducing the total processing time and to ease the management, treatment, and analyses of this data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing unit requires the expertise from scientists to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists for processing their genomic experiments using HPC capabilities and parallelism techniques. This article brings a systematic review of literature that surveys the most recently published research involving genomics and parallelcomputing. Our objective is to gather the main characteristics, benefits, and challenges that can be considered by scientists when running their genomic experiments to benefit from parallelism techniques and HPC capabilities.

Aggregating job exit statuses of a plurality of compute nodes executing a parallel application, including: identifying a subset of compute nodes in the parallelcomputer to execute the parallel application; selecting one compute node in the subset of compute nodes in the parallelcomputer as a job leader compute node; initiating execution of the parallel application on the subset of compute nodes; receiving an exit status from each compute node in the subset of compute nodes, where the exit status for each compute node includes information describing execution of some portion of the parallel application by the compute node; aggregating each exit status from each compute node in the subset of compute nodes; and sending an aggregated exit status for the subset of compute nodes in the parallelcomputer.

Using the PC cluster system with 16 dual CPU nodes, we accelerate the FBP and OR-OSEM reconstruction of high spatial resolution image (2048 x 2048). Based on the number of projections, we rewrite the reconstruction algorithms into parallel format and dispatch the tasks to each CPU. By parallelcomputing, the speedup factor is roughly equal to the number of CPUs, which can be up to about 25 times when 25 CPUs used. This technique is very suitable for real-time high spatial resolution CT image reconstruction. (authors)

This document summarizes research performed under the SNL LDRD entitled - Computational Mechanics for Geosystems Management to Support the Energy and Natural Resources Mission. The main accomplishment was development of a foundational SNL capability for computational thermal, chemical, fluid, and solid mechanics analysis of geosystems. The code was developed within the SNL Sierra software system. This report summarizes the capabilities of the simulation code and the supporting research and development conducted under this LDRD. The main goal of this project was the development of a foundational capability for coupled thermal, hydrological, mechanical, chemical (THMC) simulation of heterogeneous geosystems utilizing massively parallel processing. To solve these complex issues, this project integrated research in numerical mathematics and algorithms for chemically reactive multiphase systems with computer science research in adaptive coupled solution control and framework architecture. This report summarizes and demonstrates the capabilities that were developed together with the supporting research underlying the models. Key accomplishments are: (1) General capability for modeling nonisothermal, multiphase, multicomponent flow in heterogeneous porous geologic materials; (2) General capability to model multiphase reactive transport of species in heterogeneous porous media; (3) Constitutive models for describing real, general geomaterials under multiphase conditions utilizing laboratory data; (4) General capability to couple nonisothermal reactive flow with geomechanics (THMC); (5) Phase behavior thermodynamics for the CO2-H2O-NaCl system. General implementation enables modeling of other fluid mixtures. Adaptive look-up tables enable thermodynamic capability to other simulators; (6) Capability for statistical modeling of heterogeneity in geologic materials; and (7) Simulator utilizes unstructured grids on parallel processing computers.

The smoothed particle hydrodynamics (SPH), which is a class of meshfree particle methods (MPMs), has a wide range of applications from micro-scale to macro-scale as well as from discrete systems to continuum systems. Graphics hardware, originally designed for computer graphics, now provide unprecedented computational power for scientific computation. Particle system needs a huge amount of computations in physical simulation. In this paper, an efficient parallel implementation of a SPH method on graphics hardware using the Compute Unified Device Architecture is developed for fluid simulation. Comparing to the corresponding CPU implementation, our experimental results show that the new approach allows significant speedups of fluid simulation through handling huge amount of computations in parallel on graphics hardware.

Nuclear plants use steam generator safety valves in order to regulate possible large pressure variations of fluids. In case of an incident these valves may be fed with pressurized liquid water (for instance a pressure of 9 MPa at a temperature of 300degC). When a pressurized liquid is submitted to a strong pressure drop, it will start evaporating. This phenomena is called flashing. Z. Bilicki and co-authors proposed the homogeneous relaxation model (HRM) to compute critical flashing water flows. Its computation in the case of non stationary one-dimensional flashing flows has been carried out with the development of a dedicated time dependent Finite Volume scheme based on a simplified version of the Godunov approach. Electricite De France Research and Development division have developed a monodimensional implementation of the HRM model: ECOSS, a 11000 lines FORTRAN 90. Applied to a shock tube test case with a 20000 elements monodimensional mesh, the simulation of the physical phenomenon during 2.5 seconds requires at least 100 days of computation on a SUN Sparc-Ultra60. This execution time justifies the ECOSS parallelization. Furthermore, we plan a modeling on 2D meshes for the next few years. Knowing that multiplying the mesh dimension by a factor 10 multiplies the execution time by a factor 100, ECOSS would take years of computation with small 2D meshes (1000 x 1000) on a conventional workstation. This paper describes the parallelization analysis we have conducted and we presents the experimental results we have obtained applying different programming model (MPI, OpenMP, HPF) on various platforms (a Compaq Proliant 6000 4 processors, a Cray T3E-750 300 processors, a HP class V 16 processors, a SGI Origin2000 32 processors, a cluster of PCs and a COMPAQ SC 232 processors). These experimental results will be discussed according to the following criteria: efficiency, salability, maintainability, developing costs and portability. As a conclusion, we will present the

Nuclear plants use steam generator safety valves in order to regulate possible large pressure variations of fluids. In case of an incident these valves may be fed with pressurized liquid water (for instance a pressure of 9 MPa at a temperature of 300degC). When a pressurized liquid is submitted to a strong pressure drop, it will start evaporating. This phenomena is called flashing. Z. Bilicki and co-authors proposed the homogeneous relaxation model (HRM) to compute critical flashing water flows. Its computation in the case of non stationary one-dimensional flashing flows has been carried out with the development of a dedicated time dependent Finite Volume scheme based on a simplified version of the Godunov approach. Electricite De France Research and Development division have developed a monodimensional implementation of the HRM model: ECOSS, a 11000 lines FORTRAN 90. Applied to a shock tube test case with a 20000 elements monodimensional mesh, the simulation of the physical phenomenon during 2.5 seconds requires at least 100 days of computation on a SUN Sparc-Ultra60. This execution time justifies the ECOSS parallelization. Furthermore, we plan a modeling on 2D meshes for the next few years. Knowing that multiplying the mesh dimension by a factor 10 multiplies the execution time by a factor 100, ECOSS would take years of computation with small 2D meshes (1000 x 1000) on a conventional workstation. This paper describes the parallelization analysis we have conducted and we presents the experimental results we have obtained applying different programming model (MPI, OpenMP, HPF) on various platforms (a Compaq Proliant 6000 4 processors, a Cray T3E-750 300 processors, a HP class V 16 processors, a SGI Origin2000 32 processors, a cluster of PCs and a COMPAQ SC 232 processors). These experimental results will be discussed according to the following criteria: efficiency, salability, maintainability, developing costs and portability. As a conclusion, we will present the

A discrete ordinate response matrix method is formulated for the solution of neutron transport problems on massively parallelcomputers. The response matrix formulation eliminates iteration on the scattering source. The nodal matrices which result from the diamond-differenced equations are utilized in a factored form which minimizes memory requirements and significantly reduces the required number of algorithm utilizes massive parallelism by assigning each spatial node to a processor. The algorithm is accelerated effectively by a synthetic method in which the low-order diffusion equations are also solved by massively parallel red/black iterations. The method has been implemented on a 16k Connection Machine-2, and S 8 and S 16 solutions have been obtained for fixed-source benchmark problems in X--Y geometry

The event selection requirements of contemporary high energy and nuclear physics experiments are met by the introduction of on-line trigger algorithms which identify potentially interesting events and reduce the data acquisition rate to levels that are manageable by the electronics. Such algorithms being parallel in nature can be simulated off-line using massively parallelcomputers. The PHENIX experiment intends to investigate the possible existence of a new phase of matter called the quark gluon plasma which has been theorized to have existed in very early stages of the evolution of the universe by studying collisions of heavy nuclei at ultra-relativistic energies. Such interactions can also reveal important information regarding the structure of the nucleus and mandate a thorough investigation of the simpler proton-nucleus collisions at the same energies. The complexity of PHENIX events and the need to analyze and also simulate them at rates similar to the data collection ones imposes enormous computation demands. This work is a first effort to implement PHENIX trigger algorithms on parallelcomputers and to study the feasibility of using such machines to run the complex programs necessary for the simulation of the PHENIX detector response. Fine and coarse grain approaches have been studied and evaluated. Depending on the application the performance of a massively parallelcomputer can be much better or much worse than that of a serial workstation. A comparison between single instruction and multiple instruction computers is also made and possible applications of the single instruction machines to high energy and nuclear physics experiments are outlined. copyright 1995 American Institute of Physics

New developments in Computer Science, both hardware and software, offer researchers, such as physicists, unprecedented possibilities to solve their computational intensive problems.However, full exploitation of e.g. new massively parallelcomputers, parallel languages or runtime environments

In this paper, a parallel model for the solution of the computationally intensive multichannel phase unwrapping (MCh-PhU) problem is proposed. Firstly, the Extended Minimum Cost Flow (EMCF) algorithm for solving MCh-PhU problem is revised within the rigorous mathematical framework of the discrete calculus ; thus permitting to capture its topological structure in terms of meaningful discrete differential operators. Secondly, emphasis is placed on those methodological and practical aspects, which lead to a parallel reformulation of the EMCF algorithm. Thus, a novel dual-level parallelcomputational model, in which the parallelism is hierarchically implemented at two different (i.e., process and thread) levels, is presented. The validity of our approach has been demonstrated through a series of experiments that have revealed a significant speedup. Therefore, the attained high-performance prototype is suitable for the solution of large-scale phase unwrapping problems in reasonable time frames, with a significant impact on the systematic exploitation of the existing, and rapidly growing, large archives of SAR data.

Methods, apparatus, and computer program products are disclosed for executing a gather operation on a parallelcomputer according to embodiments of the present invention. Embodiments include configuring, by the logical root, a result buffer or the logical root, the result buffer having positions, each position corresponding to a ranked node in the operational group and for storing contribution data gathered from that ranked node. Embodiments also include repeatedly for each position in the result buffer: determining, by each compute node of an operational group, whether the current position in the result buffer corresponds with the rank of the compute node, if the current position in the result buffer corresponds with the rank of the compute node, contributing, by that compute node, the compute node's contribution data, if the current position in the result buffer does not correspond with the rank of the compute node, contributing, by that compute node, a value of zero for the contribution data, and storing, by the logical root in the current position in the result buffer, results of a bitwise OR operation of all the contribution data by all compute nodes of the operational group for the current position, the results received through the global combining network.

Full Text Available In digital image processing and computer vision, a fairly frequent task is the performance comparison of different algorithms on enormous image databases. This task is usually time-consuming and tedious, such that any kind of tool to simplify this work is welcome. To achieve an efficient and more practical handling of a normally tedious evaluation, we implemented the automatic detection system, with the help of MATLAB®’s ParallelComputing Toolbox™. The key parts of the system have been parallelized to achieve simultaneous execution and analysis of segmentation algorithms on the one hand and the evaluation of detection accuracy for the nonforested regions, such as a study case, on the other hand. As a positive side effect, CPU usage was reduced and processing time was significantly decreased by 68.54% compared to sequential processing (i.e., executing the system with each algorithm one by one.

As part of the Center for Programming Models for Scalable ParallelComputing, Rice University collaborated with project partners in the design, development and deployment of language, compiler, and runtime support for parallel programming models to support application development for the “leadership-class” computer systems at DOE national laboratories. Work over the course of this project has focused on the design, implementation, and evaluation of a second-generation version of Coarray Fortran. Research and development efforts of the project have focused on the CAF 2.0 language, compiler, runtime system, and supporting infrastructure. This has involved working with the teams that provide infrastructure for CAF that we rely on, implementing new language and runtime features, producing an open source compiler that enabled us to evaluate our ideas, and evaluating our design and implementation through the use of benchmarks. The report details the research, development, findings, and conclusions from this work.

A parallelcomputer including compute nodes, each including two reduction processing cores, a network write processing core, and a network read processing core, each processing core assigned an input buffer. Copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing core's input buffer; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.

A parallelcomputer executes a number of tasks, each task includes a number of endpoints and the endpoints are configured to support collective operations. In such a parallelcomputer, establishing a group of endpoints receiving a user specification of a set of endpoints included in a global collection of endpoints, where the user specification defines the set in accordance with a predefined virtual representation of the endpoints, the predefined virtual representation is a data structure setting forth an organization of tasks and endpoints included in the global collection of endpoints and the user specification defines the set of endpoints without a user specification of a particular endpoint; and defining a group of endpoints in dependence upon the predefined virtual representation of the endpoints and the user specification.

Full Text Available With high speed and accuracy the parallel manipulators have wide application in the industry, but there still exist many difficulties in the actual control process because of the time-varying and coupling. Unfortunately, the present-day commercial controlles cannot provide satisfying performance for its single axis linear control only. Therefore, aimed at a novel 2-DOF (Degree of Freedom parallel manipulator called Diamond 600, a motor-mechanism coupling dynamic model based control scheme employing the computed torque control algorithm are presented in this paper. First, the integrated dynamic coupling model is deduced, according to equivalent torques between the mechanical structure and the PM (Permanent Magnetism servomotor. Second, computed torque controller is described in detail for the above proposed model. At last, a series of numerical simulations and experiments are carried out to test the effectiveness of the system, and the results verify the favourable tracking ability and robustness.

Full Text Available With high speed and accuracy the parallel manipulators have wide application in the industry, but there still exist many difficulties in the actual control process because of the time-varying and coupling. Unfortunately, the present-day commercial controlles cannot provide satisfying performance for its single axis linear control only. Therefore, aimed at a novel 2-DOF (Degree of Freedom parallel manipulator called Diamond 600, a motor-mechanism coupling dynamic model based control scheme employing the computed torque control algorithm are presented in this paper. First, the integrated dynamic coupling model is deduced, according to equivalent torques between the mechanical structure and the PM (Permanent Magnetism servomotor. Second, computed torque controller is described in detail for the above proposed model. At last, a series of numerical simulations and experiments are carried out to test the effectiveness of the system, and the results verify the favourable tracking ability and robustness.

Lattice Boltzmann Method (LBM) has advanced as a class of ComputationalFluid Dynamics (CFD) methods used to solve complex fluid systems and heat transfer problems. It has ever-increasingly attracted the interest of researchers in computational physics to solve challenging problems of industrial and academic importance. In this current study, LBM is applied to simulate the creeping fluid flow phenomena commonly encountered in manufacturing technologies. In particular, we apply this novel method to simulate the fluid flow phenomena associated with the "meniscus roll coating" application. This prevalent industrial problem encountered in polymer processing and thin film coating applications is modelled as standard lid-driven cavity problem to which creeping flow analysis is applied. This incompressible viscous flow problem is studied in various speed ratios, the ratio of upper to lower lid speed in two different configurations of lid movement - parallel and anti-parallel wall motion. The flow exhibits interesting patterns which will help in design of roll coaters.

This paper is based on the new REHVA Guidebook ComputationalFluid Dynamics in Ventilation Design (Nielsen et al. 2007) written by Peter V. Nielsen, Francis(Nielsen 2007) written by Peter V. Nielsen, Francis Allard, Hazim B. Awbi, Lars Davidson and Alois Schälin. The guidebook is made for people....... The guidebook introduces rules for good quality prediction work, and it is the purpose of the guidebook to improve the technical level of CFD work in ventilation.......This paper is based on the new REHVA Guidebook ComputationalFluid Dynamics in Ventilation Design (Nielsen et al. 2007) written by Peter V. Nielsen, Francis(Nielsen 2007) written by Peter V. Nielsen, Francis Allard, Hazim B. Awbi, Lars Davidson and Alois Schälin. The guidebook is made for people...... who need to use and discuss results based on CFD predictions, and it gives insight into the subject for those who are not used to work with CFD. The guidebook is also written for people working with CFD who have to be more aware of how this numerical method is applied in the area of ventilation...

We developed a computer noise simulation model for cone beam computed tomography imaging using a general purpose PC cluster. This model uses a mono-energetic x-ray approximation and allows us to investigate three primary performance components, specifically quantum noise, detector blurring and additive system noise. A parallel random number generator based on the Weyl sequence was implemented in the noise simulation and a visualization technique was accordingly developed to validate the quality of the parallel random number generator. In our computer simulation model, three-dimensional (3D) phantoms were mathematically modelled and used to create 450 analytical projections, which were then sampled into digital image data. Quantum noise was simulated and added to the analytical projection image data, which were then filtered to incorporate flat panel detector blurring. Additive system noise was generated and added to form the final projection images. The Feldkamp algorithm was implemented and used to reconstruct the 3D images of the phantoms. A 24 dual-Xeon PC cluster was used to compute the projections and reconstructed images in parallel with each CPU processing 10 projection views for a total of 450 views. Based on this computer simulation system, simulated cone beam CT images were generated for various phantoms and technique settings. Noise power spectra for the flat panel x-ray detector and reconstructed images were then computed to characterize the noise properties. As an example among the potential applications of our noise simulation model, we showed that images of low contrast objects can be produced and used for image quality evaluation

A control logic device performs a local rollback in a parallel super computing system. The super computing system includes at least one cache memory device. The control logic device determines a local rollback interval. The control logic device runs at least one instruction in the local rollback interval. The control logic device evaluates whether an unrecoverable condition occurs while running the at least one instruction during the local rollback interval. The control logic device checks whether an error occurs during the local rollback. The control logic device restarts the local rollback interval if the error occurs and the unrecoverable condition does not occur during the local rollback interval.

In this paper a parallel implementation of a watershed algorithm is proposed. The algorithm can easily be implemented on shared memory parallelcomputers. The watershed transform is generally considered to be inherently sequential since the discrete watershed of an image is defined using recursion.

This paper describes the present and expected future development of distributed memory parallelcomputers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC systems, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described. 6 figs

Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallelcomputer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to segments of shared random access memory through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and a segment of shared memory; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.

Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallelcomputer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks, a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.

Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallelcomputer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and ranges of message sizes so that each algorithm is associated with a separate range of message sizes; receiving in an origin endpoint of the PAMI a data communications instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint, the data communications message characterized by a message size; selecting, from among the associated algorithms and ranges, a data communications algorithm in dependence upon the message size; and transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.

This paper describes the present and expected future development of distributed memory parallelcomputers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC systems, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described. (orig.)

This paper describes the present and expected future development of distributed memory parallelcomputers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC system, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described.

Methods, apparatus, and products are disclosed for performing an allreduce operation on a plurality of compute nodes of a parallelcomputer. Each compute node includes at least two processing cores. Each processing core has contribution data for the allreduce operation. Performing an allreduce operation on a plurality of compute nodes of a parallelcomputer includes: establishing one or more logical rings among the compute nodes, each logical ring including at least one processing core from each compute node; performing, for each logical ring, a global allreduce operation using the contribution data for the processing cores included in that logical ring, yielding a global allreduce result for each processing core included in that logical ring; and performing, for each compute node, a local allreduce operation using the global allreduce results for each processing core on that compute node.

This dissertation deals with the architecture of parallelcomputers dedicated to computer vision. In the first chapter, the problem to be solved is presented, as well as the architecture of the Sympati and Symphonie computers, on which this work is based. The second chapter is about the state of the art of computers and integrated processors that can execute computer vision and image processing codes. The third chapter contains a description of the architecture of Centaure. It has an heterogeneous structure: it is composed of a multiprocessor system based on Analog Devices ADSP21060 Sharc digital signal processor, and of a set of Symphonie computers working in a multi-SIMD fashion. Centaure also has a modular structure. Its basic node is composed of one Symphonie computer, tightly coupled to a Sharc thanks to a dual ported memory. The nodes of Centaure are linked together by the Sharc communication links. The last chapter deals with a performance validation of Centaure. The execution times on Symphonie and on Centaure of a benchmark which is typical of industrial vision, are presented and compared. In the first place, these results show that the basic node of Centaure allows a faster execution than Symphonie, and that increasing the size of the tested computer leads to a better speed-up with Centaure than with Symphonie. In the second place, these results validate the choice of running the low level structure of Centaure in a multi- SIMD fashion. (author) [fr

Standard multigrid methods are not well suited for problems with anisotropic coefficients which can occur, for example, on grids that are stretched to resolve a boundary layer. There are several different modifications of the standard multigrid algorithm that yield efficient methods for anisotropic problems. In the paper, we investigate the parallel performance of these multigrid algorithms. Multigrid algorithms which work well for anisotropic problems are based on line relaxation and/or semi-coarsening. In semi-coarsening multigrid algorithms a grid is coarsened in only one of the coordinate directions unlike standard or full-coarsening multigrid algorithms where a grid is coarsened in each of the coordinate directions. When both semi-coarsening and line relaxation are used, the resulting multigrid algorithm is robust and automatic in that it requires no knowledge of the nature of the anisotropy. This is the basic multigrid algorithm whose parallel performance we investigate in the paper. The algorithm is currently being implemented on an IBM SP2 and its performance is being analyzed. In addition to looking at the parallel performance of the basic semi-coarsening algorithm, we present algorithmic modifications with potentially better parallel efficiency. One modification reduces the amount of computational work done in relaxation at the expense of using multiple coarse grids. This modification is also being implemented with the aim of comparing its performance to that of the basic semi-coarsening algorithm.

The aim of dynamic contingency calculations in power systems is to estimate the effects of assumed disturbances, such as loss of generation. Due to the large dimensions of the problem these simulations require considerable computing time and costs, to the effect that they are at present only used in a planning state but not for routine checks in power control stations. In view of the homogeneity of the problem, where a multitude of equal generator models, having different parameters, are to be integrated simultaneously, the use of a parallelcomputer looks very attractive. The results of this study employing a prototype parallelcomputer (SMS 201) are presented. It consists of up to 128 equal microcomputers bus-connected to a control computer. Each of the modules is programmed to simulate a node of the power grid. Generators with their associated control are represented by models of 13 states each. Passive nodes are complemented by 'phantom'-generators, so that the whole power grid is homogenous, thus removing the need for load-flow-iterations. Programming of microcomputers is essentially performed in FORTRAN.

The fast multipole method (FMM) and smooth particle mesh Ewald (SPME) are well known fast algorithms to evaluate long range electrostatic interactions in molecular dynamics and other fields. FMM is a multi-scale method which reduces the computation cost by approximating the potential due to a group of particles at a large distance using few multipole functions. This algorithm scales like O(N) for N particles. SPME algorithm is an O(NlnN) method which is based on an interpolation of the Fourier space part of the Ewald sum and evaluating the resulting convolutions using fast Fourier transform (FFT). Those algorithms suffer from relatively poor efficiency on large parallel machines especially for mid-size problems around hundreds of thousands of atoms. A variation of the FMM, called PWA, based on plane wave expansions is presented in this paper. A new parallelization strategy for PWA, which takes advantage of the specific form of this expansion, is described. Its parallel efficiency is compared with SPME through detail time measurements on two different computer clusters

A two-dimensional explicit Euler solver has been implemented for five MIMD parallelcomputers of different machine architectures in Center for Promotion of Computational Science and Engineering of Japan Atomic Energy Research Institute. These parallelcomputers are Fujitsu VPP300, NEC SX-4, CRAY T94, IBM SP2, and Hitachi SR2201. The code was parallelized by several parallelization methods, and a typical compressible flow problem has been calculated for different grid sizes changing the number of processors. Their effective performances for parallel calculations, such as calculation speed, speed-up ratio and parallel efficiency, have been investigated and evaluated. The communication time among processors has been also measured and evaluated. As a result, the differences on the performance and the characteristics between vector-parallel and scalar-parallelcomputers can be pointed, and it will present the basic data for efficient use of parallelcomputers and for large scale CFD simulations on parallelcomputers. (author)

1 - Description of program or function: CFDLib05 is the Los Alamos ComputationalFluid Dynamics Library. This is a collection of hydro-codes using a common data structure and a common numerical method, for problems ranging from single-field, incompressible flow, to multi-species, multi-field, compressible flow. The data structure is multi-block, with a so-called structured grid in each block. The numerical method is a Finite-Volume scheme employing a state vector that is fully cell-centered. This means that the integral form of the conversation laws is solved on the physical domain that is represented by a mesh of control volumes. The typical control volume is an arbitrary quadrilateral in 2D and an arbitrary hexahedron in 3D. The Finite-Volume scheme is for time-unsteady flow and remains well coupled by means of time and space centered fluxes; if a steady state solution is required, the problem is integrated forward in time until the user is satisfied that the state is stationary. 2 - Methods: Cells-centered Implicit Continuous-fluid Eulerian (ICE) method

Models in sheet cavitation in cryogenic fluids are developed for use in Euler and Navier-Stokes codes. The models are based upon earlier potential-flow models but enable the cavity inception point, length, and shape to be determined as part of the computation. In the present paper, numerical solutions are compared with experimental measurements for both pressure distribution and cavity length. Comparisons between models are also presented. The CFD model provides a relatively simple modification to an existing code to enable cavitation performance predictions to be included. The analysis also has the added ability of incorporating thermodynamic effects of cryogenic fluids into the analysis. Extensions of the current two-dimensional steady state analysis to three-dimensions and/or time-dependent flows are, in principle, straightforward although geometrical issues become more complicated. Linearized models, however offer promise of providing effective cavitation modeling in three-dimensions. This analysis presents good potential for improved understanding of many phenomena associated with cavity flows.

Paper compares four first-generation artificial-intelligence (Al) software systems for computationalfluid dynamics. Includes: Expert Cooling Fan Design System (EXFAN), PAN AIR Knowledge System (PAKS), grid-adaptation program MITOSIS, and Expert Zonal Grid Generation (EZGrid). Focuses on knowledge-based ("expert") software systems. Analyzes intended tasks, kinds of knowledge possessed, magnitude of effort required to codify knowledge, how quickly constructed, performances, and return on investment. On basis of comparison, concludes Al most successful when applied to well-formulated problems solved by classifying or selecting preenumerated solutions. In contrast, application of Al to poorly understood or poorly formulated problems generally results in long development time and large investment of effort, with no guarantee of success.

An investigation has been conducted regarding the ability of clustered personal computers to improve the performance of executing software simulations for solving engineering problems. The power and utility of personal computers continues to grow exponentially through advances in computing capabilities such as newer microprocessors, advances in microchip technologies, electronic packaging, and cost effective gigabyte-size hard drive capacity. Many engineering problems require significant computing power. Therefore, the computation has to be done by high-performance computer systems that cost millions of dollars and need gigabytes of memory to complete the task. Alternately, it is feasible to provide adequate computing in the form of clustered personal computers. This method cuts the cost and size by linking (clustering) personal computers together across a network. Clusters also have the advantage that they can be used as stand-alone computers when they are not operating as a parallelcomputer. Parallelcomputing software to exploit clusters is available for computer operating systems like Unix, Windows NT, or Linux. This project concentrates on the use of Windows NT, and the Parallel Virtual Machine (PVM) system to solve an engineering dynamics problem in Fortran

Highlights: ► A performance benchmarking exercise is conducted for diesel combustion simulations. ► The reduced chemical mechanism shows its advantages over base and skeletal models. ► High efficiency and great reduction of CPU runtime are achieved through 4-node solver. ► Increasing ISAT memory from 0.1 to 2 GB reduces the CPU runtime by almost 35%. ► Combustion and soot processes are predicted well with minimal computational cost. - Abstract: In the present study, in-cylinder diesel combustion simulation was performed with parallel processing on an Intel Xeon Quad-Core platform to allow both fluid dynamics and chemical kinetics of the surrogate diesel fuel model to be solved simultaneously on multiple processors. Here, Cartesian Z-Coordinate was selected as the most appropriate partitioning algorithm since it computationally bisects the domain such that the dynamic load associated with fuel particle tracking was evenly distributed during parallelcomputations. Other variables examined included number of compute nodes, chemistry sizes and in situ adaptive tabulation (ISAT) parameters. Based on the performance benchmarking test conducted, parallel configuration of 4-compute node was found to reduce the computational runtime most efficiently whereby a parallel efficiency of up to 75.4% was achieved. The simulation results also indicated that accuracy level was insensitive to the number of partitions or the partitioning algorithms. The effect of reducing the number of species on computational runtime was observed to be more significant than reducing the number of reactions. Besides, the study showed that an increase in the ISAT maximum storage of up to 2 GB reduced the computational runtime by 50%. Also, the ISAT error tolerance of 10 −3 was chosen to strike a balance between results accuracy and computational runtime. The optimised parameters in parallel processing and ISAT, as well as the use of the in-house reduced chemistry model allowed accurate

This paper reviews the methods, benefits and challenges associated with the adoption and translation of computationalfluid dynamics (CFD) modelling within cardiovascular medicine. CFD, a specialist area of mathematics and a branch of fluid mechanics, is used routinely in a diverse range of safety-critical engineering systems, which increasingly is being applied to the cardiovascular system. By facilitating rapid, economical, low-risk prototyping, CFD modelling has already revolutionised research and development of devices such as stents, valve prostheses, and ventricular assist devices. Combined with cardiovascular imaging, CFD simulation enables detailed characterisation of complex physiological pressure and flow fields and the computation of metrics which cannot be directly measured, for example, wall shear stress. CFD models are now being translated into clinical tools for physicians to use across the spectrum of coronary, valvular, congenital, myocardial and peripheral vascular diseases. CFD modelling is apposite for minimally-invasive patient assessment. Patient-specific (incorporating data unique to the individual) and multi-scale (combining models of different length- and time-scales) modelling enables individualised risk prediction and virtual treatment planning. This represents a significant departure from traditional dependence upon registry-based, population-averaged data. Model integration is progressively moving towards 'digital patient' or 'virtual physiological human' representations. When combined with population-scale numerical models, these models have the potential to reduce the cost, time and risk associated with clinical trials. The adoption of CFD modelling signals a new era in cardiovascular medicine. While potentially highly beneficial, a number of academic and commercial groups are addressing the associated methodological, regulatory, education- and service-related challenges. Published by the BMJ Publishing Group Limited. For permission

The MARS (Multi-interfaces Advection and Reconstruction Solver) [1] is one of the surface volume tracking methods for multi-phase flows. Nowadays, the performance of GPU (Graphics Processing Unit) is much higher than the CPU (Central Processing Unit). In this study, the GPU was applied to the MARS in order to accelerate the computation of multi-phase flows (GPU-MARS), and the performance of the GPU-MARS was discussed. From the performance of the interface tracking method for the analyses of one-directional advection problem, it is found that the computing time of GPU(single GTX280) was around 4 times faster than that of the CPU (Xeon 5040, 4 threads parallelized). From the performance of Poisson Solver by using the algorithm developed in this study, it is found that the performance of the GPU showed around 30 times faster than that of the CPU. Finally, it is confirmed that the GPU showed the large acceleration of the fluid flow computation (GPU-MARS) compared to the CPU. However, it is also found that the double-precision computation of the GPU must perform with very high precision.

Even with the continual advances made in both computational algorithms and computer hardware used in reservoir modeling studies, large-scale simulation of fluid and heat flow in heterogeneous reservoirs remains a challenge. The problem commonly arises from intensive computational requirement for detailed modeling investigations of real-world reservoirs. This paper presents the application of a massive parallel-computing version of the TOUGH2 code developed for performing large-scale field simulations. As an application example, the parallelized TOUGH2 code is applied to develop a three-dimensional unsaturated-zone numerical model simulating flow of moisture, gas, and heat in the unsaturated zone of Yucca Mountain, Nevada, a potential repository for high-level radioactive waste. The modeling approach employs refined spatial discretization to represent the heterogeneous fractured tuffs of the system, using more than a million 3-D gridblocks. The problem of two-phase flow and heat transfer within the model domain leads to a total of 3,226,566 linear equations to be solved per Newton iteration. The simulation is conducted on a Cray T3E-900, a distributed-memory massively parallelcomputer. Simulation results indicate that the parallelcomputing technique, as implemented in the TOUGH2 code, is very efficient. The reliability and accuracy of the model results have been demonstrated by comparing them to those of small-scale (coarse-grid) models. These comparisons show that simulation results obtained with the refined grid provide more detailed predictions of the future flow conditions at the site, aiding in the assessment of proposed repository performance

Full Text Available The paper deals with various parallel platforms used for high performance computing in the signal processing domain. More precisely, the methods exploiting the multicores central processing units such as message passing interface and OpenMP are taken into account. The properties of the programming methods are experimentally proved in the application of a fast Fourier transform and a discrete cosine transform and they are compared with the possibilities of MATLAB's built-in functions and Texas Instruments digital signal processors with very long instruction word architectures. New FFT and DCT implementations were proposed and tested. The implementation phase was compared with CPU based computing methods and with possibilities of the Texas Instruments digital signal processing library on C6747 floating-point DSPs. The optimal combination of computing methods in the signal processing domain and new, fast routines' implementation is proposed as well.

The author presents a parallelization of an accelerator physics application to simulate magnetic field in three dimensions. The problem involves the evaluation of high order derivatives with respect to two variables of a multivariate function. Automatic differentiation software had been used with some success, but the computation time was prohibitive. The implementation runs on several platforms, including a network of workstations using PVM, a MasPar using MPFortran, and a CM-5 using CMFortran. A careful examination of the code led to several optimizations that improved its serial performance by a factor of 8.7. The parallelization produced further improvements, especially on the MasPar with a speedup factor of 620. As a result a problem that took six days on a SPARC 10/41 now runs in minutes on the MasPar, making it feasible for physicists at Lawrence Berkeley Laboratory to simulate larger magnets

A distributed preconditioned conjugate gradient method for finite element analysis has been developed and implemented on a parallel SIMD Quadrics computer. The main characteristic of the method is that it does not require any actual assembling of all element equations in a global system. The physical domain of the problem is partitioned in cells of n p finite elements and each cell element is assigned to a different node of an n p -processors machine. Element stiffness matrices are stored in the data memory of the assigned processing node and the solution process is completely executed in parallel at element level. Inter-element and therefore inter-processor communications are required once per iteration to perform local sums of vector quantities between neighbouring elements. A prototype implementation has been tested on an 8-nodes Quadrics machine in a simple 2D benchmark problem

SLAC has proposed the detuned structure (DS) as one possible design to control the emittance growth of long bunch trains due to transverse wakefields in the Next Linear Collider (NLC). The DS consists of 206 cells with tapering from cell to cell of the order of few microns to provide Gaussian detuning of the dipole modes. The decoherence of these modes leads to two orders of magnitude reduction in wakefield experienced by the trailing bunch. To model such a large heterogeneous structure realistically is impractical with finite-difference codes using structured grids. The authors have calculated the wakefield in the DS on a parallelcomputer with a finite-element code using an unstructured grid. The parallel implementation issues are presented along with simulation results that include contributions from higher dipole bands and wall dissipation

In order to avoid the integral bottleneck of conventional SCF calculations, the Resolution of the Identity (RI) method is used to obtain an approximate solution to the Hartree-Fock equations. In this approximation only three-center integrals are needed to build the Fock matrix. It has been implemented as part of the NWChem package of portable and scalable ab initio programs for parallelcomputers. Utilizing the V-approximation, both the Coulomb and exchange contribution to the Fock matrix can be calculated from a transformed set of three-center integrals which have to be precalculated and stored. A distributed in-core method as well as a disk based implementation have been programmed. Details of the implementation as well as the parallel programming tools used are described. We also give results and timings from benchmark calculations

Non-linear computationalfluid models of plasma turbulence based on spectral methods typically spend a large fraction of the total computing time evaluating convolutions. Usually these convolutions arise from an explicit or semi implicit treatment of the convective non-linearities in the problem. Often the principal convective velocity is perpendicular to magnetic field lines allowing a reduction of the convolution to two dimensions in an appropriate geometry, but beyond this, different models vary widely in the particulars of which mode amplitudes are selectively evolved to get the most efficient representation of the turbulence. As the number of modes in the problem, N, increases, the amount of computation required for this part of the evolution algorithm then scales as N 2 /timestep for a direct or analytic method and N ln N/timestep for a pseudospectral method. The constants of proportionality depend on the particulars of mode selection and determine the size problem for which the method will perform equally. For large enough N, the pseudospectral method performance is always superior, though some problems do not require correspondingly high resolution. Further, the Courant condition for numerical stability requires that the timestep size must decrease proportionately as N increases, thus accentuating the need to have fast methods for larger N problems. The authors have developed a package for the Cray system which performs these convolutions for a rather arbitrary mode selection scheme using either method. The package is highly optimized using a combination of macro and microtasking techniques, as well as vectorization and in some cases assembly coded routines. Parts of the package have also been developed and optimized for the CM200 and CM5 system. Performance comparisons with respect to problem size, parallelization, selection schemes and architecture are presented

Fermilab's Advanced Computer Program (ACP) has been developing highly cost effective, yet practical, parallelcomputers for high energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor, has been developed for theoretical physics. Each $4000 node is a FORTRAN or C programmable pipelined 20 MFlops (peak), 10 MByte single board computer. These are plugged into a 16 port crossbar switch crate which handles both inter and intra crate communication. The crates are connected in a hypercube. Site oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256 node, 5 GFlop, system is under construction. 10 refs., 7 figs

Light acquires or loses coherence and coherence is one of the few optical observables. Spectra can be derived from coherence functions and understanding any interferometric experiment is also relying upon coherence functions. Beyond the two limiting cases (full coherence or incoherence) the coherence of light is always partial and it changes with propagation. We have implemented a code to compute the propagation of partially coherent light from the source plane to the observation plane using parallelcomputing devices (PCDs). In this paper, we restrict the propagation in free space only. To this end, we used the Open Computing Language (OpenCL) and the open-source toolkit PyOpenCL, which gives access to OpenCL parallelcomputation through Python. To test our code, we chose two coherence source models: an incoherent source and a Gaussian Schell-model source. In the former case, we divided into two different source shapes: circular and rectangular. The results were compared to the theoretical values. Our implemented code allows one to choose between the PyOpenCL implementation and a standard one, i.e using the CPU only. To test the computation time for each implementation (PyOpenCL and standard), we used several computer systems with different CPUs and GPUs. We used powers of two for the dimensions of the cross-spectral density matrix (e.g. 324, 644) and a significant speed increase is observed in the PyOpenCL implementation when compared to the standard one. This can be an important tool for studying new source models.

One of the research activities to support the commercial radioisotope production program is a safety research target irradiation FPM (Fission Product Molybdenum). FPM targets form a tube made of stainless steel in which the nuclear degrees of superimposed high-enriched uranium. FPM irradiation tube is intended to obtain fission. The fission material widely used in the form of kits in the world of nuclear medicine. Irradiation FPM tube reactor core would interfere with performance. One of the disorders comes from changes in flux or reactivity. It is necessary to study a method for calculating safety terrace ongoing configuration changes during the life of the reactor, making the code faster became an absolute necessity. Neutron safety margin for the research reactor can be reused without modification to the calculation of the reactivity of the reactor, so that is an advantage of using perturbation method. The criticality and flux in multigroup diffusion model was calculate at various irradiation positions in some uranium content. This model has a complex computation. Several parallel algorithms with iterative method have been developed for the sparse and big matrix solution. The Black-Red Gauss Seidel Iteration and the power iteration parallel method can be used to solve multigroup diffusion equation system and calculated the criticality and reactivity coeficient. This research was developed code for reactivity calculation which used one of safety analysis with parallel processing. It can be done more quickly and efficiently by utilizing the parallel processing in the multicore computer. This code was applied for the safety limits calculation of irradiated targets FPM with increment Uranium.

All fluid dynamic equations are valid under their modeling scales, such as the particle mean free path and mean collision time scale of the Boltzmann equation and the hydrodynamic scale of the Navier-Stokes (NS) equations. The current computationalfluid dynamics (CFD) focuses on the numerical solution of partial differential equations (PDEs), and its aim is to get the accurate solution of these governing equations. Under such a CFD practice, it is hard to develop a unified scheme that covers flow physics from kinetic to hydrodynamic scales continuously because there is no such governing equation which could make a smooth transition from the Boltzmann to the NS modeling. The study of fluid dynamics needs to go beyond the traditional numerical partial differential equations. The emerging engineering applications, such as air-vehicle design for near-space flight and flow and heat transfer in micro-devices, do require further expansion of the concept of gas dynamics to a larger domain of physical reality, rather than the traditional distinguishable governing equations. At the current stage, the non-equilibrium flow physics has not yet been well explored or clearly understood due to the lack of appropriate tools. Unfortunately, under the current numerical PDE approach, it is hard to develop such a meaningful tool due to the absence of valid PDEs. In order to construct multiscale and multiphysics simulation methods similar to the modeling process of constructing the Boltzmann or the NS governing equations, the development of a numerical algorithm should be based on the first principle of physical modeling. In this paper, instead of following the traditional numerical PDE path, we introduce direct modeling as a principle for CFD algorithm development. Since all computations are conducted in a discretized space with limited cell resolution, the flow physics to be modeled has to be done in the mesh size and time step scales. Here, the CFD is more or less a direct

We considered the transient MHD flow of a reactive second grade fluid through porous medium between two infinitely long horizontal parallel plates when one of the plate is set into uniform accelerated motion in the presence of a uniform transverse magnetic field under Arrhenius reaction rate. The governing equations are solved by Laplace transform technique. The effects of the pertinent parameters on the velocity, temperature are discussed in detail. The shear stress and Nusselt number at the plates are also obtained analytically and computationally discussed with reference to governing parameters.

Full Text Available Research in scientitic programming enables us to realize more and more complex applications, and on the other hand, application-driven demands on computing methods and power are continuously growing. Therefore, interdisciplinary approaches become more widely used. The interdisciplinary SPINET project presented in this article applies modern scientific computing tools to biomechanical simulations: parallelcomputing and symbolic and modern functional programming. The target application is the human spine. Simulations of the spine help us to investigate and better understand the mechanisms of back pain and spinal injury. Two approaches have been used: the first uses the finite element method for high-performance simulations of static biomechanical models, and the second generates a simulation developmenttool for experimenting with different dynamic models. A finite element program for static analysis has been parallelized for the MUSIC machine. To solve the sparse system of linear equations, a conjugate gradient solver (iterative method and a frontal solver (direct method have been implemented. The preprocessor required for the frontal solver is written in the modern functional programming language SML, the solver itself in C, thus exploiting the characteristic advantages of both functional and imperative programming. The speedup analysis of both solvers show very satisfactory results for this irregular problem. A mixed symbolic-numeric environment for rigid body system simulations is presented. It automatically generates C code from a problem specification expressed by the Lagrange formalism using Maple.

Quantum transport models for nanodevices using the non-equilibrium Green's function method require the repeated calculation of the block tridiagonal part of the Green's and lesser Green's function matrices. This problem is related to the calculation of the inverse of a sparse matrix. Because of the large number of times this calculation needs to be performed, this is computationally very expensive even on supercomputers. The classical approach is based on recurrence formulas which cannot be efficiently parallelized. This practically prevents the solution of large problems with hundreds of thousands of atoms. We propose new recurrences for a general class of sparse matrices to calculate Green's and lesser Green's function matrices which extend formulas derived by Takahashi and others. We show that these recurrences may lead to a dramatically reduced computational cost because they only require computing a small number of entries of the inverse matrix. Then, we propose a parallelization strategy for block tridiagonal matrices which involves a combination of Schur complement calculations and cyclic reduction. It achieves good scalability even on problems of modest size.

The size of high throughput DNA sequencing data has already reached the terabyte scale. To manage this huge volume of data, many downstream sequencing applications started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay as you go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, developing scalable data-intensive bioinformatics applications using this model and understanding the hardware environment that these applications require for good performance, both require further research. In this paper, we present a de Bruijn graph oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses the power of Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating the computation and data. GiGA achieves significantly higher scalability with competitive assembly quality compared to contemporary parallel assemblers (e.g. ABySS and Contrail) over traditional HPC cluster. Moreover, we show that the performance of GiGA is significantly improved by using an SSD-based private cloud infrastructure over traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches that of 512 cores of traditional HPC cluster.

This volume contains papers from most of the invited talks and from several of the contributed talks and poster sessions presented at VAPP II. The contents present an extensive coverage of all important aspects of vector and parallel processors, including hardware, languages, numerical algorithms and applications. The topics covered include descriptions of new machines (both research and commercial machines), languages and software aids, and general discussions of whole classes of machines and their uses. Numerical methods papers include Monte Carlo algorithms, iterative and direct methods for solving large systems, finite elements, optimization, random number generation and mathematical software. The specific applications covered include neutron diffusion calculations, molecular dynamics, weather forecasting, lattice gauge calculations, fluid dynamics, flight simulation, cartography, image processing and cryptography. Most machines and architecture types are being used for these applications. many refs.

Full Text Available CX, a network-based computational exchange, is presented. The system's design integrates variations of ideas from other researchers, such as work stealing, non-blocking tasks, eager scheduling, and space-based coordination. The object-oriented API is simple, compact, and cleanly separates application logic from the logic that supports interprocess communication and fault tolerance. Computations, of course, run to completion in the presence of computational hosts that join and leave the ongoing computation. Such hosts, or producers, use task caching and prefetching to overlap computation with interprocessor communication. To break a potential task server bottleneck, a network of task servers is presented. Even though task servers are envisioned as reliable, the self-organizing, scalable network of n- servers, described as a sibling-connected height-balanced fat tree, tolerates a sequence of n-1 server failures. Tasks are distributed throughout the server network via a simple "diffusion" process. CX is intended as a test bed for research on automated silent auctions, reputation services, authentication services, and bonding services. CX also provides a test bed for algorithm research into network-based parallelcomputation.

Verification and validation (V&V) are the primary means to assess accuracy and reliability in computational simulations. This paper presents an extensive review of the literature in V&V in computationalfluid dynamics (CFD), discusses methods and procedures for assessing V&V, and develops a number of extensions to existing ideas. The review of the development of V&V terminology and methodology points out the contributions from members of the operations research, statistics, and CFD communities. Fundamental issues in V&V are addressed, such as code verification versus solution verification, model validation versus solution validation, the distinction between error and uncertainty, conceptual sources of error and uncertainty, and the relationship between validation and prediction. The fundamental strategy of verification is the identification and quantification of errors in the computational model and its solution. In verification activities, the accuracy of a computational solution is primarily measured relative to two types of highly accurate solutions: analytical solutions and highly accurate numerical solutions. Methods for determining the accuracy of numerical solutions are presented and the importance of software testing during verification activities is emphasized. The fundamental strategy of validation is to assess how accurately the computational results compare with the experimental data, with quantified error and uncertainty estimates for both. This strategy employs a hierarchical methodology that segregates and simplifies the physical and coupling phenomena involved in the complex engineering system of interest. A hypersonic cruise missile is used as an example of how this hierarchical structure is formulated. The discussion of validation assessment also encompasses a number of other important topics. A set of guidelines is proposed for designing and conducting validation experiments, supported by an explanation of how validation experiments are different

Journal Article Three-dimensional computationalfluid dynamics and Lagrangian particle deposition models were developed to compare the deposition of aerosolized Bacillus anthracis spores in the respiratory airways of a human with that of the rabbit, a species commonly used in the study of anthrax disease. The respiratory airway geometries for each species were derived from computed tomography (CT) or µCT images. Both models encompassed airways that extended from the external nose to the lung with a total of 272 outlets in the human model and 2878 outlets in the rabbit model. All simulations of spore deposition were conducted under transient, inhalation-exhalation breathing conditions using average species-specific minute volumes. Four different exposure scenarios were modeled in the rabbit based upon experimental inhalation studies. For comparison, human simulations were conducted at the highest exposure concentration used during the rabbit experimental exposures. Results demonstrated that regional spore deposition patterns were sensitive to airway geometry and ventilation profiles. Despite the complex airway geometries in the rabbit nose, higher spore deposition efficiency was predicted in the upper conducting airways of the human at the same air concentration of anthrax spores. This greater deposition of spores in the upper airways in the human resulted in lower penetration and deposition in the tracheobronchial airways and the deep lung than that predict

An apparatus, program product and method optimize the operation of a massively parallelcomputer system by, in part, receiving actual performance data concerning an application executed by the plurality of interconnected nodes, and analyzing the actual performance data to identify an actual performance pattern. A desired performance pattern may be determined for the application, and an algorithm may be selected from among a plurality of algorithms stored within a memory, the algorithm being configured to achieve the desired performance pattern based on the actual performance data.

A program has been written which performs particle orbit tracking on the Intel iPSC/860 distributed memory parallelcomputer. The tracking is performed using a thin element approach. A brief description of the structure and performance of the code is presented, along with applications of the code to the analysis of accelerator lattices for the SSC. The concept of ''ensemble tracking'', i.e. the tracking of ensemble averages of noninteracting particles, such as the emittance, is presented. Preliminary results of such studies will be presented. 2 refs., 6 figs

offers the potential for parallelizing skyline computation across thousands of cores. However, attempts to port skyline algorithms to the GPU have prioritized throughput and failed to outperform sequential algorithms. In this paper, we introduce a new skyline algorithm, designed for the GPU, that uses...... a global, static partitioning scheme. With the partitioning, we can permit controlled branching to exploit transitive relationships and avoid most point-to-point comparisons. The result is a non-traditional GPU algorithm, SkyAlign, that prioritizes work-effciency and respectable throughput, rather than...

What is the impact of multicore and associated advanced technologies on computational software for science? Most researchers and students have multicore laptops or desktops for their research and they need computing power to run computational software packages. Computing power was initially derived from Central Processing Unit (CPU) clock speed. That changed when increases in clock speed became constrained by power requirements. Chip manufacturers turned to multicore CPU architectures and associated technological advancements to create the CPUs for the future. Most software applications benefited by the increased computing power the same way that increases in clock speed helped applications run faster. However, for Computational ElectroMagnetics (CEM) software developers, this change was not an obvious benefit - it appeared to be a detriment. Developers were challenged to find a way to correctly utilize the advancements in hardware so that their codes could benefit. The solution was parallelization and this dissertation details the investigation to address these challenges. Prior to multicore CPUs, advanced computer technologies were compared with the performance using benchmark software and the metric was FLoting-point Operations Per Seconds (FLOPS) which indicates system performance for scientific applications that make heavy use of floating-point calculations. Is FLOPS an effective metric for parallelized CEM simulation tools on new multicore system? Parallel CEM software needs to be benchmarked not only by FLOPS but also by the performance of other parameters related to type and utilization of the hardware, such as CPU, Random Access Memory (RAM), hard disk, network, etc. The codes need to be optimized for more than just FLOPs and new parameters must be included in benchmarking. In this dissertation, the parallel CEM software named High Order Basis Based Integral Equation Solver (HOBBIES) is introduced. This code was developed to address the needs of the

Efficient numerical tools coupled with high-performance computers, have become a key element of the design process in the fields of energy supply and transportation. However flow phenomena that occur in complex systems such as gas turbines and aircrafts are still not understood mainly because of the models that are needed. In fact, most computationalfluid dynamics (CFD) predictions as found today in industry focus on a reduced or simplified version of the real system (such as a periodic sector) and are usually solved with a steady-state assumption. This paper shows how to overcome such barriers and how such a new challenge can be addressed by developing flow solvers running on high-end computing platforms, using thousands of computing cores. Parallel strategies used by modern flow solvers are discussed with particular emphases on mesh-partitioning, load balancing and communication. Two examples are used to illustrate these concepts: a multi-block structured code and an unstructured code. Parallelcomputing strategies used with both flow solvers are detailed and compared. This comparison indicates that mesh-partitioning and load balancing are more straightforward with unstructured grids than with multi-block structured meshes. However, the mesh-partitioning stage can be challenging for unstructured grids, mainly due to memory limitations of the newly developed massively parallel architectures. Finally, detailed investigations show that the impact of mesh-partitioning on the numerical CFD solutions, due to rounding errors and block splitting, may be of importance and should be accurately addressed before qualifying massively parallel CFD tools for a routine industrial use.

This paper proposes a general method for incorporating rule-based constraints corresponding to regular languages into stochastic inference problems, thereby allowing for a unified representation of stochastic and syntactic pattern constraints. The authors' approach first established the formal connection of rules to Chomsky grammars, and generalizes the original work of Shannon on the encoding of rule-based channel sequences to Markov chains of maximum entropy. This maximum entropy probabilistic view leads to Gibb's representations with potentials which have their number of minima growing at precisely the exponential rate that the language of deterministically constrained sequences grow. These representations are coupled to stochastic diffusion algorithms, which sample the language-constrained sequences by visiting the energy minima according to the underlying Gibbs' probability law. The coupling to stochastic search methods yields the all-important practical result that fully parallel stochastic cellular automata may be derived to generate samples from the rule-based constraint sets. The production rules and neighborhood state structure of the language of sequences directly determines the necessary connection structures of the required parallelcomputing surface. Representations of this type have been mapped to the DAP-510 massively-parallel processor consisting of 1024 mesh-connected bit-serial processing elements for performing automated segmentation of electron-micrograph images.

The Idaho National Laboratory (INL), under the auspices of the U.S. Department of Energy, is performing research and development that focuses on key phenomena important during potential scenarios that may occur in very high temperature reactors (VHTRs). Phenomena Identification and Ranking Studies to date have ranked an air ingress event, following on the heels of a VHTR depressurization, as important with regard to core safety. Consequently, the development of advanced air ingress-related models and verification and validation data are a very high priority. Following a loss of coolant and system depressurization incident, air will enter the core of the High Temperature Gas Cooled Reactor through the break, possibly causing oxidation of the in-the core and reflector graphite structure. Simple core and plant models indicate that, under certain circumstances, the oxidation may proceed at an elevated rate with additional heat generated from the oxidation reaction itself. Under postulated conditions of fluid flow and temperature, excessive degradation of the lower plenum graphite can lead to a loss of structural support. Excessive oxidation of core graphite can also lead to the release of fission products into the confinement, which could be detrimental to a reactor safety. Computationalfluid dynamic model developed in this study will improve our understanding of this phenomenon. This paper presents two-dimensional and three-dimensional CFD results for the quantitative assessment of the air ingress phenomena. A portion of results of the density-driven stratified flow in the inlet pipe will be compared with results of the experimental results.

This paper proposes the organization of a parallelcomputer system based on Graphic Processors Unit (GPU) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of tracing rays intersections with scene objects. The system allows significant increase in the productivity for the 3D stereo synthesis of photorealistic quality. The generalized procedure of 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed solutions by GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration in multi-thread implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of choosing the size and configuration of the computationalCompute Unified Device Archi-tecture (CUDA) network on the computational speed shows the importance of their correct selection. The obtained experimental estimations can be significantly improved by new GPUs with a large number of processing cores and multiprocessors, as well as optimized configuration of the computing CUDA network.

Important Nomenclature Kinematics of Fluid Motion Introduction to Continuum Motion Fluid Particles Inertial Coordinate Frames Motion of a Continuum The Time Derivatives Velocity and Acceleration Steady and Nonsteady Flow Trajectories of Fluid Particles and Streamlines Material Volume and Surface Relation between Elemental Volumes Kinematic Formulas of Euler and Reynolds Control Volume and Surface Kinematics of Deformation Kinematics of Vorticity and Circulation References Problems The Conservation Laws and the Kinetics of Flow Fluid Density and the Conservation of Mass Prin

In this talk I discuss the general question of the portability of molecular dynamics codes for diffusive systems on parallelcomputers of the APE family. The intrinsic single precision of the today available platforms does not seem to affect the numerical accuracy of the simulations, while the absence of integer addressing from CPU to individual nodes puts strong constraints on possible programming strategies. Liquids can be satisfactorily simulated using the ''systolic'' method. For more complex systems, like the biological ones at which we are ultimately interested in, the ''domain decomposition'' approach is best suited to beat the quadratic growth of the inter-molecular computational time with the number of atoms of the system. The promising perspectives of using this strategy for extensive simulations of lipid bilayers are briefly reviewed. (orig.)

A hybrid MPI/OpenMP parallel version of the XTOR-2F code [Lütjens and Luciani, J. Comput. Phys. 229 (2010) 8130] solving the two-fluid MHD equations in full tokamak geometry by means of an iterative Newton-Krylov matrix-free method has been developed. The present work shows that the code has been parallelized significantly despite the numerical profile of the problem solved by XTOR-2F, i.e. a discretization with pseudo-spectral representations in all angular directions, the stiffness of the two-fluid stability problem in tokamaks, and the use of a direct LU decomposition to invert the physical pre-conditioner at every Krylov iteration of the solver. The execution time of the parallelized version is an order of magnitude smaller than the sequential one for low resolution cases, with an increasing speedup when the discretization mesh is refined. Moreover, it allows to perform simulations with higher resolutions, previously forbidden because of memory limitations.

Research on advanced parallel hardware and software architectures for very high-speed computation deserves and needs more support and attention to fulfil its promise. Novel architectures for parallel processing are being made ready. Architectures for parallel processing can be roughly divided into two groups. One is a vector processor in which a single central processing unit involves multiple vector-arithmetic registers. The other is a processor array in which slave processors are connected to a host processor to perform parallelcomputation. In this review, the concept and data structure of the Adena (alternating-direction edition nexus array) architecture, which is conformable to distributed-parameter simulation algorithms, are described. 5 references.

This report summarizes research conducted at the Institute for Computer Applications in Science and Engineering in applied mathematics, fluid mechanics, and computer science during the period October 1, 1998 through March 31, 1999.

This report summarizes research conducted at the Institute for Computer Applications in Science and Engineering in applied mathematics, fluid mechanics, and computer science during the period April 1, 1995 through September 30, 1995.

We study the flows of a nonuniformly heated weakly conducting fluid in an ac electric field of a horizontal parallel-plate capacitor. Analysis is carried out for fluids in which the charge formation is governed by electroconductive mechanism associated with the temperature dependence of the electrical conductivity of the medium. Periodic and chaotic regimes of fluid flow are investigated in the limiting case of instantaneous charge relaxation and for a finite relaxation time. Bifurcation diagrams and electroconvective regimes charts are constructed. The regions where fluid oscillations synchronize with the frequency of the external field are determined. Hysteretic transitions between electroconvection regimes are studied. The scenarios of transition to chaotic oscillations are analyzed. Depending on the natural frequency of electroconvective system and the external field frequency, the transition from periodic to chaotic oscillations can occur via quasiperiodicity, a subharmonic cascade, or intermittence

We study the flows of a nonuniformly heated weakly conducting fluid in an ac electric field of a horizontal parallel-plate capacitor. Analysis is carried out for fluids in which the charge formation is governed by electroconductive mechanism associated with the temperature dependence of the electrical conductivity of the medium. Periodic and chaotic regimes of fluid flow are investigated in the limiting case of instantaneous charge relaxation and for a finite relaxation time. Bifurcation diagrams and electroconvective regimes charts are constructed. The regions where fluid oscillations synchronize with the frequency of the external field are determined. Hysteretic transitions between electroconvection regimes are studied. The scenarios of transition to chaotic oscillations are analyzed. Depending on the natural frequency of electroconvective system and the external field frequency, the transition from periodic to chaotic oscillations can occur via quasiperiodicity, a subharmonic cascade, or intermittence.

Morphological attribute filters have not previously been parallelized mainly because they are both global and nonseparable. We propose a parallel algorithm that achieves efficient parallelism for a large class of attribute filters, including attribute openings, closings, thinnings, and thickenings,

ComputationalFluid Dynamics (CFD) is an important design tool in engineering and also a substantial research tool in various physical sciences as well as in biology. The objective of this book is to provide university students with a solid foundation for understanding the numerical methods employed in today's CFD and to familiarise them with modern CFD codes by hands-on experience. It is also intended for engineers and scientists starting to work in the field of CFD or for those who apply CFD codes. Due to the detailed index, the text can serve as a reference handbook too. Each chapter includes an extensive bibliography, which provides an excellent basis for further studies. The accompanying companion website contains the sources of 1-D and 2-D Euler and Navier-Stokes flow solvers (structured and unstructured) as well as of grid generators. Provided are also tools for Von Neumann stability analysis of 1-D model equations. Finally, the companion website includes the source code of a dedicated visualisation so...

The potential of computationfluid dynamics (CFD) for conceiving ventilation systems is shown through the simulation of five practical cases. The following examples are considered: capture of pollutants on a surface treating tank equipped with a unilateral suction slot in the presence of a disturbing air draft opposed to suction; dispersion of solid aerosols inside fume cupboards; performances comparison of two general ventilation systems in a silkscreen printing workshop; ventilation of a large open painting area; and oil fog removal inside a mechanical engineering workshop. Whereas the two first problems are analyzed through two dimensional numerical simulations, the three other cases require three dimensional modeling. For the surface treating tank case, numerical results are compared to laboratory experiment data. All simulations are carried out using EOL, a CFD software specially devised to deal with air quality problems in industrial ventilated premises. It contains many analysis tools to interpret the results in terms familiar to the industrial hygienist. Much experimental work has been engaged to validate the predictions of EOL for ventilation flows.

Parallelcomputation of unsteady flows requires significant computational resources. The utilization of a network of workstations seems an efficient solution to the problem where large problems can be treated at a reasonable cost. This approach requires the solution of several problems: 1) the partitioning and distribution of the problem over a network of workstation, 2) efficient communication tools, 3) managing the system efficiently for a given problem. Of course, there is the question of the efficiency of any given numerical algorithm to such a computing system. NPARC code was chosen as a sample for the application. For the explicit version of the NPARC code both two- and three-dimensional problems were studied. Again both steady and unsteady problems were investigated. The issues studied as a part of the research program were: 1) how to distribute the data between the workstations, 2) how to compute and how to communicate at each node efficiently, 3) how to balance the load distribution. In the following, a summary of these activities is presented. Details of the work have been presented and published as referenced.

We introduce a high-performance virtual machine (VM) written in a numerically fast language like Fortran or C to evaluate very large expressions. We discuss the general concept of how to perform computations in terms of a VM and present specifically a VM that is able to compute tree-level cross sections for any number of external legs, given the corresponding byte code from the optimal matrix element generator, O'Mega. Furthermore, this approach allows to formulate the parallelcomputation of a single phase space point in a simple and obvious way. We analyze hereby the scaling behaviour with multiple threads as well as the benefits and drawbacks that are introduced with this method. Our implementation of a VM can run faster than the corresponding native, compiled code for certain processes and compilers, especially for very high multiplicities, and has in general runtimes in the same order of magnitude. By avoiding the tedious compile and link steps, which may fail for source code files of gigabyte sizes, new processes or complex higher order corrections that are currently out of reach could be evaluated with a VM given enough computing power.

Full Text Available As defined by the National Institutes of Health: “Biomedical engineering integrates physical, chemical, mathematical, and computational sciences and engineering principles to study biology, medicine, behavior, and health”. Many issues in this area are closely related to fluid dynamics. This paper provides an overview of the basic concepts concerning ComputationalFluid Dynamics and its applications in medicine.

As a Guest Computational Investigator under the NASA administered component of the High Performance Computing and Communication Program, we implemented a massively parallel genetic algorithm on the MasPar SIMD computer. Experiments were conducted using Earth Science data in the domains of meteorology and oceanography. Results obtained in these domains are competitive with, and in most cases better than, similar problems solved using other methods. In the meteorological domain, we chose to identify clouds using AVHRR spectral data. Four cloud speciations were used although most researchers settle for three. Results were remarkedly consistent across all tests (91% accuracy). Refinements of this method may lead to more timely and complete information for Global Circulation Models (GCMS) that are prevalent in weather forecasting and global environment studies. In the oceanographic domain, we chose to identify ocean currents from a spectrometer having similar characteristics to AVHRR. Here the results were mixed (60% to 80% accuracy). Given that one is willing to run the experiment several times (say 10), then it is acceptable to claim the higher accuracy rating. This problem has never been successfully automated. Therefore, these results are encouraging even though less impressive than the cloud experiment. Successful conclusion of an automated ocean current detection system would impact coastal fishing, naval tactics, and the study of micro-climates. Finally we contributed to the basic knowledge of GA (genetic algorithm) behavior in parallel environments. We developed better knowledge of the use of subpopulations in the context of shared breeding pools and the migration of individuals. Rigorous experiments were conducted based on quantifiable performance criteria. While much of the work confirmed current wisdom, for the first time we were able to submit conclusive evidence. The software developed under this grant was placed in the public domain. An extensive user

Delay systems subject to delayed optical feedback have recently shown great potential in solving computationally hard tasks. By implementing a neuro-inspired computational scheme relying on the transient response to optical data injection, high processing speeds have been demonstrated. However, reservoir computing systems based on delay dynamics discussed in the literature are designed by coupling many different stand-alone components which lead to bulky, lack of long-term stability, non-monolithic systems. Here we numerically investigate the possibility of implementing reservoir computing schemes based on semiconductor ring lasers. Semiconductor ring lasers are semiconductor lasers where the laser cavity consists of a ring-shaped waveguide. SRLs are highly integrable and scalable, making them ideal candidates for key components in photonic integrated circuits. SRLs can generate light in two counterpropagating directions between which bistability has been demonstrated. We demonstrate that two independent machine learning tasks , even with different nature of inputs with different input data signals can be simultaneously computed using a single photonic nonlinear node relying on the parallelism offered by photonics. We illustrate the performance on simultaneous chaotic time series prediction and a classification of the Nonlinear Channel Equalization. We take advantage of different directional modes to process individual tasks. Each directional mode processes one individual task to mitigate possible crosstalk between the tasks. Our results indicate that prediction/classification with errors comparable to the state-of-the-art performance can be obtained even with noise despite the two tasks being computed simultaneously. We also find that a good performance is obtained for both tasks for a broad range of the parameters. The results are discussed in detail in [Nguimdo et al., IEEE Trans. Neural Netw. Learn. Syst. 26, pp. 3301-3307, 2015

The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallelcomputer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.

This newsletter reports on computational works performed by the National Laboratory of Hydraulics (LNH) from Electricite de France (EdF). Two papers were selected which concern the simulation of the Paluel nuclear power plant plume and the computation of particles and droplets inside a cooling tower. (J.S.)

Methods, systems, and products are disclosed for pacing a data transfer between compute nodes on a parallelcomputer that include: transferring, by an origin compute node, a chunk of an application message to a target compute node; sending, by the origin compute node, a pacing request to a target direct memory access (`DMA`) engine on the target compute node using a remote get DMA operation; determining, by the origin compute node, whether a pacing response to the pacing request has been received from the target DMA engine; and transferring, by the origin compute node, a next chunk of the application message if the pacing response to the pacing request has been received from the target DMA engine.

During the last three decades, breakthroughs in computer technology have made a tremendous impact on optimization. In particular, parallelcomputing has made it possible to solve larger and computationally more difficult prob­ lems. This volume contains mainly lecture notes from a Nordic Summer School held at the Linkoping Institute of Technology, Sweden in August 1995. In order to make the book more complete, a few authors were invited to contribute chapters that were not part of the course on this first occasion. The purpose of this Nordic course in advanced studies was three-fold. One goal was to introduce the students to the new achievements in a new and very active field, bring them close to world leading researchers, and strengthen their competence in an area with internationally explosive rate of growth. A second goal was to strengthen the bonds between students from different Nordic countries, and to encourage collaboration and joint research ventures over the borders. In this respect, the course bui...

Karlsruhe Institute of Technology (KIT) is developing the parallelcomputationalfluid dynamics code GASFLOW-MPI as a best-estimate tool for predicting transport, mixing, and combustion of hydrogen and other gases in nuclear reactor containments and other facility buildings. GASFLOW-MPI is a finite-volume code based on proven computationalfluid dynamics methodology that solves the compressible Navier-Stokes equations for three-dimensional volumes in Cartesian or cylindrical coordinates.

In this Technical Note, we present a family of Jacobi-based multigrid smoothers suitable for the solution of discretized elliptic equations. These smoothers are based on the idea of scheduled-relaxation Jacobi proposed recently by Yang & Mittal (2014) [18] and employ two or three successive relaxed Jacobi iterations with relaxation factors derived so as to maximize the smoothing property of these iterations. The performance of these new smoothers measured in terms of convergence acceleration and computational workload, is assessed for multi-domain implementations typical of parallelized solvers, and compared to the lexicographic point Gauss-Seidel smoother. The tests include the geometric multigrid method on structured grids as well as the algebraic grid method on unstructured grids. The tests demonstrate that unlike Gauss-Seidel, the convergence of these Jacobi-based smoothers is unaffected by domain decomposition, and furthermore, they outperform the lexicographic Gauss-Seidel by factors that increase with domain partition count.

Optimizing collective operations using direct memory access controller on a parallelcomputer, in one aspect, may comprise establishing a byte counter associated with a direct memory access controller for each submessage in a message. The byte counter includes at least a base address of memory and a byte count associated with a submessage. A byte counter associated with a submessage is monitored to determine whether at least a block of data of the submessage has been received. The block of data has a predetermined size, for example, a number of bytes. The block is processed when the block has been fully received, for example, when the byte count indicates all bytes of the block have been received. The monitoring and processing may continue for all blocks in all submessages in the message.

The book begins with an overview of the phase diagrams of fluid mixtures (fluid = liquid, gas, or supercritical state), which can show an astonishing variety when elevated pressures are taken into account; phenomena like retrograde condensation (single and double) and azeotropy (normal and double) are discussed. It then gives an introduction into the relevant thermodynamic equations for fluid mixtures, including some that are rarely found in modern textbooks, and shows how they can they be used to compute phase diagrams and related properties. This chapter gives a consistent and axiomatic approach to fluid thermodynamics; it avoids using activity coefficients. Further chapters are dedicated to solid-fluid phase equilibria and global phase diagrams (systematic search for phase diagram classes). The appendix contains numerical algorithms needed for the computations. The book thus enables the reader to create or improve computer programs for the calculation of fluid phase diagrams. introduces phase diagram class...

ComputationalFluid Dynamics (CFD) is the branch of Fluid Mechanics and Computational Physics that plays a decent role in modern Mechanical Engineering Design process due to such advantages as relatively low cost of simulation comparing with conduction of real experiment, an opportunity to easily correct the design of a prototype prior to manufacturing of the final product and a wide range of application: mixing, acoustics, cooling and aerodynamics. This makes CFD particularly and Computation...

The realism of astrophysical simulations and statistical analyses of astronomical data are set by the available computational resources. Thus, astronomers and astrophysicists are constantly pushing the limits of computational capabilities. For decades, astronomers benefited from massive improvements in computational power that were driven primarily by increasing clock speeds and required relatively little attention to details of the computational hardware. For nearly a decade, increases in computational capabilities have come primarily from increasing the degree of parallelism, rather than increasing clock speeds. Further increases in computational capabilities will likely be led by many-core architectures such as Graphical Processing Units (GPUs) and Intel Xeon Phi. Successfully harnessing these new architectures, requires significantly more understanding of the hardware architecture, cache hierarchy, compiler capabilities and network network characteristics.I will provide an astronomer's overview of the opportunities and challenges provided by modern many-core architectures and elastic cloud computing. The primary goal is to help an astronomical audience understand what types of problems are likely to yield more than order of magnitude speed-ups and which problems are unlikely to parallelize sufficiently efficiently to be worth the development time and/or costs.I will draw on my experience leading a team in developing the Swarm-NG library for parallel integration of large ensembles of small n-body systems on GPUs, as well as several smaller software projects. I will share lessons learned from collaborating with computer scientists, including both technical and soft skills. Finally, I will discuss the challenges of training the next generation of astronomers to be proficient in this new era of high-performance computing, drawing on experience teaching a graduate class on High-Performance Scientific Computing for Astrophysics and organizing a 2014 advanced summer

Full Text Available In the present study, second law analysis is introduced for circular cylinder confined between parallel planes. An analytical approach is adopted to study the effects of block age, Reynolds and Prandtl numbers on the entropy generation due to the laminar flow and heat transfer. Four different fluids are considered in the present analysis for comparison purposes. Heat transfer for the cylinder at an isothermal boundary condition is incorporated. In general, the entropy generation rate decreases as the blockage ratio decreases. In addition, the entropy generation rate increases with increasing Reynolds and Prandtl numbers. At a fixed Reynolds number, the effect of block age becomes more notice able for higher Prandtl number fluid. Similarly, for the same fluid, the effect of block age becomes more no tice able as the Reynolds number increases.

Computed Tomography (CT or CAT scan) is a noninvasive diagnostic imaging procedure that uses a combination of X-rays and computer technology to produce horizontal, or axial, images (often called slices) of the body. A CT scan shows detailed images of any part of the body, including the bones muscles, fat and organs CT scans are more detailed than standard x-rays. CT scans may be done with or without "contrast Contrast refers to a substance taken by mouth and/ or injected into an intravenous (IV) line that causes the particular organ or tissue under study to be seen more clearly. CT scan of the liver and biliary tract are used in the diagnosis of many diseases in the abdomen structures, particularly when another type of examination, such as X-rays, physical examination, and ultra sound is not conclusive. Unfortunately, the presence of noise and artifact in the edges and fine details in the CT images limit the contrast resolution and make diagnostic procedure more difficult. This experimental study was conducted at the College of Medical Radiological Science, Sudan University of Science and Technology and Fidel Specialist Hospital. The sample of study was included 50 patients. The main objective of this research was to study an accurate liver segmentation method using a parallelcomputing algorithm, and to segment liver and adjacent organs using image processing technique. The main technique of segmentation used in this study was watershed transform. The scope of image processing and analysis applied to medical application is to improve the quality of the acquired image and extract quantitative information from medical image data in an efficient and accurate way. The results of this technique agreed wit the results of Jarritt et al, (2010), Kratchwil et al, (2010), Jover et al, (2011), Yomamoto et al, (1996), Cai et al (1999), Saudha and Jayashree (2010) who used different segmentation filtering based on the methods of enhancing the computed tomography images. Anther

In the last decade, computer technology has evolved rapidly. Modern high performance computing systems offer a tremendous amount of computing power in the range of a few peta floating point operations per second. In contrast, numerical software development is much slower and most existing simulation codes cannot exploit the full computing power of these systems. Partially, this is due to the numerical methods themselves and partially it is related to bottlenecks within the parallelization concept and its data structures. The goal of the thesis is the development of numerical algorithms and corresponding data structures to remedy both kinds of parallelization bottlenecks. The approach is based on a co-design of the numerical schemes (including numerical analysis) and their realizations in algorithms and software. Various kinds of applications, from multicomponent flows (Lattice Boltzmann Method) to electrodynamics (Discontinuous Galerkin Method) to embedded geometries (Octree), are considered and efficiency of the developed approaches is demonstrated for large scale simulations.

A general formulation allowing computation of structure vibrations in a dense fluid is described. It is based on fluid modelisation by fluid finite elements. For each fluid node are associated two variables: the pressure p and a variable π defined as p=d 2 π/dt 2 . Coupling between structure and fluid is introduced by surface elements. This method is easy to introduce in a general finite element code. Validation was obtained by analytical calculus and tests. It is widely used for vibrational and seismic studies of pipes and internals of nuclear reactors some applications are presented [fr

This textbook presents numerical solution techniques for incompressible turbulent flows that occur in a variety of scientific and engineering settings including aerodynamics of ground-based vehicles and low-speed aircraft, fluid flows in energy systems, atmospheric flows, and biological flows. This book encompasses fluid mechanics, partial differential equations, numerical methods, and turbulence models, and emphasizes the foundation on how the governing partial differential equations for incompressible fluid flow can be solved numerically in an accurate and efficient manner. Extensive discussions on incompressible flow solvers and turbulence modeling are also offered. This text is an ideal instructional resource and reference for students, research scientists, and professional engineers interested in analyzing fluid flows using numerical simulations for fundamental research and industrial applications. • Introduces CFD techniques for incompressible flow and turbulence with a comprehensive approach; • Enr...

Field data-based simulations of geologic systems require much computational time because of their mathematical complexity and the often desired large scales in space and time. To conduct accurate simulations in an acceptable time period, methods to reduce runtime are required. A parallelization

The Transient Reactor Analysis Code (TRAC), which features a two-fluid treatment of thermal-hydraulics, is designed to model transients in water reactors and related facilities. One of the major computational costs associated with TRAC and similar codes is calculating constitutive coefficients. Although the formulations for these coefficients are local, the costs are flow-regime- or data-dependent; i.e., the computations needed for a given spatial node often vary widely as a function of time. Consequently, a fixed, uniform assignment of nodes to prallel processors will result in degraded computational efficiency due to the poor load balancing. A standard method for treating data-dependent models on vector architectures has been to use gather operations (or indirect adressing) to sort the nodes into subsets that (temporarily) share a common computational model. However, this method is not effective on distributed memory data parallel architectures, where indirect adressing involves expensive communication overhead. Another serious problem with this method involves software engineering challenges in the areas of maintainability and extensibility. For example, an implementation that was hand-tuned to achieve good computational efficiency would have to be rewritten whenever the decision tree governing the sorting was modified. Using an example based on the calculation of the wall-to-liquid and wall-to-vapor heat-transfer coefficients for three nonboiling flow regimes, we describe how the use of the Fortran 90 WHERE construct and automatic inlining of functions can be used to ameliorate this problem while improving both efficiency and software engineering. Unfortunately, a general automatic solution to the load-balancing problem associated with data-dependent computations is not yet available for massively parallel architectures. We discuss why developers should either wait for such solutions or consider alternative numerical algorithms, such as a neural network

Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallelcomputation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallelcomputation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallelcomputations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.

ComputationalFluid Dynamics (CFD) is an analysis of fluid flow, heat transfer and associated phenomena in physical systems using computers. CFD has been used at CERN since 1993 by the TS-CV group, to solve thermo-fluid related problems, particularly during the development, design and construction phases of the LHC experiments. Computer models based on CFD techniques can be employed to reduce the effort required for prototype testing, saving not only time and money but offering possibilities of additional investigations and design optimisation. The development of a more efficient support team at CERN depends on to two important factors: available computing power and experienced engineers. Available computer power IS the limiting resource of CFD. Only the recent increase of computer power had allowed important high tech and industrial applications. Computer Grid is already now (OpenLab at CERN) and will be more so in the future natural environment for CFD science. At CERN, CFD activities have been developed by...

This paper considers an algebraic preconditioning algorithm for hyperbolic-elliptic fluid flow problems. The algorithm is based on a parallel non-overlapping Schur complement domain-decomposition technique for triangulated domains. In the Schur complement technique, the triangulation is first partitioned into a number of non-overlapping subdomains and interfaces. This suggests a reordering of triangulation vertices which separates subdomain and interface solution unknowns. The reordering induces a natural 2 x 2 block partitioning of the discretization matrix. Exact LU factorization of this block system yields a Schur complement matrix which couples subdomains and the interface together. The remaining sections of this paper present a family of approximate techniques for both constructing and applying the Schur complement as a domain-decomposition preconditioner. The approximate Schur complement serves as an algebraic coarse space operator, thus avoiding the known difficulties associated with the direct formation of a coarse space discretization. In developing Schur complement approximations, particular attention has been given to improving sequential and parallel efficiency of implementations without significantly degrading the quality of the preconditioner. A computer code based on these developments has been tested on the IBM SP2 using MPI message passing protocol. A number of 2-D calculations are presented for both scalar advection-diffusion equations as well as the Euler equations governing compressible fluid flow to demonstrate performance of the preconditioning algorithm.

Algorithms for assembling in parallel the sparse system of linear equations that result from finite difference or finite element discretizations of elliptic partial differential equations, such as those that arise in structural engineering are developed. Parallel linear stationary iterative algorithms and parallel preconditioned conjugate gradient algorithms are developed for solving these systems. In addition, a model for comparing parallel algorithms on array architectures is developed and results of this model for the algorithms are given.

Large scale scientific computing raises questions on different levels ranging from the fomulation of the problems to the choice of the best algorithms and their implementation for a specific platform. There are similarities in these different topics that can be exploited by modern-style C++ template metaprogramming techniques to produce readable, maintainable and generic code. Traditional low-level code tend to be fast but platform-dependent, and it obfuscates the meaning of the algorithm. On the other hand, object-oriented approach is nice to read, but may come with an inherent performance penalty. These lectures aim to present he basics of the Expression Template (ET) idiom which allows us to keep the object-oriented approach without sacrificing performance. We will in particular show to to enhance ET to include SIMD vectorization. We will then introduce techniques for abstracting iteration, and introduce thread-level parallelism for use in heavy data-centric loads. We will show to to apply these methods i...

Full Text Available The architecture of a digital computing system determines the technical foundation of a unified mathematical language for exact arithmetic-logical description of phenomena and laws of continuum mechanics for applications in fluid mechanics and theoretical physics. The deep parallelization of the computing processes results in functional programming at a new technological level, providing traceability of the computing processes with automatic application of multiscale hybrid circuits and adaptive mathematical models for the true reproduction of the fundamental laws of physics and continuum mechanics.

Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array was also developed, but at significantly higher efficiency.

As computing power increases exponentially, vast amount of data is created by many scientific re- search activities. However, the bandwidth for storing the data to disks and reading the data from disks has been improving at a much slower pace. These two trends produce an ever-widening data access gap. Our work brings together two distinct technologies to address this data access issue: indexing and in situ processing. From decades of database research literature, we know that indexing is an effective way to address the data access issue, particularly for accessing relatively small fraction of data records. As data sets increase in sizes, more and more analysts need to use selective data access, which makes indexing an even more important for improving data access. The challenge is that most implementations of in- dexing technology are embedded in large database management systems (DBMS), but most scientific datasets are not managed by any DBMS. In this work, we choose to include indexes with the scientific data instead of requiring the data to be loaded into a DBMS. We use compressed bitmap indexes from the FastBit software which are known to be highly effective for query-intensive workloads common to scientific data analysis. To use the indexes, we need to build them first. The index building procedure needs to access the whole data set and may also require a significant amount of compute time. In this work, we adapt the in situ processing technology to generate the indexes, thus removing the need of read- ing data from disks and to build indexes in parallel. The in situ data processing system used is ADIOS, a middleware for high-performance I/O. Our experimental results show that the indexes can improve the data access time up to 200 times depending on the fraction of data selected, and using in situ data processing system can effectively reduce the time needed to create the indexes, up to 10 times with our in situ technique when using identical parallel settings.

Momentum transfer between an ionized gas cloud moving relative to an ambient magnetized plasma is a general problem in space plasma physics. Obvious examples include the Io-Jupiter interaction, comets, and coronal mass ejections. Active plasma experiments have demonstrated that momentum transfer rates associated with Alfven wave propagation are poorly understood. Barium injection experiments from the Combined Release and Radiation Effects Satellite (CRRES) have shown that dense ionized clouds are capable of ExB drifting over large distances perpendicular to the magnetic field. The CRRES 'skidding' distances were much larger than predicted by magnetohydrodynamic theory and it has been proposed that parallel electric fields were a key component in the skidding phenomenon. A two-fluid code was used to demonstrate the role of parallel electric fields in reducing momentum transfer between two distinct plasma populations. In this study, a dense plasma was initialized moving relative to an ambient plasma and perpendicular to B. Parallel electric fields were introduced via a friction term in the electron momentum equation and the collision frequency was scaled in proportion to the field-aligned current density. The simulation results showed that parallel electric fields decreased the decelerating magnetic tension force on the plasma cloud through a magnetic diffusion/reconnection process

In a parallelcomputer, a plurality of logical planes formed of compute nodes of a subcommunicator may be identified by: for each compute node of the subcommunicator and for a number of dimensions beginning with a first dimension: establishing, by a plane building node, in a positive direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in a positive direction of a second dimension, where the second dimension is orthogonal to the first dimension; and establishing, by the plane building node, in a negative direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in the positive direction of the second dimension.

Large scale scientific computing raises questions on different levels ranging from the fomulation of the problems to the choice of the best algorithms and their implementation for a specific platform. There are similarities in these different topics that can be exploited by modern-style C++ template metaprogramming techniques to produce readable, maintainable and generic code. Traditional low-level code tend to be fast but platform-dependent, and it obfuscates the meaning of the algorithm. On the other hand, object-oriented approach is nice to read, but may come with an inherent performance penalty. These lectures aim to present he basics of the Expression Template (ET) idiom which allows us to keep the object-oriented approach without sacrificing performance. We will in particular show to to enhance ET to include SIMD vectorization. We will then introduce techniques for abstracting iteration, and introduce thread-level parallelism for use in heavy data-centric loads. We will show to to apply these methods i...

There is a growing trend in nuclear reactor simulation to consider multiphysics problems. This can be seen in reactor analysis where analysts are interested in coupled flow, heat transfer and neutronics, and in fuel performance simulation where analysts are interested in thermomechanics with contact coupled to species transport and chemistry. These more ambitious simulations usually motivate some level of parallelcomputing. Many of the coupling efforts to date utilize simple code coupling or first-order operator splitting, often referred to as loose coupling. While these approaches can produce answers, they usually leave questions of accuracy and stability unanswered. Additionally, the different physics often reside on separate grids which are coupled via simple interpolation, again leaving open questions of stability and accuracy. Utilizing state of the art mathematics and software development techniques we are deploying next generation tools for nuclear engineering applications. The Jacobian-free Newton-Krylov (JFNK) method combined with physics-based preconditioning provide the underlying mathematical structure for our tools. JFNK is understood to be a modern multiphysics algorithm, but we are also utilizing its unique properties as a scale bridging algorithm. To facilitate rapid development of multiphysics applications we have developed the Multiphysics Object-Oriented Simulation Environment (MOOSE). Examples from two MOOSE-based applications: PRONGHORN, our multiphysics gas cooled reactor simulation tool and BISON, our multiphysics, multiscale fuel performance simulation tool will be presented.

Full Text Available One of the challenges in programming distributed memory parallel machines is deciding how to allocate work to processors. This problem is particularly important for computations with unpredictable dynamic behaviors or irregular structures. We present a scheme for dynamic scheduling of medium-grained processes that is useful in this context. The adaptive contracting within neighborhood (ACWN is a dynamic, distributed, load-dependent, and scalable scheme. It deals with dynamic and unpredictable creation of processes and adapts to different systems. The scheme is described and contrasted with two other schemes that have been proposed in this context, namely the randomized allocation and the gradient model. The performance of the three schemes on an Intel iPSC/2 hypercube is presented and analyzed. The experimental results show that even though the ACWN algorithm incurs somewhat larger overhead than the randomized allocation, it achieves better performance in most cases due to its adaptiveness. Its feature of quickly spreading the work helps it outperform the gradient model in performance and scalability.

There are many instances involving liquid/gas interfaces and their dynamics in the design of liquid engine powered rockets such as the Space Launch System (SLS). Some examples of these applications are: Propellant tank draining and slosh, subcritical condition injector analysis for gas generators, preburners and thrust chambers, water deluge mitigation for launch induced environments and even solid rocket motor liquid slag dynamics. Commercially available CFD programs simulating gas/liquid interfaces using the Volume of Fluid approach are currently limited in their parallel scalability. In 2010 for instance, an internal NASA/MSFC review of three commercial tools revealed that parallel scalability was seriously compromised at 8 cpus and no additional speedup was possible after 32 cpus. Other non-interface CFD applications at the time were demonstrating useful parallel scalability up to 4,096 processors or more. Based on this review, NASA/MSFC initiated an effort to implement a Volume of Fluid implementation within the unstructured mesh, pressure-based algorithm CFD program, Loci-STREAM. After verification was achieved by comparing results to the commercial CFD program CFD-Ace+, and validation by direct comparison with data, Loci-STREAM-VoF is now the production CFD tool for propellant slosh force and slosh damping rate simulations at NASA/MSFC. On these applications, good parallel scalability has been demonstrated for problems sizes of tens of millions of cells and thousands of cpu cores. Ongoing efforts are focused on the application of Loci-STREAM-VoF to predict the transient flow patterns of water on the SLS Mobile Launch Platform in order to support the phasing of water for launch environment mitigation so that vehicle determinantal effects are not realized.

In this paper the parallelcomputing of a grid-point nine-level atmospheric general circulation model on the Dawn 1000 is introduced. The model was developed by the Institute of Atmospheric Physics (IAP), Chinese Academy of Sciences (CAS). The Dawn 1000 is a MIMD massive parallelcomputer made by National Research Center for Intelligent Computer (NCIC), CAS. A two-dimensional domain decomposition method is adopted to perform the parallelcomputing. The potential ways to increase the speed-up ratio and exploit more resources of future massively parallel supercomputation are also discussed.

ParallelComputing for Data Science: With Examples in R, C++ and CUDA is one of the first parallelcomputing books to concentrate exclusively on parallel data structures, algorithms, software tools, and applications in data science. It includes examples not only from the classic ""n observations, p variables"" matrix format but also from time series, network graph models, and numerous other structures common in data science. The examples illustrate the range of issues encountered in parallel programming.With the main focus on computation, the book shows how to compute on three types of platfor

The basic characteristics of the flow in a disc type non-equilibrium MHD power generator were studied. The flow of conductive fluid between parallel disks in an axial magnetic field was analyzed as the subsonic MHD turbulent approach flow of viscous compressible fluid, taking the electron temperature dependence of conductivity into account. The equations for the flow between disks are described by ordinary electromagnetic hydrodynamic approximation. Practical numerical calculation was performed for the non-equilibrium argon plasma seeded with potassium. The effects of the variation of characteristics of non-equilibrium plasma in main flow and boundary layer on the flow characteristics became clear. The qualitative tendency of the properties of MHD generators can be well explained. (Kato, T.)

The motion of charged suspended particle in a non-Newtonian fluid between two long parallel plates is discussed. The equation of motion of a suspended particle was suggested by Closkin. The equations of motion are reduced to ordinary differential equations by similarity transformations and solved numerically by using the Runge-Kutta method. The trajectories of particles are calculated by integrating the equation of motion of a single particle. The present simulation requires some empirical parameters concerning the collision of the particles with the wall. The effects of solid particles on flow properties are discussed. Some typical results for both fluid and particle phases and density distributions of the particles are presented graphically

The motion of charged suspended particle in a non-Newtonian fluid between two long parallel plates is discussed. The equation of motion of a suspended particle was suggested by Closkin. The equations of motion are reduced to ordinary differential equations by similarity transformation and solved numerically by using Runge-Kutta method. The trajectories of particles are calculated by integrating the equation of motion of a single particle. The present simulation requires some empirical parameters concerning the collision of the particles with the wall. The effect of solid particles on flow properties are discussed. Some typical results for both fluid and particle phases and density distributions of the particles are presented graphically. 4 figs.

A new algorithm for calculating magnetic fields in a concentrated magnetic fluid with inhomogeneous density is proposed. Inhomogeneity of the fluid is caused by magnetophoresis. In this case, the diffusion and magnetostatic parts of the problem are tightly linked together and are solved jointly. The dynamic diffusion equation is solved by the finite volume method and, to calculate the magnetic field inside the fluid, an iterative process is performed in parallel. The solution to the problem is sought in Cartesian coordinates, and the computational domain is decomposed into rectangular elements. This technique eliminates the need to solve the related boundary-value problem for magnetic fields, accelerates computations and eliminates the error caused by the finite sizes of the outer region. Formulas describing the contribution of the rectangular element to the field intensity in the case of a plane problem are given. Magnetic and concentration fields inside the magnetic fluid filling a rectangular cavity generated under the action of the uniform external filed are calculated. - Highlights: ▶ New algorithm for calculating magnetic field intense magnetic fluid with account of magnetophoresis and diffusion of particles. ▶ We do not need to solve boundary-value problem, but we accelerate computations and eliminate some errors. ▶ We solve nonlinear flow equation by the finite volume method and calculate magnetic and focus fields in the fluid for plane case.

Endpoint-based parallel data processing with non-blocking collective instructions in a PAMI of a parallelcomputer is disclosed. The PAMI is composed of data communications endpoints, each including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task. The compute nodes are coupled for data communications through the PAMI. The parallel application establishes a data communications geometry specifying a set of endpoints that are used in collective operations of the PAMI by associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry; registering in each endpoint in the geometry a dispatch callback function for a collective operation; and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.

Ever since computers were first used for scientific and numerical work, there has existed an "arms race" between the technical development of faster computing hardware, and the desires of scientists to solve larger problems in shorter time-scales. However, the vast leaps in processor performance achieved through advances in semi-conductor science have reached a hiatus as the technology comes up against the physical limits of the speed of light and quantum effects. This has lead all high performance computer manufacturers to turn towards a parallel architecture for their new machines. In these lectures we will introduce the history and concepts behind parallelcomputing, and review the various parallel architectures and software environments currently available. We will then introduce programming methodologies that allow efficient exploitation of parallel machines, and present case studies of the parallelization of typical High Energy Physics codes for the two main classes of parallelcomputing architecture (S...

Demonstrating speedup for parallel code on a multicore shared memory PC can be challenging in MATLAB due to underlying parallel operations that are often opaque to the user. This can limit potential for improvement of serial code even for the so-called embarrassingly parallel applications. One such application is the computation of the Jacobian matrix inherent to most nonlinear equation solvers. Computation of this matrix represents the primary bottleneck in nonlinear solver speed such that commercial finite element (FE) and multi-body-dynamic (MBD) codes attempt to minimize computations. A timing study using MATLAB's ParallelComputing Toolbox was performed for numerical computation of the Jacobian. Several approaches for implementing parallel code were investigated while only the single program multiple data (spmd) method using composite objects provided positive results. Parallel code speedup is demonstrated but the goal of linear speedup through the addition of processors was not achieved due to PC architecture.

In this paper, a FEM-based (finite element method) mesh free method with a probabilistic node generation technique is presented. In the proposed method, all computational procedures, from the mesh generation to the solution of a system of equations, can be performed fluently in parallel in terms of nodes. Local finite element mesh is generated robustly around each node, even for harsh boundary shapes such as cracks. The algorithm and the data structure of finite element calculation are based on nodes, and parallelcomputing is realized by dividing a system of equations by the row of the global coefficient matrix. In addition, the node-based finite element method is accompanied by a probabilistic node generation technique, which generates good-natured points for nodes of finite element mesh. Furthermore, the probabilistic node generation technique can be performed in parallel environments. As a numerical example of the proposed method, we perform a compressible flow simulation containing strong shocks. Numerical simulations with frequent mesh refinement, which are required for such kind of analysis, can effectively be performed on parallel processors by using the proposed method. (authors)

In this paper, a FEM-based (finite element method) mesh free method with a probabilistic node generation technique is presented. In the proposed method, all computational procedures, from the mesh generation to the solution of a system of equations, can be performed fluently in parallel in terms of nodes. Local finite element mesh is generated robustly around each node, even for harsh boundary shapes such as cracks. The algorithm and the data structure of finite element calculation are based on nodes, and parallelcomputing is realized by dividing a system of equations by the row of the global coefficient matrix. In addition, the node-based finite element method is accompanied by a probabilistic node generation technique, which generates good-natured points for nodes of finite element mesh. Furthermore, the probabilistic node generation technique can be performed in parallel environments. As a numerical example of the proposed method, we perform a compressible flow simulation containing strong shocks. Numerical simulations with frequent mesh refinement, which are required for such kind of analysis, can effectively be performed on parallel processors by using the proposed method. (authors)

In Japan Atomic Energy Agency (JAEA), the Innovative Water Reactor for Flexible Fuel Cycle (FLWR) has been developed. For thermal design of FLWR, it is necessary to develop analytical method to predict boiling transition of FLWR. Japan Atomic Energy Agency (JAEA) has been developing three-dimensional two-fluid model analysis code ACE-3D, which adopts boundary fitted coordinate system to simulate complex shape channel flow. In this paper, as a part of development of ACE-3D to apply to rod bundle analysis, introduction of parallelization to ACE-3D and assessments of ACE-3D are shown. In analysis of large-scale domain such as a rod bundle, even two-fluid model requires large number of computational cost, which exceeds upper limit of memory amount of 1 CPU. Therefore, parallelization was introduced to ACE-3D to divide data amount for analysis of large-scale domain among large number of CPUs, and it is confirmed that analysis of large-scale domain such as a rod bundle can be performed by parallelcomputation with keeping parallelcomputation performance even using large number of CPUs. ACE-3D adopts two-phase flow models, some of which are dependent upon channel geometry. Therefore, analyses in the domains, which simulate individual subchannel and 37 rod bundle, are performed, and compared with experiments. It is confirmed that the results obtained by both analyses using ACE-3D show agreement with past experimental result qualitatively. (author)

Paving is an automated mesh generation algorithm which produces all-quadrilateral elements. It can additionally generate these elements in varying sizes such that the resulting mesh adapts to a function distribution, such as an error function. While powerful, conventional paving is a very serial algorithm in its operation. Parallel paving is the extension of serial paving into parallel environments to perform the same meshing functions as conventional paving only on distributed, discretized models. This extension allows large, adaptive, parallel finite element simulations to take advantage of paving`s meshing capabilities for h-remap remeshing. A significantly modified version of the CUBIT mesh generation code has been developed to host the parallel paving algorithm and demonstrate its capabilities on both two dimensional and three dimensional surface geometries and compare the resulting parallel produced meshes to conventionally paved meshes for mesh quality and algorithm performance. Sandia`s {open_quotes}tiling{close_quotes} dynamic load balancing code has also been extended to work with the paving algorithm to retain parallel efficiency as subdomains undergo iterative mesh refinement.

In order to improve the accuracy and calculating speed of shielding analyses, MCNP 4, a Monte Carlo neutron and photon transport code system, has been parallelized and measured of its efficiency in the highly parallel distributed memory type computer, AP1000. The code has been analyzed statically and dynamically, then the suitable algorithm for parallelization has been determined for the shielding analysis functions of MCNP 4. This includes a strategy where a new history is assigned to the idling processor element dynamically during the execution. Furthermore, to avoid the congestion of communicative processing, the batch concept, processing multi-histories by a unit, has been introduced. By analyzing a sample cask problem with 2,000,000 histories by the AP1000 with 512 processor elements, the 82 % of parallelization efficiency is achieved, and the calculational speed has been estimated to be around 50 times as fast as that of FACOM M-780. (author)

One of Sandia`s research efforts is to reduce the end-to-end communication delay in a parallel-distributed computing environment. GIGAswitch is DEC`s implementation of a gigabit local area network based on switched FDDI technology. Using the GIGAswitch, the authors intend to minimize the medium access latency suffered by shared-medium FDDI technology. Experimental results show that the GIGAswitch adds 16.5 microseconds of switching and bridging delay to an end-to-end communication. Although the added latency causes a 1.8% throughput degradation and a 5% line efficiency degradation, the availability of dedicated bandwidth is much more than what is available to a workstation on a shared medium. For example, ten directly connected workstations each would have a dedicated bandwidth of 95 Mbps, but if they were sharing the FDDI bandwidth, each would have 10% of the total bandwidth, i.e., less than 10 Mbps. In addition, they have found that when there is no output port contention, the switch`s aggregate bandwidth will scale up to multiples of its port bandwidth. However, with output port contention, the throughput and latency performance suffered significantly. Their mathematical and simulation models indicate that the GIGAswitch line efficiency could be as low as 63% when there are nine input ports contending for the same output port. The data indicate that the delay introduced by contention at the server workstation is 50 times that introduced by the GIGAswitch. The authors conclude that the GIGAswitch meets the performance requirements of today`s high-end workstations and that the switched FDDI technology provides an alternative that utilizes existing workstation interfaces while increasing the aggregate bandwidth. However, because the speed of workstations is increasing by a factor of 2 every 1.5 years, the switched FDDI technology is only good as an interim solution.

The fluid dynamics research is strongly influenced by the increasing computer power which has been available for the last decades. This development is obvious from the curve in figure 1 which shows the computation cost as a function of years. It is obvious that the cost for a given job will decre...

Quinoa is a set of computational tools that enables research and numerical analysis in fluid dynamics. At this time it remains a test-bed to experiment with various algorithms using fully asynchronous runtime systems. Currently, Quinoa consists of the following tools: (1) Walker, a numerical integrator for systems of stochastic differential equations in time. It is a mathematical tool to analyze and design the behavior of stochastic differential equations. It allows the estimation of arbitrary coupled statistics and probability density functions and is currently used for the design of statistical moment approximations for multiple mixing materials in variable-density turbulence. (2) Inciter, an overdecomposition-aware finite element field solver for partial differential equations using 3D unstructured grids. Inciter is used to research asynchronous mesh-based algorithms and to experiment with coupling asynchronous to bulk-synchronous parallel code. Two planned new features of Inciter, compared to the previous release (LA-CC-16-015), to be implemented in 2017, are (a) a simple Navier-Stokes solver for ideal single-material compressible gases, and (b) solution-adaptive mesh refinement (AMR), which enables dynamically concentrating compute resources to regions with interesting physics. Using the NS-AMR problem we plan to explore how to scale such high-load-imbalance simulations, representative of large production multiphysics codes, to very large problems on very large computers using an asynchronous runtime system. (3) RNGTest, a test harness to subject random number generators to stringent statistical tests enabling quantitative ranking with respect to their quality and computational cost. (4) UnitTest, a unit test harness, running hundreds of tests per second, capable of testing serial, synchronous, and asynchronous functions. (5) MeshConv, a mesh file converter that can be used to convert 3D tetrahedron meshes from and to either of the following formats: Gmsh

This new book builds on the original classic textbook entitled: An Introduction to ComputationalFluid Mechanics by C. Y. Chow which was originally published in 1979. In the decades that have passed since this book was published the field of computationalfluid dynamics has seen a number of changes in both the sophistication of the algorithms used but also advances in the computer hardware and software available. This new book incorporates the latest algorithms in the solution techniques and supports this by using numerous examples of applications to a broad range of industries from mechanical

Issues relating to implementing iterative procedures, for numerical solution of elliptic partial differential equations, on a distributed parallelcomputing system are discussed. Preliminary investigations show that a speed-up of about 3.85 is achievable on a four transputer pipeline network. (author). 2 figs., 3 a ppendixes., 7 refs

We have constructed a parallel cluster consisting of a mixture of Apple Macintosh G3 and G4 computers running the Mac OS, and have achieved very good performance on numerically intensive, parallel plasma particle-incell simulations. A subset of the MPI message-passing library was implemented in Fortran77 and C. This library enabled us to port code, without modification, from other parallel processors to the Macintosh cluster. Unlike Unix-based clusters, no special expertise in operating systems is required to build and run the cluster. This enables us to move parallelcomputing from the realm of experts to the main stream of computing.

This HDR is dedicated to the research in the framework of fast transient dynamics for industrial fluid-structure systems carried in the Laboratory of Dynamic Studies from CEA, implementing new numerical methods for the modelling of complex systems and the parallel solution of large coupled problems on supercomputers. One key issue for the proposed approaches is the limitation to its minimum of the number of non-physical parameters, to cope with constraints arising from the area of usage of the concepts: safety for both nuclear applications (CEA, EDF) and aeronautics (ONERA), protection of the citizen (EC/JRC) in particular. Kinematic constraints strongly coupling structures (namely through unilateral contact) or fluid and structures (with both conformant or non-conformant meshes depending on the geometrical situation) are handled through exact methods including Lagrange Multipliers, with consequences on the solution strategy to be dealt with. This latter aspect makes EPX, the simulation code where the methods are integrated, a singular tool in the community of fast transient dynamics software. The document mainly relies on a description of the modelling needs for industrial fast transient scenarios, for nuclear applications in particular, and the proposed solutions built in the framework of the collaboration between CEA, EDF (via the LaMSID laboratory) and the LaMCoS laboratory from INSA Lyon. The main considered examples are the tearing of the fluid-filled tank after impact, the Code Disruptive Accident for a Generation IV reactor or the ruin of reinforced concrete structures under impact. Innovative models and parallel algorithms are thus proposed, allowing to carry out with robustness and performance the corresponding simulations on supercomputers made of interconnected multi-core nodes, with a strict preservation of the quality of the physical solution. This was particularly the main point of the ANR RePDyn project (2010-2013), with CEA as the pilot. (author

The field of Large Eddy Simulation (LES) and hybrids is a vibrant research area. This book runs through all the potential unsteady modelling fidelity ranges, from low-order to LES. The latter is probably the highest fidelity for practical aerospace systems modelling. Cutting edge new frontiers are defined. One example of a pressing environmental concern is noise. For the accurate prediction of this, unsteady modelling is needed. Hence computational aeroacoustics is explored. It is also emerging that there is a critical need for coupled simulations. Hence, this area is also considered and the tensions of utilizing such simulations with the already expensive LES. This work has relevance to the general field of CFD and LES and to a wide variety of non-aerospace aerodynamic systems (e.g. cars, submarines, ships, electronics, buildings). Topics treated include unsteady flow techniques; LES and hybrids; general numerical methods; computational aeroacoustics; computational aeroelasticity; coupled simulations and...

All over the world sport plays a prominent role in society: as a leisure activity for many, as an ingredient of culture, as a business and as a matter of national prestige in such major events as the World Cup in soccer or the Olympic Games. Hence, it is not surprising that science has entered the realm of sports, and, in particular, that computer simulation has become highly relevant in recent years. This is explored in this book by choosing five different sports as examples, demonstrating that computational science and engineering (CSE) can make essential contributions to research on sports topics on both the fundamental level and, eventually, by supporting athletes’ performance.

We introduce a modified and simplified version of the pre-existing fully parallelized three-dimensional Navier–Stokes flow solver known as TPLS. We demonstrate how the simplified version can be used as a pedagogical tool for the study of computationalfluid dynamics (CFDs) and parallelcomputing. TPLS is at its heart a two-phase flow solver, and uses calls to a range of external libraries to accelerate its performance. However, in the present context we narrow the focus of the study to basic hydrodynamics and parallelcomputing techniques, and the code is therefore simplified and modified to simulate pressure-driven single-phase flow in a channel, using only relatively simple Fortran 90 code with MPI parallelization, but no calls to any other external libraries. The modified code is analysed in order to both validate its accuracy and investigate its scalability up to 1000 CPU cores. Simulations are performed for several benchmark cases in pressure-driven channel flow, including a turbulent simulation, wherein the turbulence is incorporated via the large-eddy simulation technique. The work may be of use to advanced undergraduate and graduate students as an introductory study in CFDs, while also providing insight for those interested in more general aspects of high-performance computing. (paper)

We introduce a modified and simplified version of the pre-existing fully parallelized three-dimensional Navier-Stokes flow solver known as TPLS. We demonstrate how the simplified version can be used as a pedagogical tool for the study of computationalfluid dynamics (CFDs) and parallelcomputing. TPLS is at its heart a two-phase flow solver, and uses calls to a range of external libraries to accelerate its performance. However, in the present context we narrow the focus of the study to basic hydrodynamics and parallelcomputing techniques, and the code is therefore simplified and modified to simulate pressure-driven single-phase flow in a channel, using only relatively simple Fortran 90 code with MPI parallelization, but no calls to any other external libraries. The modified code is analysed in order to both validate its accuracy and investigate its scalability up to 1000 CPU cores. Simulations are performed for several benchmark cases in pressure-driven channel flow, including a turbulent simulation, wherein the turbulence is incorporated via the large-eddy simulation technique. The work may be of use to advanced undergraduate and graduate students as an introductory study in CFDs, while also providing insight for those interested in more general aspects of high-performance computing.

the principle behind CFD, the development in numerical schemes and computer size since the 1970s. Special attention is given to the selection of the correct governing equations, to the understanding of low turbulent flow, to the selection of turbulence models, and to addressing situations with more steady...

The interconnection network is one of the most basic components of a massively parallelcomputer system. Such systems consist of hundreds or thousands of processors interconnected to work cooperatively on computations. One of the central problems in parallelcomputing is the task of mapping a collection of processes onto the processors and routing network of a parallel machine. Once this mapping is done, it is critical to schedule computations within and communication among processor from universities and laboratories, as well as practitioners involved in the design, implementation, and application of massively parallel systems. Focusing on interconnection networks of parallel architectures of today and of the near future , the book includes topics such as network topologies,network properties, message routing, network embeddings, network emulation, mappings, and efficient scheduling. inputs for a process are available where and when the process is scheduled to be computed. This book contains the refereed pro...

A new message passing library, Stampi, has been developed to realize a computation with different kind of parallelcomputers arbitrarily and making MPI (Message Passing Interface) as an unique interface for communication. Stampi is based on MPI2 specification. It realizes dynamic process creation to different machines and communication between spawned one within the scope of MPI semantics. Vender implemented MPI as a closed system in one parallel machine and did not support both functions; process creation and communication to external machines. Stampi supports both functions and enables us distributed parallelcomputing. Currently Stampi has been implemented on COMPACS (COMplex PArallelComputer System) introduced in CCSE, five parallelcomputers and one graphic workstation, and any communication on them can be processed on. (author)

A method for parallelcomputer calculation and OSIRIS program complex realizing it and designed for numerical plasma simulation by the macroparticle method are described. The calculation can be carried out either with one or simultaneously with two computers BESM-6, that is provided by some package of interacting programs functioning in every computer. Program interaction in every computer is based on event techniques realized in OS DISPAK. Parallelcomputer calculation with two BESM-6 computers allows to accelerate the computation 1.5 times

Design of a gas distributor to distribute gas flow into parallel channels for Solid Oxide Cells (SOC) is optimized, with respect to flow distribution, using ComputationalFluid Dynamics (CFD) modelling. The CFD model is based on a 3d geometric model and the optimized structural parameters include...... the width of the channels in the gas distributor and the area in front of the parallel channels. The flow of the optimized design is found to have a flow uniformity index value of 0.978. The effects of deviations from the assumptions used in the modelling (isothermal and non-reacting flow) are evaluated...... and it is found that a temperature gradient along the parallel channels does not affect the flow uniformity, whereas a temperature difference between the channels does. The impact of the flow distribution on the maximum obtainable conversion during operation is also investigated and the obtainable overall...

Improved techniques are provided for parallel writing of data to a shared object in a parallelcomputing system. A method is provided for storing data generated by a plurality of parallel processes to a shared object in a parallelcomputing system. The method is performed by at least one of the processes and comprises: dynamically determining a block size for storing the data; exchanging a determined amount of the data with at least one additional process to achieve a block of the data having the dynamically determined block size; and writing the block of the data having the dynamically determined block size to a file system. The determined block size comprises, e.g., a total amount of the data to be stored divided by the number of parallel processes. The file system comprises, for example, a log structured virtual parallel file system, such as a Parallel Log-Structured File System (PLFS).

At CCSE, Center for Promotion of Computational Science and Engineering, a new message passing library for heterogeneous and distributed parallelcomputing has been developed, and it is called as Stampi. Stampi enables us to communicate between any combination of parallelcomputers as well as workstations. Currently, a Stampi system is constructed from Stampi library and Stampi/Java. It provides functions to connect a Stampi application with not only those on COMPACS, COMplex ParallelComputer System, but also applets which work on WWW browsers. This report summarizes the specifications of Stampi and details the development of its system. (author)

Full Text Available The steady boundary layer flow and heat transfer of a viscous fluid on a moving flat plate in a parallel free stream with variable fluid properties are studied. Two special cases, namely, constant fluid properties and variable fluid viscosity, are considered. The transformed boundary layer equations are solved numerically by a finite-difference scheme known as Keller-box method. Numerical results for the flow and the thermal fields for both cases are obtained for various values of the free stream parameter and the Prandtl number. It is found that dual solutions exist for both cases when the fluid and the plate move in the opposite directions. Moreover, fluid with constant properties shows drag reduction characteristics compared to fluid with variable viscosity.

The field of ComputationalFluid Dynamics (CFD) has advanced to the point where it can now be used for many applications in fluid mechanics research and aerospace vehicle design. A few applications being explored at NASA Ames Research Center will be presented and discussed. The examples presented will range in speed from hypersonic to low speed incompressible flow applications. Most of the results will be from numerical solutions of the Navier-Stokes or Euler equations in three space dimensions for general geometry applications. Computational results will be used to highlight the presentation as appropriate. Advances in computational facilities including those associated with NASA's CAS (Computational Aerosciences) Project of the Federal HPCC (High Performance Computing and Communications) Program will be discussed. Finally, opportunities for future research will be presented and discussed. All material will be taken from non-sensitive, previously-published and widely-disseminated work.

Nanofluid is becoming a promising heat transfer fluids due to its improved thermo-physical properties and heat transfer performance. Micro channel heat transfer has potential application in the cooling high power density microchips in CPU system, micro power systems and many such miniature thermal systems which need advanced cooling capacity. Use of nanofluids enhances the effectiveness of t=scu systems. ComputationalFluid Dynamics (CFD) is a very powerful tool in computational analysis of the various physical processes. It application to the situations of flow and heat transfer analysis of the nano fluids is catching up very fast. Present research paper gives a brief account of the methodology of the CFD and also summarizes its application on nano fluid and heat transfer for microchannel cases.

The Application Portable Parallel Library (APPL) is a subroutine-based library of communication primitives that is callable from applications written in FORTRAN or C. APPL provides a consistent programmer interface to a variety of distributed and shared-memory multiprocessor MIMD machines. The objective of APPL is to minimize the effort required to move parallel applications from one machine to another, or to a network of homogeneous machines. APPL encompasses many of the message-passing primitives that are currently available on commercial multiprocessor systems. This paper describes APPL (version 2.3.1) and its usage, reports the status of the APPL project, and indicates possible directions for the future. Several applications using APPL are discussed, as well as performance and overhead results.

Understanding the dynamics and the mutual interaction among various types of vortical motions is a key ingredient in clarifying and controlling fluid motion. In the paper several different cases related to vortex tube interactions are presented. Due to problems with very long computation times on the single processor, the vortex-in-cell (VIC) method is implemented on the multicore architecture of a graphics processing unit (GPU). Numerical results of leapfrogging of two vortex rings for inviscid and viscous fluid are presented as test cases for the new multi-GPU implementation of the VIC method. Influence of the Reynolds number on the reconnection process is shown for two examples: antiparallel vortex tubes and orthogonally offset vortex tubes. Our aim is to show the great potential of the VIC method for solutions of three-dimensional flow problems and that the VIC method is very well suited for parallelcomputation. (paper)

Full Text Available In this paper, we investigate the thermal radiation effects in a time-dependent two-dimensional flow of a Casson fluid between two parallel disks when upper disk is taken to be impermeable and lower one is porous. Suitable similarity transforms are employed to convert governing partial differential equations into system of ordinary differential equations. Well known Homotopy Analysis Method (HAM is employed to obtain the expressions for velocity and temperature profiles. Effects of different physical parameters such as squeeze number $S$, Prandtl number $Pr$, Eckert number $Ec$ and the dimensionless length on the flow are also discussed with the help of graphs for velocity and temperature coupled with a comprehensive discussions. The skin friction coefficient and local Nusselt number along with convergence of the series solutions obtained by HAM are presented in tabulated form, while numerical solution is obtained by $RK-4$ method and comparison shows an excellent agreement between both the solutions.

A computer simulation is described on rough hard spheres with a continuously variable roughness parameter, including the limits of smooth and completely rough spheres. A system of 500 particles is simulated with a homogeneous mass distribution at 8 different densities and for 5 different values of the roughness parameter. For these 40 physically different situations the intermediate scattering function for 6 values of the wave number, the orientational correlation functions and the velocity autocorrelation functions have been calculated. A comparison has been made with a neutron scattering experiment on neopentane and agreement was good for an intermediate value of the roughness parameter. Some often made approximations in neutron scattering experiments are also checked. The influence of the variable roughness parameter on the correlation functions has been investigated and three simple stochastic models studied to describe the orientational correlation function which shows the most pronounced dependence on the roughness. (Auth.)

As the computational requirement of applications in computational science continues to grow tremendously, the use of computational resources distributed across the Wide Area Network (WAN) becomes advantageous. However, not all applications can be executed over the WAN due to communication overhead that can drastically slowdown the computation. In this paper, we introduce an event based simulator to investigate the performance of parallel algorithms executed over the WAN. The event based simulator known as SIMPAR (SIMulator for PARallelcomputation), simulates the actual computations and communications involved in parallelcomputation over the WAN using time stamps. Visualization of real time applications require steady stream of processed data flow for visualization purposes. Hence, SIMPAR may prove to be a valuable tool to investigate types of applications and computing resource requirements to provide uninterrupted flow of processed data for real time visualization purposes. The results obtained from the simulation show concurrence with the expected performance using the L-BSP model.

of the input parameters. Such an exploration process is however not possible if the simulation is computationally too expensive. For these cases we present in this paper a scalable computational steering approach utilizing a fast surrogate model as substitute

Then to solve ultimate pit limit problem it is required to find such a sub graph in a graph whose sum of weights will be maximal. One of the possible solutions of this problem is using genetic algorithms. We use a ... Details of implementation parallel genetic algorithm for searching open pit limits are provided. Comparison with ...

In this paper the implementation of a parallel watershed algorithm is described. The algorithm has been implemented on a Cray J932, which is a shared memory architecture with 32 processors. The watershed transform has generally been considered to be inherently sequential, but recently a few research

The search for high performance and low cost hardware and software solutions always guides the developments performed at the IEN parallelcomputing laboratory. In this context, this dissertation about the building of programs for visualization of computationalfluid dynamics (CFD) simulations using the open source software OpenDx was written. The programs developed are useful to produce videos and images in two or three dimensions. They are interactive, easily to use and were designed to serve fluid dynamics researchers. A detailed description about how this programs were developed and the complete instructions of how to use them was done. The use of OpenDx as development tool is also introduced. There are examples that help the reader to understand how programs can be useful for many applications. (author)

Hardware faults location in a data communications network of a parallelcomputer. Such a parallelcomputer includes a plurality of compute nodes and a data communications network that couples the compute nodes for data communications and organizes the compute node as a tree. Locating hardware faults includes identifying a next compute node as a parent node and a root of a parent test tree, identifying for each child compute node of the parent node a child test tree having the child compute node as root, running a same test suite on the parent test tree and each child test tree, and identifying the parent compute node as having a defective link connected from the parent compute node to a child compute node if the test suite fails on the parent test tree and succeeds on all the child test trees.

Traditional research methodologies in the human respiratory system have always been challenging due to their invasive nature. Recent advances in medical imaging and computationalfluid dynamics (CFD) have accelerated this research. This book compiles and details recent advances in the modelling of the respiratory system for researchers, engineers, scientists, and health practitioners. It breaks down the complexities of this field and provides both students and scientists with an introduction and starting point to the physiology of the respiratory system, fluid dynamics and advanced CFD modeling tools. In addition to a brief introduction to the physics of the respiratory system and an overview of computational methods, the book contains best-practice guidelines for establishing high-quality computational models and simulations. Inspiration for new simulations can be gained through innovative case studies as well as hands-on practice using pre-made computational code. Last but not least, students and researcher...

Natural ventilation is widely recognised as contributing towards low-energy building design. The requirement to reduce energy usage in new buildings has rejuvenated interest in natural ventilation. This thesis deals with computer modelling of natural displacement ventilation driven either by buoyancy or buoyancy combined with wind forces. Two benchmarks have been developed using computationalfluid dynamics (CFD) in order to evaluate the accuracy with which CFD is able to mo...

This paper discusses a repertoire of three-dimensional computer programs developed to perform critical analysis of single-phase, two-phase and multi-fluid flow in reactor components. The basic numerical approach to solving the governing equations common to all the codes is presented and the additional constitutive relationships required for closure are discussed. Particular applications are presented for a number of computer codes. (author). 12 refs

Couette flow between parallel plates is a classical problem that has important applications in various industrial processing. In this investigation an analytical solution was obtained to predict the steady and unsteady Couette flow between parallel plates. One of the plates was stationary and the other plate moved with constant velocity. The governing partial differential equations were solved numerically using Crank-Nicolson implicit method to represent the flow behavior of the fluid

A finite-difference scheme for solving complex three-dimensional aerodynamic flow on parallel-processing supercomputers is presented. The method consists of a basic flow solver with multigrid convergence acceleration, embedded grid refinements, and a zonal equation scheme. Multitasking and vectorization have been incorporated into the algorithm. Results obtained include multiprocessed flow simulations from the Cray X-MP and Cray-2. Speedups as high as 3.3 for the two-dimensional case and 3.5 for segments of the three-dimensional case have been achieved on the Cray-2. The entire solver attained a factor of 2.7 improvement over its unitasked version on the Cray-2. The performance of the parallel algorithm on each machine is analyzed. 14 refs

ComputationalFluid Dynamics simulation of air flow distribution, air velocity and pressure field pattern as it will affect moisture transient in a cabinet tray dryer is performed using SolidWorks Flow Simulation (SWFS) 2014 SP 4.0 program. The model used for the drying process in this experiment was designed with Solid ...

An interconnection between a building energy performance simulation program and a ComputationalFluid Dynamics program (CFD) for room air distribution will be introduced for improvement of the predictions of both the energy consumption and the indoor environment. The building energy performance...

This paper presents a numerical model that by means of computationalfluid dynamics (CFD) is capable of dealing with both pollutant transport across the boundary layer and internal diffusion in the source without prior knowledge of which is the limiting process. The model provides the concentration...

Computer code has been developed to provide thermodynamic and transport properties of liquid argon, carbon dioxide, carbon monoxide, fluorine, helium, methane, neon, nitrogen, oxygen, and parahydrogen. Equation of state and transport coefficients are updated and other fluids added as new material becomes available.

engineering computationalfluid dynamics (CFD) simulation program ANSYS CFX and a CFD based representative program RealFlow are investigated. These two programs represent two types of CFD based tools available for use during phases of an architectural design process. However, as outlined in two case studies...

A computer program has been developed for fluid hammer analysis of piping systems attached to a vessel which has undergone a known rapid pressure transient. The program is based on the characteristics method for solution of the partial differential equations of motion and continuity. Column separation logic is included for situations in which pressures fall to saturation values

First, an attempt is made towards gaining a more systematic understanding of recent progress in multiscale modelling in computational solid and fluid mechanics. Sub- sequently, the discussion is focused on variational multiscale methods for the compressible and incompressible Navier-Stokes

In this work, we present a multi-camera surveillance system based on the use of self-organizing neural networks to represent events on video. The system processes several tasks in parallel using GPUs (graphic processor units). It addresses multiple vision tasks at various levels, such as segmentation, representation or characterization, analysis and monitoring of the movement. These features allow the construction of a robust representation of the environment and interpret the behavior of mob...

Full Text Available Compared to the hydrostatic hydrodynamic model, the non-hydrostatic hydrodynamic model can accurately simulate flows that feature vertical accelerations. The model’s low computational efficiency severely restricts its wider application. This paper proposes a non-hydrostatic hydrodynamic model based on a multithreading parallelcomputing method. The horizontal momentum equation is obtained by integrating the Navier–Stokes equations from the bottom to the free surface. The vertical momentum equation is approximated by the Keller-box scheme. A two-step method is used to solve the model equations. A parallel strategy based on block decomposition computation is utilized. The original computational domain is subdivided into two subdomains that are physically connected via a virtual boundary technique. Two sub-threads are created and tasked with the computation of the two subdomains. The producer–consumer model and the thread lock technique are used to achieve synchronous communication between sub-threads. The validity of the model was verified by solitary wave propagation experiments over a flat bottom and slope, followed by two sinusoidal wave propagation experiments over submerged breakwater. The parallelcomputing method proposed here was found to effectively enhance computational efficiency and save 20%–40% computation time compared to serial computing. The parallel acceleration rate and acceleration efficiency are approximately 1.45% and 72%, respectively. The parallelcomputing method makes a contribution to the popularization of non-hydrostatic models.

To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallelcomputational framework, high

Continuing improvement in processing speed has made it feasible to solve the Reynolds-Averaged Navier-Stokes equations for simple three-dimensional flows on advanced workstations. Combining multiple workstations into a network-based heterogeneous parallelcomputer allows the application of programming principles learned on MIMD (Multiple Instruction Multiple Data) distributed memory parallelcomputers to the solution of larger problems. An overset-grid flow solution code has been developed which uses a cluster of workstations as a network-based parallelcomputer. Inter-process communication is provided by the Parallel Virtual Machine (PVM) software. Solution speed equivalent to one-third of a Cray-YMP processor has been achieved from a cluster of nine commonly used engineering workstation processors. Load imbalance and communication overhead are the principal impediments to parallel efficiency in this application.

A multigrid technique useful for accelerating the convergence of Euler and Navier-Stokes flow computations has been restructured to improve its performance on both SIMD and MIMD computers. The new algorithm allows both the construction of longer coarse-grid vectors and the multitasking of entire grids. Computational results are presented for the CDC Cyber 205, Cray X-MP, and Denelcor HEP I. 15 references

The advent of Cloud and the popularization of mobile devices have led us to a shift in computing access. Computing users will have an interaction display while the real computation will be performed remotely, in the Cloud. COMPSs-Mobile is a framework that aims to ease the development of energy-efficient and high-performing applications for this environment. The framework provides an infrastructure-unaware programming model that allows developers to code regular Android applications that, ...

Subsequent to the introduction of High Performance Computing in the developed countries, the Organization for Economic Cooperation and Development/Nuclear Energy Agency (OECD/NEA) created the Task Force on Adapting Computer Codes in Nuclear Applications to Parallel Architectures (under the guidance of the Nuclear Science Committee`s Working Party on Advanced Computing) to study the growth area in supercomputing and its applicability to the nuclear community`s computer codes. The result has been four years of investigation for the Task Force in different subject fields - deterministic and Monte Carlo radiation transport, computational mechanics and fluid dynamics, nuclear safety, atmospheric models and waste management.

Subsequent to the introduction of High Performance Computing in the developed countries, the Organization for Economic Cooperation and Development/Nuclear Energy Agency (OECD/NEA) created the Task Force on Adapting Computer Codes in Nuclear Applications to Parallel Architectures (under the guidance of the Nuclear Science Committee's Working Party on Advanced Computing) to study the growth area in supercomputing and its applicability to the nuclear community's computer codes. The result has been four years of investigation for the Task Force in different subject fields - deterministic and Monte Carlo radiation transport, computational mechanics and fluid dynamics, nuclear safety, atmospheric models and waste management

This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallelcomputer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7). The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions in order to support software-implemented fault detection techniques (Ch. 7). The work described in this dissertation gains more importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of

Numerical simulation allows the theorists to convince themselves about the validity of the models they use. Particularly by simulating the spin lattices one can judge about the validity of a conjecture. Simulating a system defined by a large number of degrees of freedom requires highly sophisticated machines. This study deals with modelling the magnetic interactions between the ions of a crystal. Many exact results have been found for spin 1/2 systems but not for systems of other spins for which many simulation have been carried out. The interest for simulations has been renewed by the Haldane`s conjecture stipulating the existence of a energy gap between the ground state and the first excited states of a spin 1 lattice. The existence of this gap has been experimentally demonstrated. This report contains the following four chapters: 1. Spin systems; 2. Calculation of eigenvalues; 3. Programming; 4. Parallel calculation 14 refs., 6 figs.

A kind of video images digitizing circuit based on parallel port was developed to digitize the flash x ray images in our Multi-Channel Digital Flash X ray Imaging System. The circuit can digitize the video images and store in static memory. The digital images can be transferred to computer through parallel port and can be displayed, processed and stored. (authors)

Full Text Available This article deals with mathematical models and algorithms, providing mobility of sequential programs parallel representation on the high-level language, presents formal model of operation environment processes management, based on the proposed model of programs parallel representation, presenting computation process on the base of multi-core processors.

Thermal-hydraulic analyses play an important role in design and reload analysis of nuclear power plants. These analyses have historically relied on early generation computationalfluid dynamics capabilities, originally developed in the 1960s and 1970s. Over the last twenty years, however, dramatic improvements in both computationalfluid dynamics codes in the commercial sector and in computing power have taken place. These developments offer the possibility of performing large scale, high fidelity, core thermal hydraulics analysis. Such analyses will allow a determination of the conservatism employed in traditional design approaches and possibly justify the operation of nuclear power systems at higher powers without compromising safety margins. The objective of this work is to demonstrate such a large scale analysis approach using a state of the art CFD code, STAR-CD, and the computing power of massively parallelcomputers, provided by IBM. A high fidelity representation of a current generation PWR was analyzed with the STAR-CD CFD code and the results were compared to traditional analyses based on the VIPRE code. Current design methodology typically involves a simplified representation of the assemblies, where a single average pin is used in each assembly to determine the hot assembly from a whole core analysis. After determining this assembly, increased refinement is used in the hot assembly, and possibly some of its neighbors, to refine the analysis for purposes of calculating DNBR. This latter calculation is performed with sub-channel codes such as VIPRE. The modeling simplifications that are used involve the approximate treatment of surrounding assemblies and coarse representation of the hot assembly, where the subchannel is the lowest level of discretization. In the high fidelity analysis performed in this study, both restrictions have been removed. Within the hot assembly, several hundred thousand to several million computational zones have been used, to

In this paper, a Morphing-based Shape Optimization (MbSO) technique is presented for solving Optimum-Shape Design (OSD) problems in ComputationalFluid Dynamics (CFD). The proposed method couples Free-Form Deformation (FFD) and Evolutionary Computation, and, as its name suggests, relies on the morphing of shape and computational domain, rather than direct shape parameterization. Advantages of the FFD approach compared to traditional parameterization are first discussed. Then, examples of shape and grid deformations by FFD are presented. Finally, the MbSO approach is illustrated and applied through an example: the design of an airfoil for a future Mars exploration airplane.

Several approaches for improving the teaching of basic fluid mechanics using computers are presented. There are two objectives to these approaches: to increase the involvement of the student in the learning process and to present information to the student in a variety of forms. Items discussed include: the preparation of educational videos using the results of computationalfluid dynamics (CFD) calculations, the analysis of CFD flow solutions using workstation based post-processing graphics packages, and the development of workstation or personal computer based simulators which behave like desk top wind tunnels. Examples of these approaches are presented along with observations from working with undergraduate co-ops. Possible problems in the implementation of these approaches as well as solutions to these problems are also discussed.

Two years ago, researchers at Sandia National Laboratories showed that a massively parallelcomputer with 1024 processors could solve scientific problems more than 1000 times faster than a single processor...

This paper derives theoretical estimates of the computational cost for isogeometric multi-frontal direct solver executed on parallel distributed memory machines. We show theoretically that for the Cp-1 global continuity of the isogeometric solution

The objective of the present study was to deepen the knowledge about the functioning of the neural circuits by implementing a signal transmission model using the Graph Theory in a small network of neurons composed of an interconnected reverberant and parallel circuit, in order to investigate the processing of the signals in each of them and the effects on the output of the network. For this, a program was developed in C language and simulations were done using neurophysiological data obtained in the literature.

The fluid dynamics for turbulent flow through rod bundles representative of those used in pressurized water reactors is examined using computationalfluid dynamics (CFD). The rod bundles of the pressurized water reactor examined in this study consist of a square array of parallel rods that are held on a constant pitch by support grids spaced axially along the rod bundle. Split-vane pair support grids are often used to create swirling flow in the rod bundle in an effort to improve the heat transfer characteristics for the rod bundle during both normal operating conditions and in accident condition scenarios. Computationalfluid dynamics simulations for a two subchannel portion of the rod bundle were used to model the flow downstream of a split-vane pair support grid. A high quality computational mesh was used to investigate the choice of turbulence model appropriate for the complex swirling flow in the rod bundle subchannels. Results document a central swirling flow structure in each of the subchannels downstream of the split-vane pairs. Strong lateral flows along the surface of the rods, as well as impingement regions of lateral flow on the rods are documented. In addition, regions of lateral flow separation and low axial velocity are documented next to the rods. Results of the CFD are compared to experimental particle image velocimetry (PIV) measurements documenting the lateral flow structures downstream of the split-vane pairs. Good agreement is found between the computational simulation and experimental measurements for locations close to the support grid. (authors)

Paging memory from random access memory (`RAM`) to backing storage in a parallelcomputer that includes a plurality of compute nodes, including: executing a data processing application on a virtual machine operating system in a virtual machine on a first compute node; providing, by a second compute node, backing storage for the contents of RAM on the first compute node; and swapping, by the virtual machine operating system in the virtual machine on the first compute node, a page of memory from RAM on the first compute node to the backing storage on the second compute node.

In this paper the authors report on efforts to utilize parallelcomputer architectures for the thermal-hydraulic simulation of nuclear power systems and current research efforts toward the development of advanced reactor operator aids and control systems based on this new technology. Many aspects of reactor thermal-hydraulic calculations are inherently parallel, and the computationally intensive portions of these calculations can be effectively implemented on modern computers. Timing studies indicate faster-than-real-time, high-fidelity physics models can be developed when the computational algorithms are designed to take advantage of the computer's architecture. These capabilities allow for the development of novel control systems and advanced reactor operator aids. Coupled with an integral real-time data acquisition system, evolving parallelcomputer architectures can provide operators and control room designers improved control and protection capabilities. Current research efforts are currently under way in this area

At the Center for Promotion of Computational Science and Engineering, a software environment PPExe has been developed to support scientific computing on a parallelcomputer cluster (distributed parallel scientific computing). TME (Task Mapping Editor) is one of components of the PPExe and provides a visual programming environment for distributed parallel scientific computing. Users can specify data dependence among tasks (programs) visually as a data flow diagram and map these tasks onto computers interactively through GUI of TME. The specified tasks are processed by other components of PPExe such as Meta-scheduler, RIM (Resource Information Monitor), and EMS (Execution Management System) according to the execution order of these tasks determined by TME. In this report, we describe the usage of TME. (author)

Methods, apparatus, and products are disclosed for reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application that include: beginning, by each compute node, performance of a blocking operation specified by the parallel application, each compute node beginning the blocking operation asynchronously with respect to the other compute nodes; reducing, for each compute node, power to one or more hardware components of that compute node in response to that compute node beginning the performance of the blocking operation; and restoring, for each compute node, the power to the hardware components having power reduced in response to all of the compute nodes beginning the performance of the blocking operation.

Methods, apparatus, and products are disclosed for reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application that include: beginning, by each compute node, performance of a blocking operation specified by the parallel application, each compute node beginning the blocking operation asynchronously with respect to the other compute nodes; reducing, for each compute node, power to one or more hardware components of that compute node in response to that compute node beginning the performance of the blocking operation; and restoring, for each compute node, the power to the hardware components having power reduced in response to all of the compute nodes beginning the performance of the blocking operation.

Methods, apparatus, and products are disclosed for line-plane broadcasting in a data communications network of a parallelcomputer, the parallelcomputer comprising a plurality of compute nodes connected together through the network, the network optimized for point to point data communications and characterized by at least a first dimension, a second dimension, and a third dimension, that include: initiating, by a broadcasting compute node, a broadcast operation, including sending a message to all of the compute nodes along an axis of the first dimension for the network; sending, by each compute node along the axis of the first dimension, the message to all of the compute nodes along an axis of the second dimension for the network; and sending, by each compute node along the axis of the second dimension, the message to all of the compute nodes along an axis of the third dimension for the network.

We will review some of our recent experiences in computing solutions for nonlinear fluids in relatively simple, two-dimensional geometries. The purpose of this discussion will be to display by example some of the interesting but difficult questions that arise when ill-behaved solutions are obtained numerically. We will consider two examples. As the first example, we will consider a nonlinear elastic (compressible) fluid with chemical reactions and discuss solutions for detonation and detonation failure in a two-dimensional cylinder. In this case, the numerical algorithm utilizes a finite-difference method with artificial viscosity (von Neumann-Richtmyer method) and leads to two, distinctly different, stable solutions depending on the time step criterion used. The second example to be considered involves the convection of a viscous fluid in a rectangular container as a result of an exothermic polymerization reaction. A solidification front develops near the top of the container and propagates down through the fluid, changing the aspect ratio of the region ahead of the front. Using a Galerkin-based finite element method, a numerical solution of the partial differential equations is obtained which tracks the front and correctly predicts the fluid temperatures near the walls. However, the solution also exhibits oscillatory behavior with regard to the number of cells in the fluid ahead of the front and in the strength of the cells. More definitive experiments and analysis are required to determine whether this oscillatory phenomena is a numerical artifact or a physical reality. 20 refs., 14 figs

Climate modeling is one of the grand challenges of computational science, and ocean modeling plays an important role in both understanding the current climatic conditions and predicting future climate change.

Full Text Available In this work, we present a multi-camera surveillance system based on the use of self-organizing neural networks to represent events on video. The system processes several tasks in parallel using GPUs (graphic processor units. It addresses multiple vision tasks at various levels, such as segmentation, representation or characterization, analysis and monitoring of the movement. These features allow the construction of a robust representation of the environment and interpret the behavior of mobile agents in the scene. It is also necessary to integrate the vision module into a global system that operates in a complex environment by receiving images from multiple acquisition devices at video frequency. Offering relevant information to higher level systems, monitoring and making decisions in real time, it must accomplish a set of requirements, such as: time constraints, high availability, robustness, high processing speed and re-configurability. We have built a system able to represent and analyze the motion in video acquired by a multi-camera network and to process multi-source data in parallel on a multi-GPU architecture.

One of the most noticeable effects of air pollution on the properties of the atmosphere is the reduction in visibility. This paper reports the results of investigations of the fluid dynamical and microphysical processes involved in the formation of advection fog on aerosols from combustion-related pollutants, as condensation nuclei. The effects of a polydisperse aerosol distribution, on the condensation/nucleation processes which cause the reduction in visibility are studied. This study demonstrates how computationalfluid mechanics and heat transfer modeling can be applied to simulate the life cycle of the atmosphereic pollution problems.

Computational simulation is an essential tool for the prediction of fluid flow. Many powerful simulation programs exist today. However, using these programs to reliably analyze fluid flow and other physical situations requires considerable human effort and expertise to set up a simulation, determine whether the output makes sense, and repeatedly run the simulation with different inputs until a satisfactory result is achieved. Automating this process is not only of considerable practical importance but will also significantly advance basic artificial intelligence (AI) research in reasoning about the physical world.

A symposium on the technology and applications of computationalfluid dynamics (CFD) was held in Pretoria from 21-23 Nov 1988. The following aspects were covered: multilevel adaptive methods and multigrid solvers in CFD, a symbolic processing approach to CFD, interplay between CFD and analytical approximations, CFD on a transfer array, the application of CFD in high speed aerodynamics, numerical simulation of laminar blood flow, two-phase flow modelling in nuclear accident analysis, and the finite difference scheme for the numerical solution of fluid flow

We extend the transfer matrix method of one-dimensional hard core fluids placed between confining walls for that case where the particles can pass each other and at most two layers can form. We derive an eigenvalue equation for a quasi-one-dimensional system of hard squares confined between two parallel walls, where the pore width is between σ and 3σ (σ is the side length of the square). The exact equation of state and the nearest neighbor distribution functions show three different structures: a fluid phase with one layer, a fluid phase with two layers, and a solid-like structure where the fluid layers are strongly correlated. The structural transition between differently ordered fluids develops continuously with increasing density, i.e., no thermodynamic phase transition occurs. The high density structure of the system consists of clusters with two layers which are broken with particles staying in the middle of the pore

Current fire simulation systems are capable to utilize advantages of high-performance computer (HPC) platforms available and to model fires efficiently in parallel. In this paper, efficiency of a corridor fire simulation on a HPC computer cluster is discussed. The parallel MPI version of Fire Dynamics Simulator is used for testing efficiency of selected strategies of allocation of computational resources of the cluster using a greater number of computational cores. Simulation results indicate that if the number of cores used is not equal to a multiple of the total number of cluster node cores there are allocation strategies which provide more efficient calculations.

The aim of the PHAETON project is the design of an on-line computer in order to increase the immediate knowledge of the main operating and safety parameters in power plants. A significant stage is the computation of the three dimensional flux distribution. For cost and safety reason a computer based on a parallel microprocessor architecture has been studied. This paper presents a first approach to parallelized three dimensional diffusion calculation. A computing software has been written and built in a four processors demonstrator. We present the realization in progress, concerning the final equipment. 8 refs

The present study describes modifications and additions to the MIPROPS computer program for calculating the thermophysical properties of 17 fluids. These changes include adding new fluids, new properties, and a new interface to the program. The new program allows the user to select the input and output parameters and the units to be displayed for each parameter. Fluids added to the MIPROPS program are carbon dioxide, carbon monoxide, deuterium, helium, normal hydrogen, and xenon. The most recent modifications to the MIPROPS program are the addition of viscosity and thermal conductivity correlations for parahydrogen and the addition of the fluids normal hydrogen and xenon. The recently added interface considerably increases the program's utility.

Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallelcomputer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to a deterministic data communications network through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and the deterministic data communications network; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.

Digital tomosynthesis is a novel technology that has been developed for various clinical applications. Parallel imaging configuration is utilised in a few tomosynthesis imaging areas such as digital chest tomosynthesis. Recently, parallel imaging configuration for breast tomosynthesis began to appear too. In this paper, we present the investigation on computational analysis of impulse response characterisation as the start point of our important research efforts to optimise the parallel imaging configurations. Results suggest that impulse response computational analysis is an effective method to compare and optimise imaging configurations.

Although there are numerous algorithms for solving Riccati equations, there still remains a need for algorithms which can operate efficiently on large problems and on parallel machines. This paper gives a new homotopy-based algorithm for solving Riccati equations on a shared memory parallelcomputer. The central part of the algorithm is the computation of the kernel of the Jacobian matrix, which is essential for the corrector iterations along the homotopy zero curve. Using a Schur decomposition the tensor product structure of various matrices can be efficiently exploited. The algorithm allows for efficient parallelization on shared memory machines

With the advantage of fine spectral resolution, hyperspectral imagery provides great potential for cell classification. This paper provides a promising classification system including the following three stages: (1) band selection for a subset of spectral bands with distinctive and informative features, (2) spectral-spatial feature extraction, such as local binary patterns (LBP), and (3) followed by an effective classifier. Moreover, these three steps are further implemented on graphics processing units (GPU) respectively, which makes the system real-time and more practical. The GPU parallel implementation is compared with the serial implementation on central processing units (CPU). Experimental results based on real medical hyperspectral data demonstrate that the proposed system is able to offer high accuracy and fast speed, which are appealing for cell classification in medical hyperspectral imagery. (paper)

A parallelcomputer includes nodes, each having main memory and a messaging unit (MU). Each MU includes computer memory, which in turn includes, MU message buffers. Each MU message buffer is associated with an uninitialized process on the compute node. In the parallelcomputer, managing internode data communications for an uninitialized process includes: receiving, by an MU of a compute node, one or more data communications messages in an MU message buffer associated with an uninitialized process on the compute node; determining, by an application agent, that the MU message buffer associated with the uninitialized process is full prior to initialization of the uninitialized process; establishing, by the application agent, a temporary message buffer for the uninitialized process in main computer memory; and moving, by the application agent, data communications messages from the MU message buffer associated with the uninitialized process to the temporary message buffer in main computer memory.

This thesis presents new algorithms for low and intermediate level computer vision. The guiding ideas in the presented approach are those of hierarchical and adaptive processing, concurrent computation, and supervised learning. Processing of the visual data at different resolutions is used not only to reduce the amount of computation necessary to reach the fixed point, but also to produce a more accurate estimation of the desired parameters. The presented adaptive multiple scale technique is applied to the problem of motion field estimation. Different parts of the image are analyzed at a resolution that is chosen in order to minimize the error in the coefficients of the differential equations to be solved. Tests with video-acquired images show that velocity estimation is more accurate over a wide range of motion with respect to the homogeneous scheme. In some cases introduction of explicit discontinuities coupled to the continuous variables can be used to avoid propagation of visual information from areas corresponding to objects with different physical and/or kinematic properties. The human visual system uses concurrent computation in order to process the vast amount of visual data in "real -time." Although with different technological constraints, parallelcomputation can be used efficiently for computer vision. All the presented algorithms have been implemented on medium grain distributed memory multicomputers with a speed-up approximately proportional to the number of processors used. A simple two-dimensional domain decomposition assigns regions of the multiresolution pyramid to the different processors. The inter-processor communication needed during the solution process is proportional to the linear dimension of the assigned domain, so that efficiency is close to 100% if a large region is assigned to each processor. Finally, learning algorithms are shown to be a viable technique to engineer computer vision systems for different applications starting from

The wiping, or doctoring, process in gravure printing presents a fundamental barrier to resolving the micron-sized features desired in printed electronics applications. This barrier starts with the residual fluid film left behind after wiping, and its importance grows as feature sizes are reduced, especially as the feature size approaches the thickness of the residual fluid film. In this work, various mechanical complexities are considered in a computational model developed to predict the residual fluid film thickness. Lubrication models alone are inadequate, and deformation of the doctor blade body together with elastohydrodynamic lubrication must be considered to make the model predictive of experimental trends. Moreover, model results demonstrate that the particular form of the wetted region of the blade has a significant impact on the model's ability to reproduce experimental measurements.

The skyline is an important query operator for multi-criteria decision making. It reduces a dataset to only those points that offer optimal trade-offs of dimensions. In general, it is very expensive to compute. Recently, multi-core CPU algorithms have been proposed to accelerate the computation...... of the skyline. However, they do not sufficiently minimize dominance tests and so are not competitive with state-of-the-art sequential algorithms. In this paper, we introduce a novel multi-core skyline algorithm, Hybrid, which processes points in blocks. It maintains a shared, global skyline among all threads...

The current state of the art in computational aerodynamics for whole-body aircraft flowfield simulations is described. Recent advances in geometry modeling, surface and volume grid generation, and flow simulation algorithms have led to accurate flowfield predictions for increasingly complex and realistic configurations. As a result, computational aerodynamics has emerged as a crucial enabling technology for the design and development of flight vehicles. Examples illustrating the current capability for the prediction of transport and fighter aircraft flowfields are presented. Unfortunately, accurate modeling of turbulence remains a major difficulty in the analysis of viscosity-dominated flows. In the future, inverse design methods, multidisciplinary design optimization methods, artificial intelligence technology, and massively parallelcomputer technology will be incorporated into computational aerodynamics, opening up greater opportunities for improved product design at substantially reduced costs.

A new set of two-fluid equations which are valid from collisional to weakly collisional limits are derived. Starting from gyrokinetic equations in flux coordinates with no zeroth order drifts, a set of moment equations describing plasma transport along the field lines of a space and time dependent magnetic field are derived. No restriction on the anisotropy of the ion distribution function is imposed. In the highly collisional limit, these equations reduce to those of Braginskii while in the weakly collisional limit, they are similar to the double adiabatic or Chew, Goldberger, and Low (CGL) equations. The new transport equations are used to study the effects of collisionality, magnetic field structure, and plasma anisotropy on plasma parallel transport. Numerical examples comparing these equations with conventional transport equations show that the conventional equations may contain large errors near the sound speed (M approx. = 1). It is also found that plasma anisotropy, which is not included in the conventional equations, is a critical parameter in determining plasma transport in varying magnetic field. The new transport equations are also used to study axial confinement in multiple mirror devices from the strongly to weakly collisional regime. A new ion conduction model was worked out to extend the regime of validity of the transport equations to the low density multiple mirror regime

Simulation of physical and chemical processes occurring in the nanomaterial production technologies is a computationally challenging problem, due to the great number of coupled processes, time and length scales to be taken into account. To solve such complex problems with a good level of detail in a

The general purpose particle transport Monte Carlo code, MORSE-C.G., is implemented on a parallelcomputing transputer-based system having MIMD architecture. Example problems are solved which are representative of the 3-principal types of problem that can be solved by the original serial code, namely, fixed source, eigenvalue (k-eff) and time-dependent. The results from the parallelized version of the code are compared in tables with the serial code run on a mainframe serial computer, and with an independent, deterministic transport code. The performance of the parallelcomputer as the number of processors is varied is shown graphically. For the parallel strategy used, the loss of efficiency as the number of processors is increased, is investigated. (author)

The modeling of multiphase fluid flow in porous medium is of interest in the field of reservoir simulation. The promising numerical methods in the literature are mostly based on the explicit or semi-implicit approach, which both have certain stability restrictions on the time step size. In this work, we introduce and study a scalable fully implicit solver for the simulation of two-phase flow in a porous medium with capillarity, gravity and compressibility, which is free from the limitations of the conventional methods. In the fully implicit framework, a mixed finite element method is applied to discretize the model equations for the spatial terms, and the implicit Backward Euler scheme with adaptive time stepping is used for the temporal integration. The resultant nonlinear system arising at each time step is solved in a monolithic way by using a Newton–Krylov type method. The corresponding linear system from the Newton iteration is large sparse, nonsymmetric and ill-conditioned, consequently posing a significant challenge to the fully implicit solver. To address this issue, the family of additive Schwarz preconditioners is taken into account to accelerate the convergence of the linear system, and thereby improves the robustness of the outer Newton method. Several test cases in one, two and three dimensions are used to validate the correctness of the scheme and examine the performance of the newly developed algorithm on parallelcomputers.

The modeling of multiphase fluid flow in porous medium is of interest in the field of reservoir simulation. The promising numerical methods in the literature are mostly based on the explicit or semi-implicit approach, which both have certain stability restrictions on the time step size. In this work, we introduce and study a scalable fully implicit solver for the simulation of two-phase flow in a porous medium with capillarity, gravity and compressibility, which is free from the limitations of the conventional methods. In the fully implicit framework, a mixed finite element method is applied to discretize the model equations for the spatial terms, and the implicit Backward Euler scheme with adaptive time stepping is used for the temporal integration. The resultant nonlinear system arising at each time step is solved in a monolithic way by using a Newton–Krylov type method. The corresponding linear system from the Newton iteration is large sparse, nonsymmetric and ill-conditioned, consequently posing a significant challenge to the fully implicit solver. To address this issue, the family of additive Schwarz preconditioners is taken into account to accelerate the convergence of the linear system, and thereby improves the robustness of the outer Newton method. Several test cases in one, two and three dimensions are used to validate the correctness of the scheme and examine the performance of the newly developed algorithm on parallelcomputers.

Conventional image reconstruction uses simplified physical models of projection. However, real physics, for example 3D reconstruction, takes too long time to process all the data in clinic and is unable in a common reconstruction machine because of the large memory for complex physical models. We suggest the realistic distributed memory model of fast-reconstruction using parallel processing on personal computers to enable large-scale technologies. The preliminary tests for the possibility on virtual machines and various performance test on commercial super computer, Tachyon were performed. Expectation maximization algorithm with common 2D projection and realistic 3D line of response were tested. Since the process time was getting slower (max 6 times) after a certain iteration, optimization for compiler was performed to maximize the efficiency of parallelization. Parallel processing of a program on multiple computers was available on Linux with MPICH and NFS. We verified that differences between parallel processed image and single processed image at the same iterations were under the significant digits of floating point number, about 6 bit. Double processors showed good efficiency (1.96 times) of parallelcomputing. Delay phenomenon was solved by vectorization method using SSE. Through the study, realistic parallelcomputing system in clinic was established to be able to reconstruct by plenty of memory using the realistic physical models which was impossible to simplify

A new message passing library, Stampi, has been developed to realize a computation with different kind of parallelcomputers arbitrarily and making MPI (Message Passing Interface) as an unique interface for communication. Stampi is based on the MPI2 specification, and it realizes dynamic process creation to different machines and communication between spawned one within the scope of MPI semantics. Main features of Stampi are summarized as follows: (i) an automatic switch function between external- and internal communications, (ii) a message routing/relaying with a routing module, (iii) a dynamic process creation, (iv) a support of two types of connection, Master/Slave and Client/Server, (v) a support of a communication with Java applets. Indeed vendors implemented MPI libraries as a closed system in one parallel machine or their systems, and did not support both functions; process creation and communication to external machines. Stampi supports both functions and enables us distributed parallelcomputing. Currently Stampi has been implemented on COMPACS (COMplex PArallelComputer System) introduced in CCSE, five parallelcomputers and one graphic workstation, moreover on eight kinds of parallel machines, totally fourteen systems. Stampi provides us MPI communication functionality on them. This report describes mainly the usage of Stampi. (author)

Karlsruhe Institute of Technology (KIT) is developing the parallelcomputationalfluid dynamics code GASFLOW-MPI as a best-estimate tool for predicting transport, mixing, and combustion of hydrogen and other gases in nuclear reactor containments and other facility buildings. GASFLOW-MPI is a finite-volume code based on proven computationalfluid dynamics methodology that solves the compressible Navier-Stokes equations for three-dimensional volumes in Cartesian or cylindrical coordinates.

A system and method for generating fluid flow parameter data for use in aerodynamic heating analysis. Computationalfluid dynamics data is generated for a number of points in an area on a surface to be analyzed. Sub-areas corresponding to areas of the surface for which an aerodynamic heating analysis is to be performed are identified. A computer system automatically determines a sub-set of the number of points corresponding to each of the number of sub-areas and determines a value for each of the number of sub-areas using the data for the sub-set of points corresponding to each of the number of sub-areas. The value is determined as an average of the data for the sub-set of points corresponding to each of the number of sub-areas. The resulting parameter values then may be used to perform an aerodynamic heating analysis.

Implementation of two distributed graphics programs used in computationalfluid dynamics is discussed. Both programs are interactive in nature. They run on a CRAY-2 supercomputer and use a Silicon Graphics Iris workstation as the front-end machine. The hardware and supporting software are from the Numerical Aerodynamic Simulation project. The supercomputer does all numerically intensive work and the workstation, as the front-end machine, allows the user to perform real-time interactive transformations on the displayed data. The first program was written as a distributed program that computes particle traces for fluid flow solutions existing on the supercomputer. The second is an older post-processing and plotting program modified to run in a distributed mode. Both programs have realized a large increase in speed over that obtained using a single machine. By using these programs, one can learn quickly about complex features of a three-dimensional flow field. Some color results are presented

This article provides a brief description of the safety principle on which liquid metal cooled fast breeder reactors (LMFBRs) is based and the roles of computations in the safety practices. A number of thermohydraulics models have been developed to date that successfully describe several of the important types of fluids and materials motion encountered in the analysis of postulated accidents in LMFBRs. Most of these models use a mixture of implicit and explicit numerical solution techniques in solving a set of conservation equations formulated in Eulerian coordinates, with special techniques included to specific situations. Typical computational thermo-fluid dynamics approaches are discussed in particular areas of analyses of the physical phenomena relevant to the fuel subassembly thermohydraulics design and that involve describing the motion of molten materials in the core over a large scale. (orig.)

A new cardioplegia heat exchanger has been developed by Sorin Biomedica. A three-dimensional computer-aided design (CAD) model was optimized using computationalfluid dynamics (CFD) modelling. CFD optimization techniques have commonly been applied to velocity flow field analysis, but CFD analysis was also used in this study to predict the heat exchange performance of the design before prototype fabrication. The iterative results of the optimization and the actual heat exchange performance of the final configuration are presented in this paper. Based on the behaviour of this model, both the water and blood fluid flow paths of the heat exchanger were optimized. The simulation predicted superior heat exchange performance using an optimal amount of energy exchange surface area, reducing the total contact surface area, the device priming volume and the material costs. Experimental results confirm the empirical results predicted by the CFD analysis.

In real time simulation/prediction of complex systems such as water-cooled nuclear reactors, if reactor operators had fast simulator/predictors to check the consequences of their operations before implementing them, events such as the incident at Three Mile Island might be avoided. However, existing simulator/predictors such as RELAP run slower than real time on serial computers. It appears that the only way to overcome the barrier to higher computing rates is to use computers with architectures that allow concurrent computations or parallel processing. The computer architecture with the greatest degree of parallelism is labeled Multiple Instruction Stream, Multiple Data Stream (MIMD). An example of a machine of this type is the HEP computer by DENELCOR. It appears that hydrocodes are very well suited for parallelization on the HEP. It is a straightforward exercise to parallelize explicit, one-dimensional Lagrangean hydrocodes in a zone-by-zone parallelization. Similarly, implicit schemes can be parallelized in a zone-by-zone fashion via an a priori, symbolic inversion of the tridiagonal matrix that arises in an implicit scheme. These techniques are extended to Eulerian hydrocodes by using Harlow's rezone technique. The extension from single-phase Eulerian to two-phase Eulerian is straightforward. This step-by-step extension leads to hydrocodes with zone-by-zone parallelization that are capable of two-phase flow simulation. Extensions to two and three spatial dimensions can be achieved by operator splitting. It appears that a zone-by-zone parallelization is the best way to utilize the capabilities of an MIMD machine. 40 references

Computational Aero Sciences and other numeric intensive computation disciplines demand computing throughputs substantially greater than the Teraflops scale systems only now becoming available. The related fields of fluids, structures, thermal, combustion, and dynamic controls are among the interdisciplinary areas that in combination with sufficient resolution and advanced adaptive techniques may force performance requirements towards Petaflops. This will be especially true for compute intensive models such as Navier-Stokes are or when such system models are only part of a larger design optimization computation involving many design points. Yet recent experience with conventional MPP configurations comprising commodity processing and memory components has shown that larger scale frequently results in higher programming difficulty and lower system efficiency. While important advances in system software and algorithms techniques have had some impact on efficiency and programmability for certain classes of problems, in general it is unlikely that software alone will resolve the challenges to higher scalability. As in the past, future generations of high-end computers may require a combination of hardware architecture and system software advances to enable efficient operation at a Petaflops level. The NASA led HTMT project has engaged the talents of a broad interdisciplinary team to develop a new strategy in high-end system architecture to deliver petaflops scale computing in the 2004/5 timeframe. The Hybrid-Technology, MultiThreaded parallelcomputer architecture incorporates several advanced technologies in combination with an innovative dynamic adaptive scheduling mechanism to provide unprecedented performance and efficiency within practical constraints of cost, complexity, and power consumption. The emerging superconductor Rapid Single Flux Quantum electronics can operate at 100 GHz (the record is 770 GHz) and one percent of the power required by convention

This paper deals with the parallel implementations of reconstruction methods in 3D tomography. 3D tomography requires voluminous data and long computation times. Parallelcomputing, on MIMD computers, seems to be a good approach to manage this problem. In this study, we present the different steps of the parallelization on an abstract parallelcomputer. Depending on the method, we use two main approaches to parallelize the algorithms: the local approach and the global approach. Experimental results on MIMD computers are presented. Two 3D images reconstructed from realistic data are showed

Computationalfluid dynamics (CFD) is used routinely to predict air movement and distributions of temperature and concentrations in indoor environments. Modelling and numerical errors are inherent in such studies and must be considered when the results are presented. Here, we discuss modelling as...... the quality of CFD calculations, as well as guidelines for the minimum information that should accompany all CFD-related publications to enable a scientific judgment of the quality of the study....

Fluid-structure interaction problems in nuclear engineering are categorized according to the dominant physical phenomena and the appropriate computational methods. Linear fluid models that are considered include acoustic fluids, incompressible fluids undergoing small disturbances, and small amplitude sloshing. Methods available in general-purpose codes for these linear fluid problems are described. For nonlinear fluid problems, the major features of alternative computational treatments are reviewed; some special-purpose and multipurpose computer codes applicable to these problems are then described. For illustration, some examples of nuclear reactor problems that entail coupled fluid-structure analysis are described along with computational results

Computationalfluid dynamics (CFD) can provide valuable input to the development of advanced plant analysis computer codes. The types of interfacing discussed in this paper will directly contribute to modeling and accuracy improvements throughout the plant system and should result in significant reduction of design conservatisms that have been applied to such analyses in the past

The CRAY multitasking system was developed in order to utilize all four processors and sharply reduce the wall clock run time. This paper describes the techniques used to modify the computationalfluid dynamics code ARC3D for this run and analyzes the achieved speedup. The ARC3D code solves either the Euler or thin-layer N-S equations using an implicit approximate factorization scheme. Results indicate that multitask processing can be used to achieve wall clock speedup factors of over three times, depending on the nature of the program code being used. Multitasking appears to be particularly advantageous for large-memory problems running on multiple CPU computers.

We describe an implementation of QR up- and downdating on a massively parallelcomputer (the Connection Machine CM-200) and show that the algorithm maps well onto the computer. In particular, we show how the use of corrected semi-normal equations for downdating can be efficiently implemented. We...... also illustrate the use of our algorithms in a new LP algorithm....

Full Text Available The growth in the use of computationally intensive statistical procedures, especially with big data, has necessitated the usage of parallelcomputation on diverse platforms such as multicore, GPUs, clusters and clouds. However, slowdown due to interprocess communication costs typically limits such methods to "embarrassingly parallel" (EP algorithms, especially on non-shared memory platforms. This paper develops a broadlyapplicable method for converting many non-EP algorithms into statistically equivalent EP ones. The method is shown to yield excellent levels of speedup for a variety of statistical computations. It also overcomes certain problems of memory limitations.

Full Text Available Neuroimage registration is crucial for brain morphometric analysis and treatment efficacy evaluation. However, existing advanced registration algorithms such as FLIRT and ANTs are not efficient enough for clinical use. In this paper, a GPU implementation of FLIRT with the correlation ratio (CR as the similarity metric and a GPU accelerated correlation coefficient (CC calculation for the symmetric diffeomorphic registration of ANTs have been developed. The comparison with their corresponding original tools shows that our accelerated algorithms can greatly outperform the original algorithm in terms of computational efficiency. This paper demonstrates the great potential of applying these registration tools in clinical applications.

The GAMM Committee for Numerical Methods in Fluid Mechanics organizes workshops which should bring together experts of a narrow field of computationalfluid dynamics (CFD) to exchange ideas and experiences in order to speed-up the development in this field. In this sense it was suggested that a workshop should treat the solution of CFD problems on vector computers. Thus we organized a workshop with the title "The efficient use of vector computers with emphasis on computationalfluid dynamics". The workshop took place at the Computing Centre of the University of Karlsruhe, March 13-15,1985. The participation had been restricted to 22 people of 7 countries. 18 papers have been presented. In the announcement of the workshop we wrote: "Fluid mechanics has actively stimulated the development of superfast vector computers like the CRAY's or CYBER 205. Now these computers on their turn stimulate the development of new algorithms which result in a high degree of vectorization (sca1ar/vectorized execution-time). But w...

In previous work by the authors, the ''optimum finding'' properties of Hopfield neural nets were applied to the nets themselves to create a ''neural compiler.'' This was done in such a way that the problem of programming the attractors of one neural net (called the Slave net) was expressed as an optimization problem that was in turn solved by a second neural net (the Master net). In this series of papers that approach is extended to programming nets that contain interneurons (sometimes called ''hidden neurons''), and thus deals with nets capable of universal computation. 22 refs.

Full Text Available Abstract - The availability of multi-core technology resulted totally new computational era. Researchers are keen to explore available potential in state of art-machines for breaking the bearer imposed by serial computation. Face Recognition is one of the challenging applications on so ever computational environment. The main difficulty of traditional Face Recognition algorithms is lack of the scalability. In this paper Weighted Local Active Pixel Pattern (WLAPP, a new scalable Face Recognition Algorithm suitable for parallel environment is proposed. Local Active Pixel Pattern (LAPP is found to be simple and computational inexpensive compare to Local Binary Patterns (LBP. WLAPP is developed based on concept of LAPP. The experimentation is performed on FG-Net Aging Database with deliberately introduced 20% distortion and the results are encouraging. Keywords — Active pixels, Face Recognition, Local Binary Pattern (LBP, Local Active Pixel Pattern (LAPP, Pattern computing, parallel workers, template, weight computation.

This book presents the proceedings of the 10th International Parallel Tools Workshop, held October 4-5, 2016 in Stuttgart, Germany – a forum to discuss the latest advances in parallel tools. High-performance computing plays an increasingly important role for numerical simulation and modelling in academic and industrial research. At the same time, using large-scale parallel systems efficiently is becoming more difficult. A number of tools addressing parallel program development and analysis have emerged from the high-performance computing community over the last decade, and what may have started as collection of small helper script has now matured to production-grade frameworks. Powerful user interfaces and an extensive body of documentation allow easy usage by non-specialists.

The combinatorial nature of many important mathematical problems, including nondeterministic-polynomial-time (NP)-complete problems, places a severe limitation on the problem size that can be solved with conventional, sequentially operating electronic computers. There have been significant efforts in conceiving parallel-computation approaches in the past, for example: DNA computation, quantum computation, and microfluidics-based computation. However, these approaches have not proven, so far, to be scalable and practical from a fabrication and operational perspective. Here, we report the foundations of an alternative parallel-computation system in which a given combinatorial problem is encoded into a graphical, modular network that is embedded in a nanofabricated planar device. Exploring the network in a parallel fashion using a large number of independent, molecular-motor-propelled agents then solves the mathematical problem. This approach uses orders of magnitude less energy than conventional computers, thus addressing issues related to power consumption and heat dissipation. We provide a proof-of-concept demonstration of such a device by solving, in a parallel fashion, the small instance {2, 5, 9} of the subset sum problem, which is a benchmark NP-complete problem. Finally, we discuss the technical advances necessary to make our system scalable with presently available technology.

Full Text Available The aim of this study is to present an approach to the introduction into pipeline and parallelcomputing, using a model of the multiphase queueing system. Pipeline computing, including software pipelines, is among the key concepts in modern computing and electronics engineering. The modern computer science and engineering education requires a comprehensive curriculum, so the introduction to pipeline and parallelcomputing is the essential topic to be included in the curriculum. At the same time, the topic is among the most motivating tasks due to the comprehensive multidisciplinary and technical requirements. To enhance the educational process, the paper proposes a novel model-centered framework and develops the relevant learning objects. It allows implementing an educational platform of constructivist learning process, thus enabling learners’ experimentation with the provided programming models, obtaining learners’ competences of the modern scientific research and computational thinking, and capturing the relevant technical knowledge. It also provides an integral platform that allows a simultaneous and comparative introduction to pipelining and parallelcomputing. The programming language C for developing programming models and message passing interface (MPI and OpenMP parallelization tools have been chosen for implementation.

This paper describes IBM's approach to parallelcomputing using the IBM ES/3090 computer. Parallel processing concepts were discussed including its advantages, potential performance improvements and limitations. Particular applications and capabilities for the IBM ES/3090 were presented along with preliminary results from some utilities in the application of parallel processing to simulation of system reliability, air pollution models, and power network dynamics

[ABSTRACT]Graphics processing units (GP Us) originally designed for computer video cards have emerged as the most powerful chip in a high-performance workstation. In the high performance computation capabilities, graphic processing units (GPU) lead to much more powerful performance than conventional CPUs by means of parallel processing. In 2007, the birth of Compute Unified Device Architecture (CUDA) and CUDA-enabled GPUs by NVIDIA Corporation brought a revolution in the general purpose GPU a...

Distributed and Cloud Computing, named a 2012 Outstanding Academic Title by the American Library Association's Choice publication, explains how to create high-performance, scalable, reliable systems, exposing the design principles, architecture, and innovative applications of parallel, distributed, and cloud computing systems. Starting with an overview of modern distributed models, the book provides comprehensive coverage of distributed and cloud computing, including: Facilitating management, debugging, migration, and disaster recovery through virtualization Clustered systems for resear

A broad range of mathematical modeling errors of fluid flow physics and numerical approximation errors are addressed in computationalfluid dynamics (CFD). It is strongly believed that if CFD is to have a major impact on the design of engineering hardware and flight systems, the level of confidence in complex simulations must substantially improve. To better understand the present limitations of CFD simulations, a wide variety of physical modeling, discretization, and solution errors are identified and discussed. Here, discretization and solution errors refer to all errors caused by conversion of the original partial differential, or integral, conservation equations representing the physical process, to algebraic equations and their solution on a computer. The impact of boundary conditions on the solution of the partial differential equations and their discrete representation will also be discussed. Throughout the article, clear distinctions are made between the analytical mathematical models of fluid dynamics and the numerical models. Lax`s Equivalence Theorem and its frailties in practical CFD solutions are pointed out. Distinctions are also made between the existence and uniqueness of solutions to the partial differential equations as opposed to the discrete equations. Two techniques are briefly discussed for the detection and quantification of certain types of discretization and grid resolution errors.

A new set of two-fluid equations that are valid from collisional to weakly collisional limits is derived. Starting from gyrokinetic equations in flux coordinates with no zero-order drifts, a set of moment equations describing plasma transport along the field lines of a space- and time-dependent magnetic field is derived. No restriction on the anisotropy of the ion distribution function is imposed. In the highly collisional limit, these equations reduce to those of Braginskii, while in the weakly collisional limit they are similar to the double adiabatic or Chew, Goldberger, and Low (CGL) equations [Proc. R. Soc. London, Ser. A 236, 112 (1956)]. The new set of equations also exhibits a physical singularity at the sound speed. This singularity is used to derive and compute the sound speed. Numerical examples comparing these equations with conventional transport equations show that in the limit where the ratio of the mean free path lambda to the scale length of the magnetic field gradient L/sub B/ approaches zero, there is no significant difference between the solution of the new and conventional transport equations. However, conventional fluid equations, ordinarily expected to be correct to the order (lambda/L/sub B/) 2 , are found to have errors of order (lambda/L/sub u/) 2 = (lambda/L/sub B/) 2 /(1-M 2 ) 2 , where L/sub u/ is the scale length of the flow velocity gradient and M is the Mach number. As such, the conventional equations may contain large errors near the sound speed (Mroughly-equal1)

The divertor particle simulation code named PARASOL simulates open-field plasmas between divertor walls self-consistently by using an electrostatic PIC method and a binary collision Monte Carlo model. The PARASOL parallelized with MPI-1.1 for scalar parallelcomputer worked on Intel Paragon XP/S system. A system SGI Origin 3800 was newly installed (May, 2001). The parallel programming was improved at this switchover. As a result of the high-performance new hardware and this improvement, the PARASOL is speeded up by about 60 times with the same number of processors. (author)

This paper describes the details of the parallel implementation of the SIMPLE algorithm for numerical solution of the Navier-Stokes system of equations on arbitrary unstructured grids. The iteration schemes for the serial and parallel versions of the SIMPLE algorithm are implemented. In the description of the parallel implementation, special attention is paid to computational data exchange among processors under the condition of the grid model decomposition using fictitious cells. We discuss the specific features for the storage of distributed matrices and implementation of vector-matrix operations in parallel mode. It is shown that the proposed way of matrix storage reduces the number of interprocessor exchanges. A series of numerical experiments illustrates the effect of the multigrid SLAE solver tuning on the general efficiency of the algorithm; the tuning involves the types of the cycles used (V, W, and F), the number of iterations of a smoothing operator, and the number of cells for coarsening. Two ways (direct and indirect) of efficiency evaluation for parallelization of the numerical algorithm are demonstrated. The paper presents the results of solving some internal and external flow problems with the evaluation of parallelization efficiency by two algorithms. It is shown that the proposed parallel implementation enables efficient computations for the problems on a thousand processors. Based on the results obtained, some general recommendations are made for the optimal tuning of the multigrid solver, as well as for selecting the optimal number of cells per processor.

The objective of this project was to investigate the complex fracture of ice and understand its role within larger ice sheet simulations and global climate change. This objective was achieved by developing novel physics based models for ice, novel numerical tools to enable the modeling of the physics and by collaboration with the ice community experts. At the present time, ice fracture is not explicitly considered within ice sheet models due in part to large computational costs associated with the accurate modeling of this complex phenomena. However, fracture not only plays an extremely important role in regional behavior but also influences ice dynamics over much larger zones in ways that are currently not well understood. To this end, our research findings through this project offers significant advancement to the field and closes a large gap of knowledge in understanding and modeling the fracture of ice sheets in the polar regions. Thus, we believe that our objective has been achieved and our research accomplishments are significant. This is corroborated through a set of published papers, posters and presentations at technical conferences in the field. In particular significant progress has been made in the mechanics of ice, fracture of ice sheets and ice shelves in polar regions and sophisticated numerical methods that enable the solution of the physics in an efficient way.

Full Text Available The conventional wisdom in the scientific computing community is that the best way to solve large-scale numerically intensive scientific problems on today's parallel MIMD computers is to use Fortran or C programmed in a data-parallel style using low-level message-passing primitives. This approach inevitably leads to nonportable codes and extensive development time, and restricts parallel programming to the domain of the expert programmer. We believe that these problems are not inherent to parallelcomputing but are the result of the programming tools used. We will show that comparable performance can be achieved with little effort if better tools that present higher level abstractions are used. The vehicle for our demonstration is a 2D electromagnetic finite element scattering code we have implemented in Mentat, an object-oriented parallel processing system. We briefly describe the application. Mentat, the implementation, and present performance results for both a Mentat and a hand-coded parallel Fortran version.

A 3D ComputationalFluid Dynamics (CFD) simulation of a Heisenberg Vortex Tube (HVT) is performed to estimate cooling potential with cryogenic hydrogen. The main mechanism driving operation of the vortex tube is the use of fluid power for enthalpy streaming in a highly turbulent swirl in a dual-outlet tube. This enthalpy streaming creates a temperature separation between the outer and inner regions of the flow. Use of a catalyst on the peripheral wall of the centrifuge enables endothermic conversion of para-ortho hydrogen to aid primary cooling. A κ- ɛ turbulence model is used with a cryogenic, non-ideal equation of state, and para-orthohydrogen species evolution. The simulations are validated with experiments and strategies for parametric optimization of this device are presented.

The paper presents the Flow Analysis Software Toolset (FAST) to be used for fluid-mechanics analysis. The design criteria for FAST including the minimization of the data path in the computationalfluid-dynamics (CFD) process, consistent user interface, extensible software architecture, modularization, and the isolation of three-dimensional tasks from the application programmer are outlined. Each separate process communicates through the FAST Hub, while other modules such as FAST Central, NAS file input, CFD calculator, surface extractor and renderer, titler, tracer, and isolev might work together to generate the scene. An interprocess communication package making it possible for FAST to operate as a modular environment where resources could be shared among different machines as well as a single host is discussed. 20 refs