We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illustrate these principles using current architectures and software systems, and by showing how one would implement matrix multiplication. Then, we present direct and iterative algorithms for solving linear systems of equations, linear least squares problems, the symmetric eigenvalue problem, the nonsymmetric eigenvalue problem, and the singular value decomposition. We consider dense, band and sparse matrices.
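The survey's running example, how one would implement matrix multiplication efficiently, comes down to blocking so that each sub-problem fits in fast memory. The following is our own minimal Python sketch of that idea, not code from the survey:

```python
def blocked_matmul(A, B, bs=2):
    """C = A*B computed block by block. Blocking localizes memory access,
    which is the key principle behind cache-friendly and parallel GEMM."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):          # loop over blocks of C rows
        for jj in range(0, n, bs):      # loop over blocks of C columns
            for kk in range(0, n, bs):  # loop over blocks of the inner dimension
                for i in range(ii, min(ii + bs, n)):
                    for j in range(jj, min(jj + bs, n)):
                        s = 0.0
                        for k in range(kk, min(kk + bs, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s
    return C
```

On a real machine the block size `bs` would be tuned to the cache; here it only illustrates the loop structure.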

The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. Due to the complex hardware organization of advanced computer architectures, the development of optimal level 3 BLAS code is costly and time consuming. However, it is possible to develop a portable and high-performance level 3 BLAS library mainly relying on a highly optimized GEMM, the routine for the general matrix multiply and add operation. With suitable partitioning, all the other level 3 BLAS can be defined in terms of GEMM and a small amount of level 1 and level 2 computations. Our contribution is twofold. First, the model implementations in Fortran 77 of the GEMM-based level 3 BLAS are structured to effectively reduce data traffic in a memory hierarchy. Second, the GEMM-based level 3 BLAS performance evaluation benchmark is a tool for evaluating and comparing different implementations of the level 3 BLAS with the GEMM-based model implementations.
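The GEMM-based idea can be sketched concretely: a lower-triangular solve with multiple right-hand sides (a TRSM-like operation) is partitioned so that small unblocked solves handle the diagonal blocks while the bulk of the arithmetic lands in a single GEMM-style call. This is our own illustrative Python sketch, not the Fortran 77 model implementations:

```python
def gemm(alpha, A, B, beta, C):
    """C := alpha*A*B + beta*C -- the one kernel everything else leans on."""
    m, kdim, ncol = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(ncol):
            s = sum(A[i][p] * B[p][j] for p in range(kdim))
            C[i][j] = alpha * s + beta * C[i][j]

def trsm_lower(L, B, bs=2):
    """Solve L*X = B in place (X overwrites B), L lower triangular.
    Diagonal blocks use a small unblocked forward solve; the trailing
    update is one GEMM call, where most of the flops land."""
    n, ncol = len(L), len(B[0])
    for kk in range(0, n, bs):
        ke = min(kk + bs, n)
        for i in range(kk, ke):          # unblocked solve on the diagonal block
            for j in range(ncol):
                s = B[i][j] - sum(L[i][p] * B[p][j] for p in range(kk, i))
                B[i][j] = s / L[i][i]
        if ke < n:                       # trailing rows: B2 := B2 - L21 * X1
            L21 = [row[kk:ke] for row in L[ke:]]
            X1 = B[kk:ke]
            C2 = B[ke:]                  # row lists are shared, so this updates B in place
            gemm(-1.0, L21, X1, 1.0, C2)
```

Larger `bs` shifts a larger fraction of the work into the GEMM call, which is exactly the partitioning argument the abstract makes.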

... by using parallel versions of the underlying routines. It is also possible to create a level 3 BLAS library based on fast algorithms for the GEMM operation, e.g., Strassen's or Winograd's algorithms [25, 26, 15, 11]. Our contribution is two-fold. First, the model implementations in Fortran 77 of the GEMM-based level 3 BLAS, which are structured to effectively reduce data traffic in a memory hierarchy. Second, the G...

An elementary, machine-independent, recursive algorithm for matrix multiplication C += A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking hand-coded BLAS3 routines. "Proof of concept" is demonstrated by racing the in-place algorithm against the manufacturer's hand-tuned BLAS3 routines; it can win. The recursive code bifurcates naturally at the top level into independent block-oriented processes, each of which writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers, which merrily unroll loops for the superscalar R8000 processor but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter programmers from using this rich class of recursive algorithms.
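A minimal version of such a machine-independent recursion for C += A*B can be sketched as follows (our own Python sketch rather than the paper's code, assuming n is a power of two): each operand is split into quadrants and the eight block products are handled by recursive calls, with a small unrolled base case standing in for the leaf code the paper discusses.

```python
def recursive_mm(A, B, C, n, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0, leaf=2):
    """C[ci:ci+n][cj:cj+n] += A-block * B-block, recursively.
    Implicit blocking: every recursion level halves the working set."""
    if n <= leaf:
        for i in range(n):               # unrolled-in-spirit base case
            for k in range(n):
                a = A[ai + i][aj + k]
                for j in range(n):
                    C[ci + i][cj + j] += a * B[bi + k][bj + j]
        return
    h = n // 2
    # eight quadrant products; comments name the block identity used
    for da, db, dc in (
        ((0, 0), (0, 0), (0, 0)),  # C11 += A11*B11
        ((0, h), (h, 0), (0, 0)),  # C11 += A12*B21
        ((0, 0), (0, h), (0, h)),  # C12 += A11*B12
        ((0, h), (h, h), (0, h)),  # C12 += A12*B22
        ((h, 0), (0, 0), (h, 0)),  # C21 += A21*B11
        ((h, h), (h, 0), (h, 0)),  # C21 += A22*B21
        ((h, 0), (0, h), (h, h)),  # C22 += A21*B12
        ((h, h), (h, h), (h, h)),  # C22 += A22*B22
    ):
        recursive_mm(A, B, C, h,
                     ai + da[0], aj + da[1],
                     bi + db[0], bj + db[1],
                     ci + dc[0], cj + dc[1], leaf)
```

The four products that share a C quadrant are the sequential dependencies; the four quadrant groups are the independent top-level processes the abstract refers to.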

...ne account for almost a four-fold improvement in running times during the course of our development. We have not tested the attractive hybrid composed of Strassen's recurrence and this one; stability [17] and in-place constraints that are already satisfied here would have to be relaxed there. However, for the balanced, dense matrix (Case 0) Strassen's original 18-addition presentation also bifurcates ...

In this paper we report on the development of an efficient and portable implementation of Strassen's matrix multiplication algorithm. Our implementation is designed to be used in place of DGEMM, the Level 3 BLAS matrix multiplication routine. Efficient performance will be obtained for all matrix sizes and shapes, and the additional memory needed for temporary variables has been minimized. Replacing DGEMM with our routine should provide a significant performance gain for large matrices while providing the same performance for small matrices. We measure the performance of our code on the IBM RS/6000, the CRAY YMP C90, and a single processor of the CRAY T3D, and offer comparisons to other codes. Our performance data reconfirm that Strassen's algorithm is practical for realistic-size matrices. The usefulness of our implementation is demonstrated by replacing DGEMM with our routine in a large application code.
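For reference, the seven-multiplication scheme underlying such implementations can be sketched as follows. This is our own Python sketch of textbook Strassen, not the DGEMM-compatible routine described in the paper (which handles arbitrary shapes and minimizes temporaries):

```python
def _add(X, Y): return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]
def _sub(X, Y): return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def _naive(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def strassen(A, B, cutoff=2):
    """Square Strassen multiply; falls back to the classical algorithm
    for small or odd dimensions. 7 recursive products instead of 8."""
    n = len(A)
    if n <= cutoff or n % 2:
        return _naive(A, B)
    h = n // 2
    quad = lambda X, r, c: [row[c*h:(c+1)*h] for row in X[r*h:(r+1)*h]]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, 1), quad(A, 1, 0), quad(A, 1, 1)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, 1), quad(B, 1, 0), quad(B, 1, 1)
    M1 = strassen(_add(A11, A22), _add(B11, B22), cutoff)
    M2 = strassen(_add(A21, A22), B11, cutoff)
    M3 = strassen(A11, _sub(B12, B22), cutoff)
    M4 = strassen(A22, _sub(B21, B11), cutoff)
    M5 = strassen(_add(A11, A12), B22, cutoff)
    M6 = strassen(_sub(A21, A11), _add(B11, B12), cutoff)
    M7 = strassen(_sub(A12, A22), _add(B21, B22), cutoff)
    C11 = _add(_sub(_add(M1, M4), M5), M7)
    C12 = _add(M3, M5)
    C21 = _add(M2, M4)
    C22 = _add(_sub(_add(M1, M3), M2), M6)
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])
```

A production version would recurse only down to a tuned cutoff and then call DGEMM, which is exactly the design the abstract describes.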

Abstract. Strassen's algorithm for fast matrix-matrix multiplication has been implemented for matrices of arbitrary shapes on the CRAY-2 and CRAY Y-MP supercomputers. Several techniques have been used to reduce the scratch space requirement for this algorithm while simultaneously preserving a high level of performance. When the resulting Strassen-based matrix multiply routine is combined with some routines from the new LAPACK library, LU decomposition can be performed with rates significantly higher than those achieved by conventional ...

Matrix-matrix multiplication is normally computed using one of the BLAS or a reinvention of part of the BLAS. Unfortunately, the BLAS were designed with small matrices in mind. When huge, well-conditioned matrices are multiplied together, the BLAS perform like the blahs, even on vector machines. For matrices where the coefficients are well conditioned, Winograd's variant of Strassen's algorithm offers some relief, but is rarely available in a quality form on most computers. We reconsider this method and offer a highly portable solution based on the Level 3 BLAS interface.

Key Words. Level 3 BLAS, matrix multiplication, Winograd's variant of Strassen's algorithm, multilevel algorithms

AMS(MOS) subject classification. Numerical Analysis: Numerical Linear Algebra

1. Preliminaries. Matrix-matrix multiplication is a very basic computer operation. A very clear description of how to do it can be found in many textbooks, e.g., [1]. Suppose we want to multiply two matrices A : M × ...
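Winograd's variant keeps Strassen's seven block multiplications but reduces the block additions from 18 to 15 by reusing intermediate sums. The following is our own hedged Python sketch of the variant, not the paper's Level 3 BLAS implementation:

```python
def winograd_strassen(A, B, cutoff=2):
    """Winograd's variant of Strassen: 7 multiplications, 15 additions
    per level. Falls back to the classical algorithm for small/odd n."""
    n = len(A)
    if n <= cutoff or n % 2:
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    h = n // 2
    add = lambda X, Y: [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]
    sub = lambda X, Y: [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]
    quad = lambda X, r, c: [row[c*h:(c+1)*h] for row in X[r*h:(r+1)*h]]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, 1), quad(A, 1, 0), quad(A, 1, 1)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, 1), quad(B, 1, 0), quad(B, 1, 1)
    # 8 pre-additions
    S1 = add(A21, A22); S2 = sub(S1, A11); S3 = sub(A11, A21); S4 = sub(A12, S2)
    T1 = sub(B12, B11); T2 = sub(B22, T1); T3 = sub(B22, B12); T4 = sub(T2, B21)
    # 7 recursive multiplications
    M1 = winograd_strassen(A11, B11, cutoff)
    M2 = winograd_strassen(A12, B21, cutoff)
    M3 = winograd_strassen(S4, B22, cutoff)
    M4 = winograd_strassen(A22, T4, cutoff)
    M5 = winograd_strassen(S1, T1, cutoff)
    M6 = winograd_strassen(S2, T2, cutoff)
    M7 = winograd_strassen(S3, T3, cutoff)
    # 7 post-additions
    U2 = add(M1, M6); U3 = add(U2, M7); U4 = add(U2, M5)
    C11 = add(M1, M2); C12 = add(U4, M3); C21 = sub(U3, M4); C22 = add(U3, M5)
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])
```

The shared sums S2/T2 and the chained U terms are where the three additions are saved relative to plain Strassen.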

In [23] we showed that a large class of fast recursive matrix multiplication algorithms is stable in a normwise sense, and that in fact if multiplication of n-by-n matrices can be done by any algorithm in O(n^{ω+η}) operations for any η > 0, then it can be done stably in O(n^{ω+η}) operations for any η > 0. Here we extend this result to show that essentially all standard linear algebra operations, including LU decomposition, QR decomposition, linear equation solving, matrix inversion, solving least squares problems, (generalized) eigenvalue problems and the singular value decomposition, can also be done stably (in a normwise sense) in O(n^{ω+η}) operations.

...epend on n, using the equivalence of norms on a finite-dimensional space, thereby changing the constant c slightly. The bound (1) was first obtained for Strassen's O(n^{2.81}) algorithm [46] by Brent ([12, 33], [34, chap. 23]) and extended by Bini and Lotti [6] to a larger class of algorithms. In prior work [23] we showed that such a bound holds for a new class of fast algorithms depending on group-theoret...
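For orientation, the normwise bound (1) referred to here has the following general shape (our paraphrase of the standard statement; u is the unit roundoff and c_n a modest polynomial in n whose degree depends on the algorithm):

```latex
\| \mathrm{fl}(AB) - AB \| \;\le\; c_n\, u\, \|A\|\, \|B\| + O(u^2)
```

For the classical algorithm c_n can be taken O(n); for Strassen-type recursions it grows as a larger, but still polynomial, power of n, which is what "stable in a normwise sense" means throughout this discussion.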

In this paper, we present a program generation strategy for Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated to high-performance parallel/vector codes for various architectures. In this paper, we present a non-recursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n × 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implem...
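The tensor-product view can be sketched as follows (our notation, following the standard bilinear formulation rather than the paper's exact formulas). Strassen's 2×2 step is a bilinear algorithm, and unrolling n recursion levels with Kronecker products gives a closed non-recursive form:

```latex
\mathrm{vec}(C) = W\bigl((U\,\mathrm{vec}(A)) \odot (V\,\mathrm{vec}(B))\bigr),
\qquad U, V \in \mathbb{R}^{7\times 4},\; W \in \mathbb{R}^{4\times 7},
\\[4pt]
\mathrm{vec}(C) = W^{\otimes n}\bigl((U^{\otimes n}\,\mathrm{vec}(A)) \odot (V^{\otimes n}\,\mathrm{vec}(B))\bigr)
\quad \text{for } 2^n \times 2^n \text{ matrices,}
```

where ⊙ is the elementwise product. The intermediate vector of 7^n pointwise products is the source of the O(7^n) working storage; evaluating that product blockwise is the kind of reformulation that brings the requirement down to O(4^n).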

...al matrix multiplication. Efficient parallel implementations of this algorithm have been described in [1, 10]. This algorithm has been used for fast matrix multiplication in implementing level 3 BLAS [9] and linear algebra routines [2]. In this paper, we describe the tensor product formulation of Strassen's matrix multiplication algorithm, and discuss program generation for shared memory vector proce...

Inversion of a triangular matrix can be accomplished in several ways. The standard methods are characterised by the loop ordering, whether matrix-vector multiplication, solution of a triangular system, or a rank-1 update is done inside the outer loop, and whether the method is blocked or unblocked. The numerical stability properties of these methods are investigated. It is shown that unblocked methods satisfy pleasing bounds on the left or right residual. However, for one of the block methods it is necessary to convert a matrix multiplication into the solution of a multiple right-hand side triangular system in order to have an acceptable residual bound. The inversion of a full matrix given a factorization PA = LU is also considered, including the special cases of symmetric indefinite and symmetric positive definite matrices. Three popular methods are shown to possess satisfactory residual bounds, subject to a certain requirement on the implementation, and an attractive new method is described. This work was motivated by the question of what inversion methods should be used in LAPACK.
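As a concrete instance of the unblocked methods discussed, here is a column-oriented inversion of a lower-triangular matrix (our own Python sketch, not one of LAPACK's routines): column j of the inverse is obtained by solving L x = e_j with forward substitution.

```python
def invert_lower(L):
    """Unblocked, column-oriented inversion of a lower-triangular L.
    Column j of the result X solves L x = e_j by forward substitution,
    so X satisfies a small residual L*X - I column by column."""
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for j in range(n):
        X[j][j] = 1.0 / L[j][j]
        for i in range(j + 1, n):
            s = sum(L[i][k] * X[k][j] for k in range(j, i))
            X[i][j] = -s / L[i][i]
    return X
```

Swapping the loop ordering, or blocking the computation, yields the other standard variants whose residual bounds the paper compares.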