Supporting the hypercube programming model on mesh architectures
(A fast sorter for iWarp tori)
Thomas M. Stricker
Carnegie Mellon University
Pittsburgh, PA 15213-3891
Abstract
Many combinatorial problems have simple solutions for parallel
processing on highly-connected networks such as the butterfly or the
hypercube, whereas the fastest processor-to-processor interconnections
are realized in parallel machines with low-dimensional mesh
or torus topology. This paper presents a method for mapping binary
hypercube algorithms onto lower dimensional meshes and analyzes this
method in a model derived from the architecture of modern mesh
machines. We outline the criteria used to evaluate graph embeddings
for mapping supercomputer communication networks.
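As an illustrative sketch of such an embedding (a standard Gray-code construction, not necessarily the paper's exact mapping), the bits of a hypercube node id can be split into two halves and each half Gray-coded into a torus coordinate, so that hypercube neighbors differing in one bit always land in the same row or column of the torus:

```python
def gray(x):
    # Reflected binary Gray code: consecutive values differ in one bit.
    return x ^ (x >> 1)

def torus_coords(node, bits_per_dim):
    """Map a hypercube node id to (row, col) on a 2-D torus.

    Hypothetical Gray-code embedding for illustration: flipping a bit
    in the low half of the node id changes only the column, flipping a
    bit in the high half changes only the row, so each hypercube
    dimension maps to an exchange within one row or one column.
    """
    mask = (1 << bits_per_dim) - 1
    col = gray(node & mask)          # low half of the id -> column
    row = gray(node >> bits_per_dim) # high half of the id -> row
    return row, col
```

In such a mapping, hypercube neighbors are generally not torus neighbors; the distance between partners within a row or column grows with the dimension, which is why direct support for non-neighbor communication matters.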
Our work was motivated by the need for fast library routines to do
parallel sorting, fast Fourier transformation and processor
synchronization. During the design effort of these building blocks, we
developed and analyzed a new technique to support a hypercube network
embedded onto a two-dimensional torus. A direct implementation of
the embedding is made possible by logical channels and pathways. A
fast merge sorter based on the bitonic network serves as an example to
show how a simple hypercube algorithm can outperform most of the
asymptotically optimal mesh algorithms for practical machine sizes.
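The hypercube structure of Batcher's bitonic network is visible in a minimal sequential sketch (illustrative only, not the paper's iWarp implementation): every compare-exchange pairs elements whose indices differ in exactly one bit, so each stage corresponds to one hypercube dimension, and on a torus to fixed-stride exchanges along rows and columns.

```python
def bitonic_sort(a):
    """Batcher's bitonic merge sort on a power-of-two-length list.

    Each compare-exchange pairs indices i and i ^ j, which differ in a
    single bit; that bit names the hypercube dimension used in the stage.
    """
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:           # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:       # exchange distance = one hypercube dimension
            for i in range(n):
                p = i ^ j   # partner index differs from i in one bit
                if p > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[p]) == ascending:
                        a[i], a[p] = a[p], a[i]
            j //= 2
        k *= 2
    return a
```

For n keys the network performs O(n log^2 n) comparisons; its appeal on a parallel machine is the fixed, data-independent communication pattern rather than the operation count.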
In the conventional mesh computation model, processors are allowed
to exchange one unit of data with a neighbor in each step. This model
needs to be refined since modern mesh computers, such as the iWarp
system, have hardware support for fast non-neighbor communication.
The bitonic merge sort, a simple hypercube algorithm, contains a
fair amount of fine grain parallelism not found in standard mesh
algorithms. This form of parallelism includes pipelined
communication, computation overlapped with communication, use of wide
instruction words and operands directly read from the communication
system through systolic gates.
The measured sorting rate of more than 2 × 10^6 keys/sec on an
iWarp torus with just 64 processors shows the excellent absolute
performance of our approach. The performance results compare
well with much larger parallel computers. In our analysis of the
relative performance we compare our approach to different sorting
methods on meshes. The mapped hypercube algorithm is shown to
be best for a wide range of machine and problem sizes.
To readers mainly interested in complexity results, our approach
may seem somewhat surprising, but the analysis of the algorithm
in an accurate model for the iWarp machine shows how good
speed and good parallel efficiency are obtained from both forms of
parallelism, large and fine grain.
Note: Reprint from the proceedings of the ACM Symposium on Parallel
Algorithms and Architectures, SPAA92 June 29 - July 1, 1992, San
Diego, California.