PSTL Parallel Algorithms

As in the STL, the PSTL algorithms are generic -- the arguments to
the algorithms are not the containers themselves, but rather
iterators which access container elements. This approach has
several advantages. First, the same algorithm can be used for any
container which utilizes an iterator of sufficient capability.
So, for example, an algorithm which applies a function to each
element can be called for any container which provides an iterator
capable of the increment ( ++) operation. Second,
algorithms can be easily applied to subranges of elements by
passing as arguments to the algorithm the iterators which mark the
beginning and one past the end of the subrange. In addition,
through the use of user-defined iterator adaptors, subgroups of
elements, such as odd- or even- indexed elements or elements with
values above a certain threshold, can be accessed.

There are three types of parallel algorithms in the PSTL:

STL algorithms with parallel semantics,

par_ versions of STL algorithms, and

par_ algorithms for standard parallel operations.

The algorithms in the first group retain their STL names, but allow
pariterator arguments in place of iterators. These new
algorithms are collective and have parallel semantics.

The algorithms in the second group are also versions of the STL
algorithms, but par_ will be prepended to their names. When
invoked with parallel iterators, these algorithms are semantically
equivalent to the first type of algorithm (and are collective). When
invoked with local iterators, these algorithms are not collective.
Execution is local to a particular context, but has parallel semantics (so
a loop may be parallelized, for example).

The differentiation between the versions of these algorithms is
accomplished using iterator tags. The body of each algorithm consists of
a call to another function with an added argument:
iterator_category(first). The local or parallel iterator version of
the function is called, depending on the iterator tag. Note that this
differentiation is done at compile time using function templates.

In this example, let us assume that 3 contexts are involved. We first
create a distributed vector with 100 elements of type double.
Since the default distribution is used (no ContainerRatio object
is specified in the constructor), 33 elements are stored in the first two
contexts and 34 are stored in the last. Each element is initialized to
5.0.

The algorithm sums the elements in the iteration space [first, last)
using initial as the initial value.

The first call to accumulate is a non-collective operation and
results in the invocation of the STL version of accumulate in the
calling context. While it does not need to be invoked in each context, it
is in this case. Each invocation is, however, completely independent of
the other contexts. In contexts 0 and 1, the value of sum1 is
165.0 (5 * 33). In context 2, the value is 170.0 because there is one
extra element.

The second accumulate call is differentiated from the first by the
use of parallel iterators. It is a collective operation and must be
called in all contexts. The local elements are summed in each context and
then a global sum is computed using a reduction operation. In all
contexts, sum2 is 500.0.

Because it is invoked with local iterators, the first call to
par_accumulate is a non-collective operation. Unlike
accumulate, however, par_accumulate with local iterators uses
single-context parallelism, if available, to compute the local sum.
Lightweight threads are used in a context to accumulate partial sums of
the local elements; these are summed to get the total for the local
context. The value of sum3 in each context corresponds to the
value of sum1; they differ only in the processing involved.
Finally, the second call to par_accumulate is equivalent to the
second call to accumulate since behavior of the algorithms is the
same when they are called with parallel iterators.

The third group of algorithms consists of special par_ algorithms
such as par_apply, par_scan, and par_reduce. These are
collective operations with parallel semantics. The following have been
defined thus far:

par_apply: Applies a function object
pointwise to the elements of a set of containers.

par_reduction: Applies a function object pointwise to the elements of
a set of containers and then does a reduction on an associative binary
operator.

par_scan: Applies a function object pointwise to
the elements of a set of containers and then does a parallel prefix
operation using an associative binary operation.

As is the case for all PSTL algorithms, the main arguments to these
algorithms are iterators which access container elements. When
invoked with local iterators, the algorithms are non-collective and
use single-context parallelism. When invoked with parallel iterators,
the operations are collective and have parallel semantics across
contexts. The function object may change only the value of the
elements in the space defined by the first two iterators in the
argument list. Changes made to elements accessed via any other
parallel iterator argument are lost since the function is applied to a
temporary local copy of those elements.

Figure 1: Example of par_apply usage

For example, the declaration of par_apply for a binary function object
is:

Here, the first iteration space is defined by [begin1, end1).
If this iteration space contains k elements, then the second
iteration space consists of [begin2, begin2 + k). In applying
the binary function, each element in the first iteration space is
paired with the corresponding element in the second iteration space.

Figure 1 shows the two vectors, with the elements from the iteration
spaces shaded in gray. The processing within par_apply in each
context is as follows:

The iteration space
[x.parbegin(),x.parbegin()+10) is restricted to its local elements.
As shown, in context 0, seven elements are local and in context 1, three
elements are local.

The portion of
[y.parbegin(),y.parbegin()+10) which corresponds to these local
elements is identified. In context 0, this begins at
y.parbegin(), while in context 1 it begins at y.parbegin()+7.

The operator() method of add_to is applied to each
pair of elements in the iteration spaces identified in steps 1 and 2.
Paired elements are connected by arrows in the figure. Note that all of
the elements in the first iteration space are local, while in the second
space elements may be local or remote. In the case of remote elements, a
remote fetch is done to get a local copy for the operation. Step 3 is
done in parallel if single-context parallelism is supported.