PARALLELISM-TUTORIAL

This documentation topic relates to the experimental extension of ACL2
supporting parallel execution and proof; see parallelism.

In this topic we introduce the ACL2 parallelism primitives using the example
of a doubly-recursive Fibonacci function, whose basic definition is as
follows. See parallelism for a very high-level summary of the parallelism
capability described here, and see compiling-acl2p for how to build an
executable image that supports parallel execution. Here, we assume that
such an executable is being used.

A simple way to introduce parallelism into this function definition is to
wrap the addition expression with a call of pargs, and the arguments to
the addition will be computed in parallel whenever resources are available.
As a result, we end up with a very similar and thus intuitive function
definition. Note that we replaced + by binary-+, since
pargs expects a function call, not a macro call.

After you submit the above two versions of the Fibonacci function, test them
with the following forms.

(time$ (fib 10))
(time$ (pfib 10))

Now increase the argument by increments of 5 to 10 until you find your
curiosity satisfied or your patience wearing thin. You can interrupt
evaluation if necessary and return to the ACL2 loop. You will immediately
notice that you have not increased execution speed, at least not by much, by
introducing parallelism.

First, consider the computation of (pfib 4). Assuming resources are
available, (pfib 4) will create a thread for computing (pfib 3) and
another thread for (pfib 2). It is easy to imagine that setting up each
thread takes much longer than the entire computation of (fib 4).

Second, we must realize that if we have two threads available for computing
(fib 10), then the evaluation of (fib 8) will probably complete
before the evaluation of (fib 9). Once (fib 8) finishes, parallelism
resources will become available for the next recursive call made on behalf of
(fib 9). If for example that call is (fib 3), we will waste a lot of
cycles just handing work to the thread that does this relatively small
computation. We need a way to ensure that parallelism resources are only
used on problems of a "large" size. Ensuring that only "large" problems
are spawned is called the ``granularity problem.''

In summary: We want to tell ACL2 that it can evaluate the arguments of
pargs in parallel only when the parameter of pfib is greater
than some threshold. Our tests using CCL have suggested that 27 is a
reasonable threshold.

Explicit Programming for the Granularity Problem

One way to avoid the granularity problem is to duplicate code as follows.

Duplicating code is fundamentally a bad design principle, because it can
double the work for future maintenance. A ``granularity form'' is an
expression

(declare (granularity <form>))

that can allow threads to be spawned (without duplicating code) whenever the
evaluation of <form> results in a non-nil value. It may be placed
inside any call of a parallelism primitive, in a position documentated
separately for each primitive. Here is a definition of pfib using this
feature for a call of the parallelism primitive pargs.

Test each version as follows (or substitute your own natural number for 33).

(time$ (fib 33))
(time$ (pfib 33))

Another Granularity Issue Related to Thread Limitations

Our implementation of parallelism primitives has the property that once a
thread is assigned a computation, that assignment stays in effect until the
computation is complete. In particular, if a thread encounters a parallelism
primitive that spawns child threads, the parent thread stays assigned,
waiting until the child computations complete before it can continue its own
computation. In the meantime, the parent thread reduces the number of
additional threads that Lisp can provide by 1, rather than being reassigned
to do other work.

How can this lack of reassignment affect the user? Consider, for example,
the application of a recursive function to a long list. Imagine that this
function is written so that the body contains a recursive call, for example
as (pargs (process (car x)) (recur (cdr x))). Each such pargs call
that spawns child work must wait for its children, one of which is the work
of evaluating (recur (cdr x)), to complete. There is an ACL2 limit on
how much pending work can be in the system, limiting the number of
pargs calls that can result in parallel execution. For example, if
the ACL2 limit is k and each call of pargs actually spawns threads for
evaluating its arguments, then
after k cdrs there will be no further parallel execution.

Possible solutions may include reworking of algorithms (for example to be
tree-based rather than list-based) or using appropriate granularity forms.
We hope that future implementations will allow thread ``re-deployment'' in
order to mitigate this problem. This problem applies to plet,
pargs, pand, por, and spec-mv-let.

Consider ``genetic trees'' that contains leaves of DNA elements, in the sense
that each tip is one of the symbols A, G, C, or T. First we
define the function valid-tip which recognizes whether a tip contains one
of these symbols.

Before we can time the functions, we need to create test trees. We have
found that test trees need to be approximately 1 gigabyte before we clearly
see speedup, and we make them asymmetric to demonstrate the ability of our
implementation to adapt to asymmetric data. We can create the trees with the
execution of the following forms.

The serial version took awhile, while the parallel version finished almost
immediately. The test for validation was split into testing the car
and the cdr of the *invalid-tree* root, and since the cdr was
equal to nil, its test returned immediately. This immediate return
quickly interrupted the computation associated with the car, and returned
the result.

We can test this form by executing our previous forms. You will probably
find some speedup on a machine with several cores available, but more speedup
can likely be obtained with an expensive test on the leaves in place of
valid-tip.

Por can be used in the same way as pand, but with early
termination occurring when an argument evaluates to a non-nil value, in
which case the value returned is t. Finally, por also supports the
use of a granularity form.