
Improving Our Parallel Intrinsic

I mentioned in my previous post that we are using a single primitive
parallel operation to implement PJs. It turns out that I am not very
satisfied with what we currently have and I have been thinking about
some alternatives. In this post I’ll describe briefly how things
are set up, what problems I see, and then sketch out how I think we
could improve it.

How things work now: %ParallelBuildArray()

The current intrinsic is %ParallelBuildArray(length, func, args...). It
attempts to construct an array in parallel using a pool of N worker
threads. Conceptually, %ParallelBuildArray() allocates an array
result of length length and then instructs each worker thread to
invoke func(result, id, N, warmup, ...args), where:

result is the result array that is being constructed in parallel;

id is the id of the worker thread, ranging from 0 to N-1;

N is the total number of worker threads;

warmup is a special flag that is false for the real execution and
true during a special pre-compilation step. I will say more on
warmup later.

...args means that any extra arguments passed to %ParallelBuildArray()
are passed on to the function. This is occasionally useful.

%ParallelBuildArray() is not guaranteed to succeed: the function
func might not be executable in parallel, for example. Therefore,
on success, it returns the new array, but on failure it returns
undefined.

Here is an example of how %ParallelBuildArray() can be used to
implement a version of map() on standard JS arrays (not
ParallelArray) that potentially executes in parallel:

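The snippet might look something like the following sketch (the fill() kernel and the sequential fallback shown here are illustrative, not the exact code):

```js
function parallelMap(func) {
  var self = this;

  // Try to build the result in parallel; on failure,
  // %ParallelBuildArray() returns undefined.
  var result = %ParallelBuildArray(self.length, fill, func);
  if (result)
    return result;

  // Sequential fallback.
  result = [];
  for (var i = 0; i < self.length; i++)
    result[i] = func(self[i]);
  return result;

  // The kernel: each worker fills in its own tile of the result.
  function fill(result, id, n, warmup, func) {
    var [start, end] = ComputeTileBounds(self.length, id, n);
    if (warmup) // during warmup, run only a few iterations
      end = Math.min(end, start + 3);
    for (var i = start; i < end; i++)
      result[i] = func(self[i]);
  }
}
```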
I think this code should be fairly self-explanatory. One thing to note
is that ComputeTileBounds(length, id, N) is a helper that just
evenly divides the total length among N workers and returns the
id’th portion. So let’s look a bit closer now at how
%ParallelBuildArray() works.
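Concretely, ComputeTileBounds() might be implemented along these lines (the even-chunking scheme shown is an assumption):

```javascript
// Divide `length` items evenly among `n` workers and return the
// half-open range [start, end) belonging to worker `id`.
function ComputeTileBounds(length, id, n) {
  var perWorker = Math.ceil(length / n);
  var start = Math.min(id * perWorker, length);
  var end = Math.min(start + perWorker, length);
  return [start, end];
}
```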

Compilation for parallel execution. Now, normally when you have a
function call in SpiderMonkey, you begin by interpreting the function.
If the function is hot, it gets JIT compiled, first with JaegerMonkey
and then with IonMonkey. However, to invoke func in parallel, we
cannot take this same path. This is because SpiderMonkey—like all
JS engines—is not thread-safe, so we can’t use the interpreter and
so forth. Instead what we do is to skip right to the final
stage—IonMonkey compilation (hereafter just IM). So basically
%ParallelBuildArray() tries to compile func using IM in a special
mode, called parallel execution mode.

When a function is compiled for parallel execution, it is restricted
to a subset of JS. This subset contains only operations that we have
implemented in a thread-safe way. It will grow over time. Sometimes,
the function may be allowed to compile even if it looks like it might
do unsafe things: in those cases, the unsafe things will result in a
bailout dynamically, which is basically like throwing a special
exception that says “this parallel code tried to do something that is
not (yet) allowed in parallel execution!”

However, skipping the interpreter and JaegerMonkey and going straight
to IM compilation is a bit difficult. The problem is that
SpiderMonkey is designed to gather type information and other data
during the interpreter phases. This data is then used by IM to
generate more efficient code. Generating efficient code is very
important to parallel execution because, as I explained in
an earlier post, efficient code is generally thread-safe code.

Enter the warmup phase. To handle this situation,
%ParallelBuildArray() does something called a warmup. Basically, it
invokes func() in just the same manner as it would normally do,
except (1) sequentially and (2) with the warmup flag set to true.
The warmup flag is used by the function to do only a few iterations,
just enough to gather some data but not to do any real work. During
warmup, the interpreter is in a special mode that records what
functions get called (more on that later).

Once warmup is complete, we have enough type information to compile
func(). However, we also need to compile any functions that
func() might invoke during its execution! The reason is that we
can’t (at least not currently) do compilation during the parallel
execution, as it might trigger GC, affect the type inferencing
results, and do all kinds of non-thread-safe things. So we want to
get those compilations done up front. This is where that list of
functions we gathered during warmup comes into play: we also make sure
that all of those are compiled for parallel execution at the same
time. Since warmup only executes a few iterations, it’s of course
possible that func() will run code or invoke functions at runtime
that we haven’t seen yet. In that case, we bail out to sequential
execution dynamically.

Thread safety. There is one other wrinkle. Normally, code
executing in parallel is not permitted to write to shared objects.
This is enforced by the ion compiler, which only accepts writes that
it can see are legal. However, if you recall our fill() function,
you can see that it includes a write to the (shared) result array:

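The relevant part is the assignment into result in the kernel (a sketch):

```js
function fill(result, id, n, warmup, func) {
  var [start, end] = ComputeTileBounds(self.length, id, n);
  for (var i = start; i < end; i++)
    result[i] = func(self[i]); // a write to the shared result array
}
```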
This is safe because each thread has a distinct id and hence each
thread will write to disjoint portions of the result array. But how
does the IM compiler know this? The answer, of course, is that it does
not. Instead, there is a hack that says: “the function passed to
%ParallelBuildArray() is permitted to write to its first argument”.
Essentially that function must be a trusted function.

What’s wrong with this approach?

I have three problems with this approach:

%ParallelBuildArray() is too specialized: it always creates a single
new array. Sometimes I want to process an existing array, or create
multiple arrays.

The “special case the first argument” approach to thread safety is
inelegant. %ParallelBuildArray() can only be applied to trusted
functions.

The warmup phase is not sound if the user function has side-effects,
and this is hard to fix.

Let’s look at the first two points first, as they are well understood
and the solution is basically agreed upon within the group. Our plan
to fix them is to change %ParallelBuildArray() into something like
%ForkJoin(func, args...), which no longer creates any result arrays
but instead just invokes func(id, N, ...args) in parallel from each
worker. This is basically the same as %ParallelBuildArray(), except
that it does not create an array. We would then have a second
intrinsic, %UnsafeSetElement(), that permits unsafe writes into an
array. Therefore, our map code would look something like:

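A sketch of that version (the initialization loop and the kernel shape are illustrative):

```js
function parallelMap(func) {
  var self = this;

  // Preallocate and initialize the buffer: the workers will write
  // into it without coordination, so it cannot grow dynamically,
  // and the GC must never see uninitialized memory.
  var buffer = [];
  for (var i = 0; i < self.length; i++)
    buffer[i] = null;

  // Try to fill in parallel; on failure, fall back to running the
  // same kernel sequentially as a single "worker".
  if (!%ForkJoin(fill))
    fill(0, 1);
  return buffer;

  function fill(id, n) {
    var [start, end] = ComputeTileBounds(self.length, id, n);
    for (var i = start; i < end; i++)
      %UnsafeSetElement(buffer, i, func(self[i])); // unchecked write
  }
}
```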
As you can see, we first create the buffer and then try to fill it,
either sequentially or in parallel. One important point is that we
must ensure that buffer is preallocated and initialized.
This is because the various threads in the parallel case will be
writing into it at random locations without coordination, so they
can’t be growing the buffer dynamically. We also must be prepared for
GC, which means that the memory must be initialized. For now, I’ve
just written an explicit for loop in JavaScript to grow and initialize
the array, but there are other, more efficient options. I’ll discuss
this matter specifically in a later post.

In any case, moving to %ForkJoin() addresses the first two points
pretty well. The workers can write to any number of arrays as output.
There is no special-casing about which arguments are mutable. It
would be fine to pass any old function to %ForkJoin()—unsafe
actions are only possible by using the %UnsafeSetElement() intrinsic,
which is only permitted in self-hosted code.

So, what about warmup?

The third problem I mentioned is that warmup is not sound if the user
function has side-effects. The reason is that both
%ParallelBuildArray() (what we have today) and %ForkJoin() (what I
just proposed) are supposed to basically have no effect at all if they
fail. But if they ran some warmup iterations, those iterations may
have modified data structures, and that’d be bad. Now, the current
strawman for Parallel JavaScript, at least, specifies that the
user-provided functions should not be mutating global state, but
it’s up to the engine to enforce that of course. And
I am not sure that’s the right thing, in any case!

Another thing I don’t like about the current approach to warmup is
that it only runs a small subset of the iterations and this subset
doesn’t grow. So if there is some behavior that is not captured by the
warmup, it never will be. Moreover, the work done in warmup is thrown
away.

So I was thinking: what if we just did away with the warmup phase
altogether? Instead, we write the code that invokes %ForkJoin() in
such a way that if %ForkJoin() fails, the sequential fallback is
itself the warmup. That means that the final version of our parallel
map would look something like this:

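Something like this sketch (the chunk size and the exact control flow are my guesses at the idea):

```js
function parallelMap(func) {
  var self = this;
  var CHUNK = 1024; // sequential work per failed parallel attempt

  var buffer = [];
  for (var i = 0; i < self.length; i++)
    buffer[i] = null; // preallocate and initialize

  var done = 0;
  while (done < self.length) {
    if (%ForkJoin(fill))
      return buffer; // parallel execution succeeded

    // Parallel execution failed: do a little more work
    // sequentially, through the same fill() kernel so that it
    // serves as the warmup, then try the parallel path again.
    var limit = Math.min(done + CHUNK, self.length);
    fill(0, 1, done, limit);
    done = limit;
  }
  return buffer;

  function fill(id, n, start, end) {
    if (start === undefined)
      [start, end] = ComputeTileBounds(self.length, id, n);
    for (var i = start; i < end; i++)
      %UnsafeSetElement(buffer, i, func(self[i]));
  }
}
```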
In this version of the code, we try to run in parallel, but if that
fails we do a “little more work” sequentially and then try in parallel
again. This means that the sequential loop effectively serves as the
warmup. Note that it is important for the sequential loop to invoke
the same fill() function that is used by the parallel code or else
this technique won’t work.

You may recall that part of the purpose of the warmup was to keep a
list of those functions that get called during execution so that
they can be compiled for parallel execution. My current plan for how to
address this is to take advantage of the monitoring of callsites that
our type inferencer already does. Basically we can come up with a
conservative approximation of the functions that get called by walking
the bytecode and ensure that those are compiled. If during execution
we hit functions that are not compiled, %ForkJoin() can compile
them in parallel execution mode and then try again.

Future enhancements

Off the top of my head I can think of some easy enhancements. There
are likely more.

Avoiding initialization. It would be nice if we could avoid
initializing the full array, which is after all an O(n) sequential
cost. I have a vague plan for this wherein we would allocate the
array to its full, final capacity but not initialize the data
(SpiderMonkey already distinguishes between the amount of memory
allocated and the amount of memory initialized). The %ForkJoin()
call would then always be writing into the uninitialized portion; if
it succeeds, it would simply advance the “initialized” counter to the
full capacity. That requires some care to ensure that it plays well
with the GC, but doesn’t seem terribly hard.

Detecting things that cannot be parallelized. One thing which
would be useful is to add dynamic monitoring to detect functions that
repeatedly fail to execute in parallel, perhaps because they are
inherently non-thread-safe. This is similar to the monitoring that
already takes place to find functions that are not suitable for Ion
compilation, as Ion doesn’t support the full set of JavaScript either.
In that case, we could rewrite the loop to look something like:
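For instance (a sketch; %ParallelizationDisabled() is a hypothetical intrinsic standing in for that monitoring):

```js
  var done = 0;
  while (done < self.length) {
    // Give up on the parallel path entirely if fill() has
    // repeatedly failed to execute in parallel.
    if (%ParallelizationDisabled(fill))
      break;
    if (%ForkJoin(fill))
      return buffer;

    // Otherwise, warm up a little more and try again.
    var limit = Math.min(done + CHUNK, self.length);
    fill(0, 1, done, limit);
    done = limit;
  }

  // Finish whatever remains sequentially.
  fill(0, 1, done, self.length);
  return buffer;
```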