This small patch changes the balancing computation to work in
parallel, if possible.

While the normal linking is against the single-threaded runtime, if
the code is linked against the multi-threaded one, the balancing will
get a significant speedup (80% efficiency at 4 cores, 60% at 12 cores,
and with GC tweaks it can reach 70%+).

On the single-threaded runtime, due to the fact that we only use the
weak head normal form, it doesn't introduce any extra penalties,
neither in space nor in CPU time (or if there are, they are too small
to detect easily).