Monday, March 5, 2012

A while ago I was wrapping my head around parallel merge sort. Since it requires O(n) additional space, in practice it is the best choice when available memory is not constrained. Otherwise it is better to consider quick sort. As in merge sort, in quick sort we have two phases:

rearrange elements into two partitions such that the left one contains elements less than or equal to the selected pivot element and the right one contains elements greater than or equal to the pivot

recursively sort the (independent) partitions

The second phase is naturally parallelized using task parallelism since the partitions are independent (partition elements remain inside partition boundaries once the sort of the whole array is finished). You can find an example of this behavior in parallel quick sort. It is a good start. But the first phase still contributes O(n) work at each recursion level. By parallelizing the partition phase we can speed up quick sort further.
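To make the starting point concrete, here is a minimal sketch of phase-two parallelism (Java is used for illustration; the class name, cutoff value and pivot choice are assumptions of the sketch, not taken from the article): the partition step is still sequential, but the two resulting partitions are sorted as independent fork/join tasks.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Phase-two parallelism only: the partition step is sequential, but the
// two resulting partitions are sorted as independent fork/join tasks.
class ParallelQuickSort extends RecursiveAction {
    // Below this size it is cheaper to sort sequentially (the value is
    // an assumption; tune for the target machine).
    private static final int SEQUENTIAL_CUTOFF = 1 << 12;
    private final int[] a;
    private final int lo, hi; // inclusive bounds of the range to sort

    ParallelQuickSort(int[] a, int lo, int hi) {
        this.a = a; this.lo = lo; this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo < SEQUENTIAL_CUTOFF) {
            Arrays.sort(a, lo, hi + 1);
            return;
        }
        int p = partition(a, lo, hi);
        // The partitions are independent, so sort them in parallel.
        invokeAll(new ParallelQuickSort(a, lo, p),
                  new ParallelQuickSort(a, p + 1, hi));
    }

    // Sequential Hoare partition: afterwards a[lo..p] holds elements
    // not greater and a[p+1..hi] elements not less than the pivot.
    static int partition(int[] a, int lo, int hi) {
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo - 1, j = hi + 1;
        while (true) {
            do { i++; } while (a[i] < pivot);
            do { j--; } while (a[j] > pivot);
            if (i >= j) return j;
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }

    static void sort(int[] a) {
        if (a.length > 1)
            ForkJoinPool.commonPool().invoke(new ParallelQuickSort(a, 0, a.length - 1));
    }
}
```

With this scheme the O(n) partition pass at each recursion level remains a sequential bottleneck, which is exactly what the rest of the post addresses.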

An interesting point is that we do not know in advance how the partitioning will turn out since it is data dependent (the position of an element depends on the other elements). Still, independent pieces can be carved out. Here is the core idea.

Let’s assume an array that looks like the one below, where x denotes some element, p the selected pivot, e an element equal to the pivot, l an element less than the pivot and g an element greater than the pivot.

e l l (g l g e e l) x x x x x x x x x (l l g g e l) g e g p

left right

Let’s assume we selected two blocks of elements within the array, called left (containing elements g l g e e l) and right (holding elements l l g g e l), such that all elements before the left block are less than or equal to the pivot and all elements after the right block are greater than or equal to the pivot. Once the partitioning against the pivot is done, the left block must hold only elements less than or equal to the pivot and the right block only elements greater than or equal to it. In our example the left block contains two g elements that do not belong there and the right block holds three l elements that must not be there. But this means we can swap two l elements from the right block with the two g elements from the left block, and the left block will then comply with the partitioning against the pivot.

e l l (l l l e e l) x x x x x x x x x (g g g g e l) g e g p

left right

Overall, after a pair of blocks is rearranged, at least one of them contains only correct elements (exactly one if the numbers of misplaced elements in the two blocks are not equal, both otherwise).
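The rearrangement of a pair of blocks can be demonstrated with a short sketch (Java is used for illustration; encoding l, e and g as the values 3, 5 and 7 with pivot 5 is an assumption made purely for the demo):

```java
// Rearranges a pair of blocks against the pivot: scans the left block
// for elements greater than the pivot, the right block for elements
// less than the pivot, and swaps such pairs until one block is done.
class BlockPair {
    static void rearrange(int[] left, int[] right, int pivot) {
        int i = 0, j = 0;
        while (true) {
            while (i < left.length && left[i] <= pivot) i++;   // skip l and e
            while (j < right.length && right[j] >= pivot) j++; // skip g and e
            if (i == left.length || j == right.length) break;  // a block is done
            int t = left[i]; left[i] = right[j]; right[j] = t;
        }
    }
}
```

Running it on the blocks from the example, left = {7, 3, 7, 5, 5, 3} (g l g e e l) and right = {3, 3, 7, 7, 5, 3} (l l g g e l), yields left = {3, 3, 3, 5, 5, 3} and right = {7, 7, 7, 7, 5, 3}, matching the picture above.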

Then we can select the next left block and try to do the same. Repeat until the blocks meet, piece by piece making the left and right parts of the array comply with the partitioning. The sequential block-based algorithm looks like this:

select a block size and pick one block from each of the left and right ends of the array

rearrange elements of the two blocks

pick the next block from the same end if all elements of the current block are in place (as the partitioning wants them to be)

repeat until all blocks are processed

sequentially partition the remaining block, if one exists (from each pair of blocks at most one block may remain not fully rearranged)
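The steps above can be sketched as follows (Java for illustration; the class and method names are hypothetical, and for simplicity the sketch assumes the array length is a multiple of the block size):

```java
// Sequential block-based partition: repeatedly rearranges the current
// pair of blocks taken from the two ends, advancing past whichever
// block(s) end up fully in place, then partitions the leftover block.
class BlockPartitioner {
    // Returns the split index s such that a[0..s) <= pivot <= a[s..n).
    static int blockPartition(int[] a, int pivot, int blockSize) {
        int L = 0;                      // start of current left block
        int R = a.length - blockSize;   // start of current right block
        while (L < R) {
            int i = 0, j = 0;
            while (true) {
                while (i < blockSize && a[L + i] <= pivot) i++;
                while (j < blockSize && a[R + j] >= pivot) j++;
                if (i == blockSize || j == blockSize) break;
                int t = a[L + i]; a[L + i] = a[R + j]; a[R + j] = t;
            }
            if (i == blockSize) L += blockSize; // left block is in place
            if (j == blockSize) R -= blockSize; // right block is in place
        }
        // At most one block remains between L and R + blockSize;
        // partition it sequentially to find the final split point.
        int lo = L, hi = R + blockSize - 1;
        while (lo <= hi) {
            while (lo <= hi && a[lo] <= pivot) lo++;
            while (lo <= hi && a[hi] >= pivot) hi--;
            if (lo < hi) { int t = a[lo]; a[lo] = a[hi]; a[hi] = t; lo++; hi--; }
        }
        return lo;
    }
}
```

Note that a partially rearranged block is simply rescanned from its start on the next pass; elements already in place are skipped by the scans, so correctness is preserved at the cost of a little repeated work.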

The interesting bit is that pairs of blocks can be rearranged independently. Workers can concurrently pick blocks from the corresponding ends of the array and rearrange their elements in parallel.

A block once taken by a worker must not be accessible to other workers, and when no blocks are left a worker must stop. Basically we have two counters: the numbers of blocks taken from the left and right ends of the array. In order to take a block we must atomically increment the corresponding counter and check that the sum of the two counters is less than or equal to the total number of blocks; otherwise all blocks are exhausted and the worker must stop. Doing this under a lock is simple and acceptable for large arrays and blocks but inefficient for small ones.

We will pack the two counters into a single 32 bit value where the lower 16 bits hold the right blocks counter and the higher 16 bits the left blocks counter. To increment the right and left counters, 1 and 1&lt;&lt;16 must be added to the combined value respectively. The atomically updated combined value allows us to extract the individual counters and decide whether a block was successfully taken.

Since each worker may race for the last remaining block, care must be taken to avoid overflow. So only 15 bits are effectively used for each counter, and it would take 1&lt;&lt;15 workers to cause an overflow, which is not realistic.

...
// Class that maintains taken blocks in a thread-safe way.
private class BlockCounter
{
    private const int c_minBlockSize = 1024;
    private readonly int m_blockCount;
    private readonly int m_blockSize;
    private int m_counter;
    private const int c_leftBlock = 1 << 16;
    private const int c_rightBlock = 1;
    private const int c_lowWordMask = 0x0000FFFF;

    public BlockCounter(int size)
    {
        // Compute block size given that we have only 15 bits
        // to hold block count.
        m_blockSize = Math.Max(size / Int16.MaxValue, c_minBlockSize);
        m_blockCount = size / m_blockSize;
    }

    // Gets selected block size based on total number of
    // elements and minimum block size.
    public int BlockSize
    {
        get { return m_blockSize; }
    }

    // Gets total number of blocks that is equal to the
    // total number divided evenly by the block size.
    public int BlockCount
    {
        get { return m_blockCount; }
    }

    // Takes a block from the left end and returns a value which
    // indicates whether the taken block is valid since due to
    // races a block that is beyond the allowed range can be
    // taken.
    public bool TakeLeftBlock(out int left)
    {
        int ignore;
        return TakeBlock(c_leftBlock, out left, out ignore);
    }

    // Takes a block from the right end and returns its validity.
    public bool TakeRightBlock(out int right)
    {
        int ignore;
        return TakeBlock(c_rightBlock, out ignore, out right);
    }

    // Atomically takes a block either from the left or right end
    // by incrementing the higher or lower word of a single
    // double word and checks that the sum of taken blocks
    // so far is still within the allowed limit.
    private bool TakeBlock(int block, out int left, out int right)
    {
        var counter = unchecked((uint)Interlocked.Add(ref m_counter, block));
        // Extract number of taken blocks from left and right
        // ends.
        left = (int)(counter >> 16);
        right = (int)(counter & c_lowWordMask);
        // Check that the sum of taken blocks is within the
        // allowed range and decrement them to represent the
        // most recently taken block indices.
        return left-- + right-- <= m_blockCount;
    }
}
...

With multiple workers rearranging pairs of blocks we may end up with “holes”.

In the example above blocks l1, l4 and r1 are the holes in the left and right partitions of the array, meaning they were not completely rearranged. We must compact the left and right partitions so that they contain no holes.
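The post does not show the compaction code, but one possible approach (a sketch only; the names compactLeft and holeStarts are hypothetical, and recording the start index of every hole block is an assumption) is to swap each hole with the rightmost fully rearranged block of the partition, so the holes gather at the partition boundary:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HoleCompaction {
    // Swaps every hole block (identified by its start index) inside the
    // left partition [0, leftEnd) with the rightmost non-hole block, so
    // that [0, newEnd) contains only fully rearranged blocks.
    // Returns newEnd, the boundary behind which the holes are gathered.
    static int compactLeft(int[] a, int leftEnd, int blockSize, List<Integer> holeStarts) {
        Collections.sort(holeStarts);
        Set<Integer> holes = new HashSet<>(holeStarts);
        int h = 0;                    // next (leftmost) hole to fill
        int g = leftEnd - blockSize;  // candidate good block, rightmost first
        while (h < holeStarts.size() && holeStarts.get(h) < g) {
            if (holes.contains(g)) { g -= blockSize; continue; } // skip holes
            int hole = holeStarts.get(h);
            for (int k = 0; k < blockSize; k++) {                // swap whole blocks
                int t = a[hole + k]; a[hole + k] = a[g + k]; a[g + k] = t;
            }
            h++;
            g -= blockSize;
        }
        return leftEnd - holeStarts.size() * blockSize;
    }
}
```

The right partition can be compacted symmetrically, and the region left between the two compacted partitions is then partitioned sequentially.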

Now we have a parallel implementation of the quick sort partition phase. Experiments with randomly generated arrays of integer values show that it speeds up parallel quick sort by approximately 50% on an 8-way machine.