Quicksort is typically more efficient than other comparison sorting algorithms, in both auxiliary memory requirements and execution cycles. Yet it shares the same average-case time complexity bound as its peers: O(n log n).

Parallelism is the tool for breaching that barrier. Because its recursion divides the array into disjoint partitions, Quicksort is especially well suited to parallelism: multiple threads of execution can work on separate parts of the sorting task without synchronized data access.

We will detail two effective synchronization policies for parallel Quicksort in Java. One is usable in production now; the other is coming soon in JDK 7.

Naive First Take

When writing a parallel algorithm, identify your synchronization policy early on. This answers the question, "How does the caller know when we're done?"

The simplest synchronization policy we have is the JVM call stack - straight serial programming.
To model that with threads, recursively spawn a new thread for each new partition, then wait for those threads to finish. Here's how that might look:
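(A sketch of the idea - partition() is an assumed in-place partitioning helper that returns the pivot's final index.)

    static void naiveQuicksort(final int[] a, final int left, final int right)
            throws InterruptedException {
        if (left >= right) {
            return;
        }
        final int p = partition(a, left, right);
        // Spawn a thread for each sub-partition...
        Thread lo = new Thread(new Runnable() {
            public void run() {
                try {
                    naiveQuicksort(a, left, p - 1);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        Thread hi = new Thread(new Runnable() {
            public void run() {
                try {
                    naiveQuicksort(a, p + 1, right);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        lo.start();
        hi.start();
        // ...then the call stack is our synchronization policy: wait for both.
        lo.join();
        hi.join();
    }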

Break out the Fail Whale. Spawning threads is expensive - orders of magnitude more expensive than function calls - and we're spawning 2*n threads here! Even with an infinite number of cores, the naive approach is a great way to model glacial movement. We need a way to manage and limit all these threads running around.

Our second take replaces raw thread spawning with a fixed thread pool. Deceptively simple, it nonetheless has shortcomings that render it unusable.
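A sketch of its shape - the pool sizing and partition() are assumptions, not the original listing:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    static final ExecutorService POOL =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    static void poolQuicksort(final int[] a, final int left, final int right) {
        if (left >= right) {
            return;
        }
        final int p = partition(a, left, right);
        // execute() is fire-and-forget: sub-partitions are queued for pooled
        // threads, and control returns to the caller immediately.
        POOL.execute(new Runnable() {
            public void run() {
                poolQuicksort(a, left, p - 1);
            }
        });
        POOL.execute(new Runnable() {
            public void run() {
                poolQuicksort(a, p + 1, right);
            }
        });
    }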

Two problems remain:

1. The overhead of execute is still considerably greater than that of a recursive method call. Internally, execute pushes the Runnable onto a shared blocking queue. Worker threads then pull from that queue, creating a single synchronization point of contention.

2. The execute call is asynchronous, so we have no way to let the calling process know when the work is done.

Problem 1 is one of communication overhead; problem 2, one of completion notification.

Granularity

Below a certain task size, communication overhead gives parallelism diminishing - usually negative - returns. The simple fix is to execute a task serially if it is smaller than a certain threshold.

But finding the optimal granularity is non-trivial, since it varies greatly between machines and inputs. One approach is to instrument the algorithm with live profiling and a feedback loop - but that's another article.

As a simplifying alternative, we can set a hard threshold. Via experimentation, I found 0x1000 (4096) is roughly optimal on my quad-core machine.
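In code, the threshold becomes a guard at the top of the recursion. A sketch of the shape - serialQuicksort, partition, and submitTask stand in for the real helpers:

    static final int SERIAL_THRESHOLD = 0x1000;  // 4096: roughly optimal on my quad-core

    static void sortOrPartition(final int[] a, final int left, final int right) {
        if (right - left + 1 <= SERIAL_THRESHOLD) {
            serialQuicksort(a, left, right);  // small segment: skip the parallel machinery
            return;
        }
        final int p = partition(a, left, right);
        submitTask(a, left, p - 1);   // only segments above the threshold
        submitTask(a, p + 1, right);  // pay the cost of the pool
    }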

The threshold brings two further benefits:

- Auxiliary data requirements plummet from O(n) - and that with a large constant factor - to O(log n).
- Data locality: most of the work is now done in contiguous segments of size SERIAL_THRESHOLD or less, on the same thread. This takes full advantage of processor caches.

That leaves the problem of notifying the invoking process when the sort is complete.

Count-Down Latch (approach 1)

A count-down latch is a synchronization primitive that allows one or more threads to wait on a set of tasks being performed in other threads. As each bit of work completes, workers decrement the count. Once the count hits zero, all waiting threads are signaled that the work is done.

For our case, the size of the work - and thus the initial value of our count-down latch - is the length of the array being sorted. As each array value is placed in its sorted position, the count-down latch is decremented accordingly. When all values are in their sorted positions, the latch hits zero and the invoking process is signaled.

Lucky for us, JDK 5 introduced a CountDownLatch.
Unlucky for us, it exposes only a single decrementing method, countDown(), which decrements the count by one. Our algorithm needs to decrement the count efficiently, in increments of SERIAL_THRESHOLD or less.

Fortunately, such a count-down latch is easy to write and I provide one here: CountDownLatch.java
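Its essence - a minimal sketch, not the linked implementation; names here are illustrative:

    public final class BulkCountDownLatch {
        private long count;

        public BulkCountDownLatch(long count) {
            this.count = count;
        }

        // Decrement by delta, waking all waiters once the count reaches zero.
        public synchronized void countDown(long delta) {
            if (count > 0) {
                count -= delta;
                if (count <= 0) {
                    notifyAll();
                }
            }
        }

        // Block until the count reaches zero.
        public synchronized void await() throws InterruptedException {
            while (count > 0) {
                wait();
            }
        }
    }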

[5,8,38] In addition to the array reference, each task needs references to the count-down latch and the thread pool, so some refactoring was in order to encapsulate that trio. LatchQuicksortTask holds the references to the array, latch, and pool. LatchQuicksortSubTask does the actual work and holds a single reference to its root; the first sub-task is spawned in the root's run() method.

[33,36] The root task encapsulates the count-down latch and initializes the count to the size of the array to sort. The invoking process can block until the sort is done by calling waitUntilSorted().
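Kicking off a sort might then look like this - hypothetical usage; only the class name, run(), and waitUntilSorted() come from the listing, the constructor is assumed:

    ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    int[] data = { 5, 3, 8, 1, 9, 2 };  // toy input

    LatchQuicksortTask root = new LatchQuicksortTask(data, pool);  // assumed constructor
    pool.execute(root);      // the root's run() spawns the first sub-task
    root.waitUntilSorted();  // blocks until the latch reaches zero
    pool.shutdown();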

[42] partitionOrSort(int,int) is called from the sub-task, and thus from inside the pool.

[48] If we are at the threshold, we sort a whole section at once. We calculate the size of the sorted area so we can decrement our latch later.

[51] Otherwise, we partition the sub-section per normal Quicksort. The nature of partition is such that if a poor pivot is selected, one partition might be very small. In the extreme case, where the pivot is equal to a boundary, that value is already sorted and will not be included in the spawned sub-tasks [18,20]. We must account for it, or waiting processes will never wake. See the comment on countSortedBoundaryValues [68] for more details.

[54] Decrement the latch by the count calculated above, according to whether we sorted or partitioned.

On a quad-core machine, this algorithm nets a 300% speedup over serial quicksort! So it works - what's not to like?

Two things:

1. The execution model still has a serial bottleneck. Our thread-pool executor service schedules tasks through a single, shared blocking queue, so the algorithm will not scale linearly as we add cores and threads.

2. Complexity. This approach goes a long way toward obscuring C.A.R. Hoare's classic recursive algorithm; the count-down latch derivative scarcely resembles its ancestor.

We can do better.

Fork/Join (approach 2)

As of JDK 7, Java has a first-class Fork/Join framework, courtesy of the immortal Doug Lea. A detailed explanation of the Fork/Join concepts and their Java implementation deserves a post of its own.

In brief, the shared-queue model is replaced with one deque per thread. Each thread pushes and pulls work at one end of its own deque. When a thread runs out of work, it steals, round-robin, from the opposite end of other threads' deques.

For our purposes, Fork/Join addresses the two shortcomings of LatchQuicksortTask.

As long as each thread in a Fork/Join pool is pushing and pulling work from one end of its own deque, no synchronization is needed! Shared mutable state (and synchronization) only comes into play when threads steal work from each other.

The Fork/Join framework automatically takes care of signaling threads that are waiting on tasks to finish, so we can write our algorithm in a style more akin to serial quicksort.
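A sketch of how that might look - illustrative, not the article's listing; it assumes JDK 7's RecursiveAction and reuses the SERIAL_THRESHOLD idea from earlier:

    import java.util.Arrays;
    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveAction;

    public class ForkJoinQuicksort extends RecursiveAction {
        private static final int SERIAL_THRESHOLD = 0x1000;

        private final int[] a;
        private final int left;
        private final int right;

        public ForkJoinQuicksort(int[] a) {
            this(a, 0, a.length - 1);
        }

        private ForkJoinQuicksort(int[] a, int left, int right) {
            this.a = a;
            this.left = left;
            this.right = right;
        }

        @Override
        protected void compute() {
            if (right - left + 1 <= SERIAL_THRESHOLD) {
                Arrays.sort(a, left, right + 1);  // small segment: sort serially
                return;
            }
            int p = partition(a, left, right);
            ForkJoinQuicksort lo = new ForkJoinQuicksort(a, left, p - 1);
            lo.fork();                                         // push onto this worker's deque
            new ForkJoinQuicksort(a, p + 1, right).compute();  // right half on this thread
            lo.join();  // not a hard block: the worker runs queued tasks meanwhile
        }

        // Simple Lomuto partition; the pivot lands at its sorted index p.
        private static int partition(int[] a, int lo, int hi) {
            int pivot = a[hi];
            int i = lo;
            for (int j = lo; j < hi; j++) {
                if (a[j] < pivot) {
                    int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                    i++;
                }
            }
            int tmp = a[i]; a[i] = a[hi]; a[hi] = tmp;
            return i;
        }
    }

Invoke it with new ForkJoinPool().invoke(new ForkJoinQuicksort(data)).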

Waiting for a Future to complete (calling get()) blocks the invoking thread until the work completes. Do this inside a task executed by a pool thread, and you prevent that thread from completing its current task and returning to the pool for more work. With quicksort, it only takes a few partitions to block every thread in the pool, causing deadlock.

This is in contrast to ForkJoinTask#join(), which need not block the worker thread: it returns to its deque for more work until the joined task is done.