Parallel Merge

By Victor J. Duvanenko, February 08, 2011

Algorithm Improvement through Performance Measurement: Part 12

For the next several articles in this series, we'll explore parallel and sequential merge algorithms. We'll utilize Intel Threading Building Blocks as well as Microsoft Parallel Patterns Library (PPL), which is part of Visual Studio 2010. We'll implement several different parallel and sequential merge algorithms, and investigate their performance on today's multicore processors. In this article, we'll leap right into a very interesting parallel merge, see how well it performs, and attempt to improve it.

Merge

Merge is a fundamental operation, where two sets of presorted items are combined into a single set that remains sorted. For example, given two sets of integers

5, 11, 12, 18, 20

2, 4, 7, 11, 16, 23, 28

The resulting merged set is

2, 4, 5, 7, 11, 11, 12, 16, 18, 20, 23, 28

The first set has five elements, whereas the second set has seven, and the result set has twelve. Merge does not require the source sets to be of equal size. Merge can operate on two input arrays, and produce an output array. The output array has to be large enough to hold the result, which is generated in O(n) time.

Listing One shows a simple sequential merge algorithm implementation, where two read-only input arrays of different sizes are merged into a single output array. The input arrays must be presorted in monotonically non-decreasing order, and the resulting output array will be sorted the same way.

Inside the first while loop, an if statement compares the current element from the a[] array with the current element from the b[] array, outputting whichever is smaller; ties take the element from a[], which keeps the merge stable. Each array has an index that keeps track of the current position within that array, and only the index of the array whose element was output gets incremented. When the current position of either array goes past the end of that array, the first while loop terminates. At that point, all of the elements of one array have been used, but the other array may still have elements left; since comparisons between the inputs are no longer possible, the second and third while loops copy out any remaining elements.
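Listing One itself is not reproduced in this excerpt; a minimal sketch of such a sequential merge (illustrative names, assuming raw pointers into presorted arrays) might look like this:

```cpp
#include <cstddef>

// Merge presorted a[0..aSize) and b[0..bSize) into dst[0..aSize+bSize).
// On a tie, the element from a[] is taken first, which keeps the merge stable.
void mergeSimple(const int* a, size_t aSize,
                 const int* b, size_t bSize, int* dst)
{
    size_t ia = 0, ib = 0, id = 0;
    while (ia < aSize && ib < bSize) {          // both inputs have elements left
        if (a[ia] <= b[ib]) dst[id++] = a[ia++];
        else                dst[id++] = b[ib++];
    }
    while (ia < aSize) dst[id++] = a[ia++];     // copy any leftovers from a[]
    while (ib < bSize) dst[id++] = b[ib++];     // copy any leftovers from b[]
}
```

Exactly one element is output per loop iteration, which is where the O(n) bound comes from.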

Merge is O(n), where n is the number of output elements, since one element is output during each iteration of the while loop(s). More complex merges support more than two input arrays, in-place operation, and can support other data structures such as linked lists.

Divide-and-Conquer Merge

The aforementioned sequential merge algorithm does not parallelize in an obvious fashion. Divide-and-conquer algorithms typically parallelize well, but merge is tricky: it is already O(n), and making the order worse is not desirable. In other words, we want an O(n) or better divide-and-conquer merge algorithm. The authors of the latest (3rd edition) Introduction to Algorithms book [1] develop, describe, and analyze such an algorithm in the new chapter on multithreaded algorithms. Listing Two shows a direct translation of the parallel pseudocode into a sequential C++ implementation. A binary search algorithm is also shown, which is a translation of the pseudocode on p. 799. Figure 1 illustrates how the algorithm works by showing the process within one level of recursion.

The input to the divide-and-conquer merge algorithm comes from two sub-arrays of T, and the output is a single sub-array A. The two input sub-arrays of T are from [p1 to r1] and from [p2 to r2]. The output sub-array of A is from [p3 to r3]. The divide step is done by choosing the middle element of the larger of the two input sub-arrays (at index q1 in Figure 1 and in Listing Two). The value X at this index is then used as the partition element to split the other input sub-array into two sections: elements less than X and elements greater than or equal to X, as shown in Figure 1. The partition value X (at index q1 of T) is copied to array A at index q3. The conquer step recursively merges the two portions that are less than or equal to X (the light gray box and arrow shading), and recursively merges the two portions that are greater than or equal to X (the darker gray box and arrow shading).

The algorithm proceeds recursively until the termination condition is reached: the shorter of the two input sub-arrays has no elements. Each step reduces the problem by one element, splitting one merge into two smaller merges with the partition element placed between their outputs. Because the larger input sub-array is split in half, each of the two smaller merges contains at least N/4 (and at most 3N/4) of the N elements.
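Listing Two is not reproduced in this excerpt; a sketch of what a direct sequential translation of the CLRS pseudocode might look like (illustrative names; inclusive index ranges, matching the pseudocode) is:

```cpp
#include <algorithm>

// Binary search: smallest index in [p, r+1] such that x <= t[index]
// (a translation of the BINARY-SEARCH pseudocode on p. 799 of CLRS 3rd ed.).
int myBinarySearch(int x, const int* t, int p, int r)
{
    int low  = p;
    int high = std::max(p, r + 1);          // handles an empty range (r < p)
    while (low < high) {
        int mid = (low + high) / 2;
        if (x <= t[mid]) high = mid;        // strict "less than" stays left of x
        else             low  = mid + 1;
    }
    return high;
}

// Merge presorted t[p1..r1] and t[p2..r2] into a[p3..r3].
// Sequential translation of the P-MERGE pseudocode; the two recursive calls
// are the ones a parallel version would run concurrently.
void mergeDac(const int* t, int p1, int r1, int p2, int r2, int* a, int p3)
{
    int length1 = r1 - p1 + 1;
    int length2 = r2 - p2 + 1;
    if (length1 < length2) {                // make the first range the larger one
        std::swap(p1, p2);
        std::swap(r1, r2);
        std::swap(length1, length2);
    }
    if (length1 == 0) return;               // both ranges empty: nothing to merge
    int q1 = (p1 + r1) / 2;                 // middle of the larger range
    int q2 = myBinarySearch(t[q1], t, p2, r2);
    int q3 = p3 + (q1 - p1) + (q2 - p2);    // final position of the partition value
    a[q3] = t[q1];                          // place X; recursions fill either side
    mergeDac(t, p1, q1 - 1, p2, q2 - 1, a, p3);
    mergeDac(t, q1 + 1, r1, q2, r2, a, q3 + 1);
}
```

Note how the binary search returns the first index whose element is greater than or equal to X, so only elements strictly less than X land in the left recursive merge.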

The algorithm is stable, because each recursion step is stable. When the split is made using the mid-left element X, only right-array elements that are strictly less than X are placed in the smaller left merge; right-array elements equal to X go to the smaller right merge and therefore come out after X. This strict comparison is the crucial step ensuring stability.

Table 1 and Graph 1 show performance measurements for merging of two random input arrays that are pre-sorted, for the simple merge, STL merge, and divide-and-conquer merge.

Table 1.

Graph 1.

The simple merge and STL merge algorithms are always faster than the divide-and-conquer merge algorithm, by 2-5 times. Performance measurements confirm all algorithms to be linear. The number of items in Table 1 and Graph 1 is the size of each of the two input arrays (which are of equal size), producing a result array of twice that size. Each item is a 32-bit unsigned number. At that size, the simple merge algorithm outputs items at a rate of 153 million per second; since every item written was also read from one of the inputs, the total memory bandwidth is about 1.2 GB/second for reads and writes combined.
