Sorting data in parallel CPU vs GPU

Posted on February 4, 2013 by Paul

For many programmers sorting data in parallel means implementing a state of the art algorithm in their preferred programming language. However, most programming languages have a good serial sorting function in their standard library. It appears to me, that the obvious thing to do is to first try to use what your language library provides. If this approach is not successful, you should try to find an existing library that is used, and consequently well debugged, by other programmers. Only as a last resort, you should implement a new sorting algorithm from scratch.

In the case of C++, we have a well tested sorting function in the STL, std::sort, unfortunately std::sort will use only a fraction of the processing power available in a modern multicore system.

Suppose now, that you have two sorted arrays and your task is to combine them as a single sorted array. We are fortunate enough to have a function made just for these kind of situations - std::merge. This means, in a nutshell, that when we need to sort some data, we can split this data in a few pieces, sort each piece of data in a separate thread and use std:merge to combine the sorted pieces - a poor man’s parallel merge-sort algorithm.

In the next figure I’ve tried to exemplify the above idea. We start, for e. g., with a ten elements array. We split this array in four chunks, after this, each chunk of data is passed to std::sort in a separate thread. We end up with four chunks of sorted elements. Now, we can use std::merge to combine the sorted data also in parallel, in pairs, we do this until we end up with a single piece of data. Please note that the numbers in the next diagram represents array indices:

We can take a concrete example to see how the algorithm works:

In order to implement the above algorithm in C++, we start by implementing two helper functions, one for partitioning the data and one for filling an array with random numbers:

I’ve used three C++ compilers: gcc-4.6.3 which is the default compiler on Ubuntu 12.04, gcc-4.7.2 and Clang-3.3svn with libc++. If you are interested in building from sources gcc-4.7.2 and Clang-3.3 you can consult this and this. Each test was repeated one hundred times, the data was averaged and we’ve calculated the standard deviation.

Next image presents the raw data - time versus number of elements for CPU 1/8 threads and for the GPU:

Here, we’ve used the GPU time to normalize the data for the CPU:

Next figure presents the data for CPU only, where we’ve used the parallel data to normalize the results:

The last image suggests that using the building blocks of the STL we can obtain about 2.5 speedup for sorting data in parallel versus the serial version.

In the next table, we present the size in bytes of the executable generated by each compiler:

Compiler

Gcc-4.7.2

Gcc-4.6.3

Clang-3.3svn

Nvcc

Executable size [bytes]

36926

48860

25575

826378

I’ve received a few messages about not taking into consideration the cost of moving data from CPU to GPU and the other way around. Your complaints were heard. In the next set of figures the GPU time contains the cost of moving the data back and forth. Also, I’ve replaced the data for the single thread test with a single call to std::sort. The original and the modified code is on Github. Without further ado let’s see the figures.

The raw data - time versus number of elements for CPU 1/8 threads and for the GPU:

Here, we’ve used the GPU time to normalize the data for the CPU:

Last image presents the data for CPU only, where we’ve used the parallel data to normalize the results:

Disclaimer:All data and information provided on this site is for informational purposes only. solarianprogrammer.com makes no representations as to accuracy, completeness, currentness, suitability, or validity of any information on this site and will not be liable for any errors, omissions, or delays in this information or any losses, injuries, or damages arising from its display or use. All information is provided on an as-is basis. solarianprogrammer.com does not collect any personal information about its visitors except that which they provide voluntarily when leaving comments. This information will never be disclosed to any third party for any purpose. Some of the links contained within this site have my referral id, which provides me with a small commission for each sale. Thank you for understanding.