Meta

Parallel Stream Compaction: [4] Metadata Volume

This is nr. 4 in a collection of blog posts on Parallel Stream Compaction. In previous posts we saw that the implementation by Billeter, Olsson, and Assarsson is faster than implementations by Orange Owls and Spataro; a multiplicative relation holds, even over different graphics cards. We saw that the essential differences between the fastest algorithm, by Billeter et al., and the others, are that the former uses sequences and a very small data structure for metadata, and the latter programs do not.

In the previous post we saw that the introduction of sequences in itself does not provide a performance advantage. The question we address here is: Can the sheer volume of the metadata account for the performance differences?

The size of the meta data structure

The element counts of the metadata structures corresponding to best time performance on a stream of 2^24 elements are:

512 unsigned ints in Billeter, Olsson, and Assarsson’s algorithm.

16,384 ints in Spataro’s algorithm.

16,384 ints in Orange Owls’ implementation of Spataro

Billeter et al. have a metadata structure of 512 elements (of which they use 480 elements). The size is set at compile time and is part of a configuration the authors deem optimal. If you do not change the code, the size fixed.

The size of the metadata structure in Spataro’s and Orange Owls’ code equals the number of blocks, hence it varies with the threading configuration. In the first post in this series we determined the best time performance for a stream of 2^24 ints of both programs. The nr of elements in the metadata structures corresponding with the best performance was 16,384 in both cases.

The large difference in size is a consequence of design: Spataro’s and Orange Owls’ algorithms collect metadata per block of threads. Billeter et al.’s: code collect metadata per sequence.

Parameter search

To find out if there are significant differences, I did a parameter search for optimal numbers of threads per block and the number of thread blocks for two approaches:

With sequences; the number of elements in the metadata array equals 512.

Without sequences; the number of elements in the metadata array equals the number of blocks.

The test code that made the comparisons again uses a stream of 2^24 unsigned ints, selected in a rand() way. The fraction of valid elements is about one half, again decided by rand(). Both approaches were tested with scalars, pairs and quads.

Time performance results

The table below shows the performance for both alternatives; for thread counts that seem to fit the algorithms best.

Block Wise (µs)

Sequence Wise (µs)

Threads

512 (16,384 blocks)

64 (256 blocks)

Scalar

771.0

502.1

uint2

509.6

445.1

uint4

448.2

445.4

We see that the block wise algorithm can take great advantage of vectorized reads. For uint4 the algorithm is just as fast as the sequence wise algorithm. The latter can take only limited advantage of vectorized reads, but is in itself blazingly fast.

Lesson Learned

Compare these results to the table in part 2 of this series. In that table Billeter et al. have 455.5 µs for the count phase. They use vectorized reads with pairs. Notice how similar the time measurements in both tables are for this IO configuration. Spataro and Orange Owls however, use scalar based IO, which is much slower. This then is the explanation why the code by Spataro and Orange Owls is slower than that by Billetter et al.: The formers do not use vectorized IO! It also explains why the difference in time performance is a factor, i.e. has a multiplicative character: Billeter et al. read and write twice as much data in a single action.

So, the code by Spataro and Orange Owls could have been just as fast as the code by Billeter et al., if they just had used vectorized IO. It is neither the use of sequences, nor the size of the metadata structure.

This is an important lesson: It is knowledge of the language and platform facilities that makes a program the fastest one, or a slower one.

Demo Code

The complete demo code for this blog post can be downloaded from here.

Next

Next up: Now that it is clear that all three implementations of parallel stream compaction that have been explored in this blog post series could have had about equal performance, I would like to combine the software I wrote or adapted from the sources investigated, into a short and fast program for parallel stream compaction.