Let A be an array of unique numbers of size n, where the minimum and maximum
numbers in the array are min and max respectively.

In this blog post we introduce a parallel algorithm, bfsort, based on
Bloom Filters, that can sort the array A in O(log n) time and O(m)
space, where m is the size of the range { min, min + 1, min + 2, … , max - 1, max },
while the sequential version can sort the array A in O(m) time and O(n)
space. We argue that if the merge operations of empty arrays could be kept in
CPU cache, given unlimited cores, then the memory overhead wouldn't have that
big of an impact on the parallel version of the algorithm.

As the algorithm is based on Bloom Filters, both the parallel and sequential
versions sort the array A with a collision probability of 0.00001%. The
algorithm is parameterized on the collision probability, which can be adjusted
for optimal results based on the size of the array.

1. Introduction

In this section, we will describe a few data structures and concepts that will
ease the understanding of the following sections:

Sorting a sequence of values can be defined with the following
constraint: { a₀ ≤ a₁ ≤ … ≤ aₙ }, where each number is less than or equal to its
successor in the sequence. When working with sorting algorithms, it's important
to understand the underlying data structure that stores the values. Arrays
allow additions in linear time, as a new array of size n + 1 needs to be allocated at
another spot in memory, while indexed accesses are in constant time. Lists
allow constant-time additions and linear-time lookups.

A Bloom Filter is a space-efficient data structure, as it saves the necessary
information in bits, see Figure 0. This differs from HashTables, which
need at least a boolean, normally implemented as a byte where only
one bit is set and the rest are unused, to store the same information. When
testing whether an element is in the set, the Bloom Filter can only guarantee that an
element definitely is not in the set; if an element appears to be in the filter, it
still has to make a lookup on the underlying data structure. There is a
probability that an element appears to be in the filter while it isn't in the underlying
data structure: a false positive. False positives occur because the set of
hash functions can set the same bits for different words. This is why it's very
important to choose the most space-efficient number of bits, m, in relation to
the probability of false positives, p, while choosing the optimal number of
hash functions, k, that uniformly distribute the values across the filter
[cao]:

Figure 0: Example of a Bloom filter by [eppstein], 15 August 2007.
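
For n values, the optimal relations from the literature are m = −n · ln p / (ln 2)²
bits and k = (m / n) · ln 2 hash functions. As a minimal sketch, this is how the
sizing could be computed in F# (the function name is ours, not part of the
BloomFilter module):

```fsharp
// Standard Bloom Filter sizing: m = -(n · ln p) / (ln 2)² bits and
// k = (m / n) · ln 2 hash functions (the function name is illustrative).
let optimalSize (n: int) (p: float) : int * int =
    let ln2 = log 2.0
    let m = ceil (-(float n) * log p / (ln2 * ln2))
    let k = ceil (m / float n * ln2)
    int m, int k

// Example: optimalSize 1000 0.0000001 gives roughly (33548, 24),
// i.e. about 4.1 KB of bits and 24 hash functions.
```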

Asymptotic performance measures:

DAG: A directed acyclic graph, G = (V,E), in this blog post represents
the computations of an algorithm. The vertices, V, represent the
instructions to be computed and the edges, E, represent the dependencies
between the instructions. If there is an edge, (u,v) ∈ E, the instructions
in the vertex v cannot be executed before the instructions in the vertex
u.

Work: The total time of computing all instructions of the algorithm when
executed on a computer with a single processor, see Figure 1. Work can be
mapped to the DAG as all the vertices, V, in the graph. The work law
states that a computer with k processors can perform at most k units of work
per time step, so if the total amount of work is T(n), then k · Tₖ(n) ≥ T(n);
the execution time of the algorithm with k processors is bounded from below by
the amount of work divided by the number of processors: Tₖ(n) ≥ T(n) / k.

Span: The execution time of the longest path in the DAG, also known as the
critical path, see Figure 1. The asymptotic time complexity of the
algorithm is dominated by the critical path. The span law is defined as
follows: no parallel computer with a limited number of k processors can
execute a parallel algorithm faster than a computer with an unlimited number
of processors, since the computer with unlimited processors can always just
execute the parallel algorithm with only k of them: Tₖ(n) ≥ T∞(n).

2. Sorting with Bloom Filters

As mentioned in the previous section when describing the Bloom Filters, in
order to uniformly distribute the values across the filter, we must find the
optimal relation between bits, probability of false positives and number of hash
functions.

For this algorithm, the primary concern is the probability of false positives
and not so much the space efficiency or the number of hash functions. For this
reason, we choose a very low probability as the default value, 0.0000001, even
though it means that the number of bits and hash functions will be slightly
bigger than normal. Even though it's a very low probability, the
chosen number ensures that the size of the underlying bit array in the Bloom
Filter doesn't become larger than the length of the array, which in terms
of asymptotic complexity means the algorithm uses n extra
space, where n is the length of the array.

We populate the Bloom Filter with the values in linear time. It's
worth noticing that even though we calculate several hashes and perform several
updates on the Bloom Filter for each value, those operations are still in constant
time. We can also find the minimum and maximum values in the array in
linear time, more specifically in the same traversal of the array as the one
that performs the updates on the Bloom Filter.

Figure 5: Bloom Filter bits populated with the unique bytes from the array.
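
To make the following steps concrete, here is a minimal, self-contained
stand-in for the Bloom Filter in F#. The real implementation uses MurmurHash3
and the modules described in section 4; the simple seeded hash below is purely
illustrative:

```fsharp
open System.Collections

// A minimal stand-in for the Bloom Filter: m bits and k seeded hash
// functions (the real implementation hashes with MurmurHash3, see section 4).
type Bloom = { bits: BitArray; seeds: int [] }

let create (m: int) (k: int) =
    { bits = BitArray m; seeds = [| 1 .. k |] }

// One bit index per seed; the hash below is illustrative only.
let indexes (b: Bloom) (v: int) =
    b.seeds
    |> Array.map (fun s -> (((v ^^^ s) * 16777619) &&& 0x7FFFFFFF) % b.bits.Length)

let add (b: Bloom) (v: int) =
    indexes b v |> Array.iter (fun i -> b.bits.[i] <- true)

let maybeContains (b: Bloom) (v: int) =
    indexes b v |> Array.forall (fun i -> b.bits.[i])

// One linear traversal both populates the filter and finds min and max.
let populate (b: Bloom) (values: int []) =
    let mutable lo = values.[0]
    let mutable hi = values.[0]
    for v in values do
        add b v
        if v < lo then lo <- v
        if v > hi then hi <- v
    lo, hi
```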

After the bits of the Bloom Filter are populated with the values of the
array, we can now begin to execute the part of the algorithm that will produce a
sorted array.

As we have found the minimum and the maximum values in the array, we can now
recursively iterate through all the numbers in the range from the minimum
number to the maximum number.

The technique we use is to exclude numbers which we definitely know are not in
the set. And because we have chosen a very low probability of false
positives, we can easily argue that numbers that aren't excluded most likely
are in the initial array.
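
Putting the pieces together, a sketch of the sequential algorithm could look as
follows, reusing optimalSize, create, populate and maybeContains from the
snippets above:

```fsharp
// Sequential bfsort sketch: every value in [lo, hi] that is definitely not
// in the filter is excluded; the rest are emitted in order.
let bfsortSequential (values: int []) : int [] =
    let m, k = optimalSize values.Length 0.0000001
    let b = create m k
    let lo, hi = populate b values
    [| for v in lo .. hi do if maybeContains b v then yield v |]
```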

Therefore we can state that the work is in Θ(m), where m is the length
of the interval from the minimum value to the maximum value. As this is a
sequential algorithm, the span will also match the work and therefore be
in O(m).

It’s worth noticing that we only need n extra space, where n is the length
of the array to be sorted.

For an example of the algorithm applied to a list of bytes, please see
Figures 2 - 6.

3. Parallelizing the algorithm

Readers familiar with Bloom Filters will recognize that there are certain
operations in the algorithm described in the previous section that can be
parallelized.

Figure 7: Pseudocode of a fork/join implementation in F#, an ML-family programming language.

In this section we will write them down and show how we can lower the span,
the critical path, if a computer with an unlimited number of cores is provided
and a fork/join approach is used, see the pseudocode from Figure 7, which will reduce the
global asymptotic time complexity of the algorithm:
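
As a concrete reference for the rest of this section, the fork/join combinator
from Figure 7 could look roughly like the following sketch, using the
Async.StartChild approach mentioned in section 4 (the name and exact signature
are our assumptions):

```fsharp
// Fork/join sketch: split the index range in two, fork the left half as a
// child computation, recurse on the right half, then join and merge.
let rec forkJoin (leaf: int -> 'a) (merge: 'a -> 'a -> 'a) (lo: int) (hi: int) : Async<'a> =
    async {
        if lo = hi then
            return leaf lo
        else
            let mid = lo + (hi - lo) / 2
            let! left = Async.StartChild (forkJoin leaf merge lo mid)
            let! right = forkJoin leaf merge (mid + 1) hi
            let! leftResult = left
            return merge leftResult right
    }
```

For example, `forkJoin (fun i -> a.[i]) (+) 0 (a.Length - 1)` sums an array
with logarithmic span, given enough cores.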

3.1 Initialization

In the sequential algorithm, this step is done in linear time by iterating
over each of the elements of the array.

Allocate underlying Bloom Filter array: Allocating the array in memory can
be done in constant time, while populating it with initial zero values
can be done in logarithmic time, O(h), where h = log₂ n, when using a
parallel algorithm similar to the one described in the pseudocode from Figure 7 in
this section.
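
A minimal sketch of this step, assuming the ref-cell representation described
in section 4:

```fsharp
// Sketch: allocate the m cells of the underlying array in parallel
// (assuming the ref-cell representation from section 4).
let createCells (m: int) : int ref [] =
    Array.Parallel.init m (fun _ -> ref 0)
```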

Seeds and Hashes: As the size in bits of the underlying Bloom
Filter array is much greater than the number of seeds and
hashes, none of these operations will have an impact on the time
complexity. It's worth noticing that the random generation of seeds as well as
the instantiation of the hashes based on the seeds can be done in parallel as
well.

The new span for this step is reduced from O(n) to O(h) where h = log₂
n and n is the length of the array.

3.2 Populating

In the sequential algorithm, this step is done in linear time by iterating
over each of the elements of the array.

Updating the Bloom Filter with the hash values: As described for the
sequential algorithm, even though we calculate several hash functions for each
value, these operations don't change the asymptotic time complexity of this step,
as the operations are in constant time. We are therefore able to
parallelize the update of the Bloom Filter with the approach described in the
pseudocode from Figure 7. The new span for this step is
reduced from O(n) to O(h) where h = log₂ n and n is the length of the
array. It's worth noticing that Bloom Filters are built on top of the or
operation, which is very well suited for parallelization. As we don't
work with single bytes, in order to ensure no race
conditions, an architecture with support for concurrent atomic CPU operations
[atomic] must be chosen.
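
A sketch of what such a lock-free update could look like with
Interlocked.CompareExchange [interlocked], described in section 4, on the ref
cells from the initialization step:

```fsharp
open System.Threading

// Lock-free bit set: retry the compare-and-swap until our or'ed word is
// published, so concurrent updates of different bits are never lost.
let setBit (word: int ref) (bit: int) =
    let mask = 1 <<< bit
    let mutable current = word.Value
    while Interlocked.CompareExchange(&word.contents, current ||| mask, current) <> current do
        current <- word.Value
```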

Minimum and maximum: As above, we can also easily calculate these values
by storing locally the result of the comparison between two numbers and bubbling
them all the way to the root of the recursion tree. The new span for this
step is reduced from O(n) to O(h) where h = log₂ n and n is the length of the
array. If the architecture supports concurrent atomic CPU operations, then the
extra space needed can be reduced to O(1).
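
With the forkJoin sketch from above, the reduction could look like this: each
leaf returns (v, v) and every merge keeps the smaller and larger bound:

```fsharp
// Parallel minimum and maximum: comparisons are stored locally in the tuples
// and bubbled up to the root of the recursion tree.
let minMax (a: int []) : int * int =
    forkJoin (fun i -> a.[i], a.[i])
             (fun (lo1, hi1) (lo2, hi2) -> (min lo1 lo2, max hi1 hi2))
             0 (a.Length - 1)
    |> Async.RunSynchronously
```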

As both steps are reduced from O(n) to O(h) where h = log₂ n and n is the length
of the array, we can conclude that the new span for this phase is also
reduced from O(n) to O(h).

3.3 Sorting

In the sequential algorithm, the span, O(m), is bound to the work
which is in Θ(m), where m is the length of the interval from the minimum
value to the maximum value.

Querying the Bloom Filter with the hash values: We can easily parallelize
the queries of the values contained in the interval from the minimum value to
the maximum value with a fork/join approach, reducing the span from
O(m) to O(h′) where h′ = log₂ m. In order to make this value more
understandable, we will showcase the following example with 32-bit unsigned
integers. Let's assume that our list contains some 32-bit uints where the
minimum value equals 0 and the maximum value equals 4,294,967,295 (2³² -
1). Then our span would be O(32) as log₂ 4,294,967,295 ≈ 32.

Merging: Once we know a value is definitely not in the set, we can
return an empty array for that case, while in the other case we will return an
array of size one containing that element. The merge process consists of
combining two arrays into one by copying the elements, in parallel, into the
newly created array, with the same fork/join approach as mentioned in
this section. The initialization and merge of the top array will be bounded by
n, where n is the size of the array to sort, therefore reducing the
span of this process to O(lg n), as we can create a new array in constant
time and populate it in O(h) where h = log₂ n.
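
A sketch of this phase, reusing the forkJoin combinator from above:

```fsharp
// Parallel sorting phase: each value in [lo, hi] becomes an empty array or a
// singleton, and each merge concatenates two sorted arrays into a new one.
let sortRange (mayContain: int -> bool) (lo: int) (hi: int) : int [] =
    forkJoin (fun v -> if mayContain v then [| v |] else Array.empty)
             (fun l r -> Array.append l r)
             lo hi
    |> Async.RunSynchronously
```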

As the second step dominates the first, we can argue that the span of this
phase is reduced from O(m) to O(h), where h = log₂ n and n is the length of
the array.

Sadly, we will have an overhead in memory usage, as we are storing empty arrays
for at most m values. But since the merges of empty arrays, which are
minimal in size, could be kept in the CPU caches of the unlimited cores, the
memory overhead wouldn't have that big of an impact on the algorithm.

3.4 Summary

By combining the time complexities from the different steps:

Tₖ(n) = O(h) + O(h) + O(h) = 3 · O(h) = O(h)

where h = log₂ n and n is the length of the array. Therefore, we can
conclude that the work is Θ(m) where m is the length of the interval
from the minimum value to the maximum value in the array to be sorted and
the span is in logarithmic time.

As mentioned in the previous section, there will be an overhead in memory usage,
as we are storing empty arrays for at most m values. If the merge operations
of empty arrays, which are minimal in size, could be kept in the CPU cache of
each of the unlimited cores, then the memory overhead wouldn't have that big
of an impact on the algorithm.

4. Implementation

The algorithms are implemented in F#, which could easily be described as the
OCaml implementation for .NET, as we already had a working native
version of the Bloom Filter data structure. Here are a few implementation
details that are worth noticing:

The hash algorithm chosen is MurmurHash3 [smhasher], as it supports a
seed which can be used to generate different hashes for the same key by re-using
the same algorithm. In order to get better seeds, and hereby ensure a more
uniform distribution of values across the filter, see Figure 8 for a graphical
representation, the RNGCryptoServiceProvider is used for random number
generation instead of System.Random. See the Hash and Random modules for
more information.
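
A sketch of how the seed generation could look, in the spirit of the Random
module (the function name is ours):

```fsharp
open System
open System.Security.Cryptography

// k cryptographically random 32-bit seeds, one per hash function.
let randomSeeds (k: int) : int [] =
    use rng = new RNGCryptoServiceProvider()
    Array.init k (fun _ ->
        let buffer = Array.zeroCreate<byte> 4
        rng.GetBytes buffer
        BitConverter.ToInt32(buffer, 0))
```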

Usage of Interlocked.CompareExchange [interlocked], which allows concurrent
atomic CPU operations to perform updates on data values with minimal overhead,
see Figure 9. In order to use these operations with the space-efficient data
structure, we need to instantiate our underlying array with ref 'a
values. In case the reader is not convinced that the minimum and maximum numbers
can be found in parallel with these operations, the code can easily be refactored to a
binary tree of height h = log₂ n by using a fork/join approach, where
the values of the comparisons are stored locally. It's easy to see that the
span, the critical path, is then also O(lg n) time but with a bit more space overhead,
O(n), as described in the previous section. Please look into the Atomic and
BloomFilter modules for more information.

Figure 9: Several processors reading from the same shared memory but only one can
write atomically in a clock cycle.
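
In the same spirit, a sketch of an atomic minimum via a CompareExchange retry
loop: many processors may race on the shared cell, but a smaller value can
never be lost.

```fsharp
open System.Threading

// Atomic minimum: retry the compare-and-swap while our candidate is still
// smaller than the published value.
let atomicMin (cell: int ref) (v: int) =
    let mutable current = cell.Value
    while v < current
          && Interlocked.CompareExchange(&cell.contents, v, current) <> current do
        current <- cell.Value
```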

In the parallel versions of the algorithm, using true parallelism with
Array.Parallel (system threads) compared to using concurrency with
Async.StartChild (lightweight threads) in a fork/join approach shows
that there is an extra memory overhead in the first, as the operations require
allocating all the values in the range from minimum to maximum in an array in
order to filter out values that are definitely not in the set. In the second
approach there is less memory overhead, as we can use the same approach as
described in the previous paragraph for finding the minimum and
maximum numbers in an array in parallel, and merge lists by using a lazy list that allows
concatenation in constant time, O(1). See the BloomFilter and List.Lazy
modules for more information.
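
As an illustration of the lazy-list idea, a representation with O(1)
concatenation could look like this (our sketch, not the actual List.Lazy API):

```fsharp
// A lazy list where concatenation just builds a node; the traversal is
// deferred until the sequence is forced.
type LazyList<'a> =
    | Nil
    | Leaf of 'a
    | Concat of LazyList<'a> * LazyList<'a>

let append l r = Concat (l, r)   // O(1): no copying

let rec toSeq (l: LazyList<'a>) : seq<'a> =
    seq {
        match l with
        | Nil -> ()
        | Leaf v -> yield v
        | Concat (a, b) ->
            yield! toSeq a
            yield! toSeq b
    }
```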

A full implementation can be seen in Appendix A: Source Code.

5. Concluding remarks

We have presented both a sequential and a parallel algorithm that use
the Bloom Filter data structure as an underlying component in order to sort
an array with a very small collision probability of 0.00001%, which can be
raised or lowered for optimal results based on the size of the array.

We argue that the sequential algorithm's asymptotic complexity is in Θ(m)
time as the work is in Θ(m), where m is the length of the
interval from the minimum value to the maximum value. As this is a
sequential algorithm, the span matches the work. It's also worth
noticing that we only need n extra space, where n is the length of the array
to be sorted.

And we argue that the parallel algorithm's asymptotic complexity is in
logarithmic time, as the dominating operation is bounded by the span,
O(lg n), where n is the length of the array. We also argue that the
work is Θ(m), where m is the length of the interval from the minimum
value to the maximum value in the array, and therefore we will need m extra
space.

5.1 Notes:

In order to ease the understanding of the span of the parallel algorithm,
the following example was used: let's assume that our list contains some
32-bit uints where the minimum value equals 0 and the maximum value
equals 4,294,967,295 (2³² - 1). Then our span would be O(32) as log₂
4,294,967,295 ≈ 32.

In order to limit the amount of work, it would be ideal to use this
sorting algorithm on data that is not too scattered, as the amount of
work is defined by the interval from the minimum to the maximum value of the
array.

As this is a probabilistic sorting algorithm, there is a small chance
that the given array is not sorted after executing the algorithm. Therefore
this algorithm shouldn't be used for critical operations.

It can easily be checked whether an array is sorted in O(lg n), where n is the
size of the array, by using the fork/join parallel approach mentioned in
section 3. Parallelizing the algorithm: you just need to ensure that the
last element of the merged left array is less than or equal to the head of the
right array and bubble that boolean value up to the root of the tree, as shown
in the sketch below. In case the array is not sorted, the sorting algorithm can
be executed again with an even lower probability, which would result in a bit
more memory overhead and work.
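
A sketch of this check, reusing the forkJoin combinator from section 3:

```fsharp
// Parallel sortedness check: each leaf keeps its boundary values, and a
// merged range is sorted when both halves are sorted and the last element
// of the left half is <= the first element of the right half.
let isSorted (a: int []) : bool =
    forkJoin (fun i -> (a.[i], a.[i], true))
             (fun (first1, last1, ok1) (first2, last2, ok2) ->
                 (first1, last2, ok1 && ok2 && last1 <= first2))
             0 (a.Length - 1)
    |> Async.RunSynchronously
    |> fun (_, _, ok) -> ok
```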