Benchmarks vs R vs Python

The performance of string sorting is a nuanced topic. It appears that Julia is the fastest (as is shown in the cover photo). However, the story can change if we look at a couple more synthetic examples.

Julia is the fastest when the number of unique strings is close to the number of strings; Python comes a distant second, and R is third.

R is the fastest if there are lots of duplicate values (or, put another way, if the ratio of unique strings to strings is small, e.g. 1:100) and there are a large number of elements. Julia can sometimes beat R even when there are lots of duplicates, if the number of elements to sort is small (e.g. 10 million); this is shown in the benchmarks below.

Why is R so fast? It uses a form of string interning, which is discussed in more detail later in the blog post. Theoretically, this approach requires more set-up time. Julia doesn’t have interned strings by default and hence cannot perform the optimization that R uses out of the box.

Side note about alternative data types: factor/categorical

If the number of unique strings is small, one can use factor/categorical types in Julia/R/Python to represent the string vector instead of using strings. With optimized algorithms, these can yield significant speed-ups in sorting performance.
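As a rough illustration of why an integer-coded representation helps, here is a minimal sketch in plain Julia that mimics what a factor/categorical type does. The function name sort_via_codes is hypothetical, and real categorical packages use their own, more optimized representations.

```julia
# Hypothetical sketch: sort a string vector through integer codes.
# Only the distinct strings are ever compared; the rest is counting.
function sort_via_codes(v::Vector{String})
    levels = sort!(unique(v))                        # sorted distinct values
    code = Dict{String,Int}(s => i for (i, s) in enumerate(levels))
    counts = zeros(Int, length(levels))
    for s in v
        counts[code[s]] += 1                         # counting pass over the codes
    end
    out = String[]
    sizehint!(out, length(v))
    for (i, s) in enumerate(levels)
        append!(out, fill(s, counts[i]))             # emit each level counts[i] times
    end
    return out
end
```

When the codes are computed once and stored with the data, as a categorical type does, later sorts skip the dictionary-building step entirely.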

I have yet to be able to find a high-quality string radix sort implementation in Python, see this SO post. So Python’s results may improve if I can find such an implementation.

The journey towards faster string-sort in Julia

If you are interested in the journey leading up to the implementation of string radix sort, read on. You might find these points interesting:

How to load underlying bytes from strings?

Some pointers about pointer arithmetic in Julia

Motivation and previous state of String sorting in Julia

Being able to sort strings fast is a key pillar of modern data manipulation. It is often acknowledged that when we sort a vector of strings, what we actually want is to group them, but it is still valuable to be able to sort strings fast.

However, an initial investigation revealed that string sorting in Julia is slow compared to R when sorting strings with lots of duplicated values. This is probably comparing Julia to C via R, but users, to put it bluntly, probably don’t care. A 3x performance drop is not a fantastic story for Julia. It’s meant to be fast, right? Python is also slow (see benchmarks above), and hopefully pandas2 can help address that.

Towards faster string sorts

With that in mind, I wanted to investigate whether Julia can become fast at string sorting as well, or at least get close to R’s performance. After some research I found that R uses radix sort to sort strings, so a natural starting point is a Julia implementation of string radix sort.

Most of my research points towards some variant of Most Significant Digit (MSD) radix sort for strings, see 1 and 2. Also, there is an LSD radix sort for some bits types (but not strings) already implemented in SortingAlgorithms.jl. So I have implemented radix sort algorithms of both MSD and LSD varieties.

About radix sort

I found these lecture notes to be a good introduction to string radix sort. Even though the source code is in C, one can easily translate it line-by-line to Julia.

Below are a couple of issues that I encountered while developing the radix string sort algorithms.

Problem 1: access to underlying bytes

To perform a radix sort, one needs access to the underlying bytes. One way to load the nth byte (code unit) of a string is via codeunit(s, n), e.g.

# `n` is a 1-based byte index into `s`, not a character index
charAt(s::String, n) = @inbounds codeunit(s, n)

I timed the above, and according to my calculations, this will be too slow to match R’s performance.

After much experimentation, I found that loading 8 bytes at a time is almost as fast as loading just 1 byte, so that became my preferred approach. E.g. see below

# `pointer` returns a pointer to `UInt8` (i.e. a byte) that points to the first byte of a string
# `Ptr{UInt64}` converts the pointer to a pointer of `UInt64` and so `unsafe_load`
# will load exactly 8 bytes (64 bits) starting from the location pointed at by the pointer
load_8bytes(s) = s |> pointer |> Ptr{UInt64} |> unsafe_load

There are two sub-issues with this approach as well:

Sub-issue 1: string shorter than 8 bytes
One has to be careful about how to handle strings shorter than 8 bytes. One approach is to load 8 bytes anyway and set the unneeded bits to 0, but this can read memory that is not available to the program and cause a crash. This was pointed out to me during the code review process. From my testing, loading 8 bytes for shorter strings is fine most of the time, but we still have to be careful not to crash the program. The current solution is to test whether the string is shorter than 8 bytes and, if so, use a slower loader.
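A minimal sketch of such a length check follows. The name load_8bytes_checked is hypothetical, and the ntoh byte swap is an assumption made here so that the first byte lands in the most significant position on little-endian machines; the actual implementation may differ.

```julia
# Hypothetical helper, not the actual package code.
# Packs the first 8 bytes of `s` into a UInt64 with the first byte in the most
# significant position, so that comparing the integers compares the prefixes.
function load_8bytes_checked(s::String)
    n = ncodeunits(s)
    if n >= 8
        # Fast path: the string owns at least 8 bytes, so the load stays in bounds.
        # `ntoh` byte-swaps on little-endian machines so byte 1 ends up on top.
        return GC.@preserve s ntoh(unsafe_load(Ptr{UInt64}(pointer(s))))
    end
    # Slow path: build the value byte by byte and pad the tail with zeros.
    x = zero(UInt64)
    for i in 1:n
        x = (x << 8) | UInt64(codeunit(s, i))
    end
    return x << (8 * (8 - n))
end
```

With this layout a shorter string that is a prefix of a longer one always packs to the smaller integer, which matches the usual lexicographic order.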

There are more optimization possibilities. For example, most of the time only a small fraction of strings are stored in locations where loading 8 bytes will cause an issue. To understand this better we need some understanding of how memory is organized; below is my summarised understanding

data is loaded into memory in pages of a certain size (on most 64-bit machines the size will be at least 4 KB)

when byte-loading you can load from anywhere within the same page, but loading across a page boundary may crash the program if the next page is not available to it

therefore only those strings that start fewer than 8 bytes before a page boundary can cause an issue

as pointed out by Julia Discourse member @stevengj, one can check if a string s is near the boundary using (UInt(pointer(s)) & 0xfff) > 0xff8
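The boundary check from the list above can be written out as a small sketch. It assumes 4 KiB pages, as the quoted check does, and the names PAGE_SIZE, crosses_page and unsafe_to_load_8bytes are made up for illustration.

```julia
# Assumption: 4 KiB pages, as in the check quoted above.
const PAGE_SIZE = UInt(4096)

# True when reading `nbytes` starting at address `addr` would run past the
# end of the page that `addr` lives in.
crosses_page(addr::UInt, nbytes::Integer) =
    (addr & (PAGE_SIZE - 1)) + UInt(nbytes) > PAGE_SIZE

# The same test applied to the address of a string's first byte.
unsafe_to_load_8bytes(s::String) =
    GC.@preserve s crosses_page(UInt(pointer(s)), 8)
```

For nbytes = 8 this is the same condition as (UInt(pointer(s)) & 0xfff) > 0xff8; a loader could take the slow path only when it returns true and the string is shorter than 8 bytes.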

Finally, it’s worthwhile to remember that the slowest part of the sorting algorithm is not in the loading of bytes, it is the actual sorting.

Sub-issue 2: string longer than 8 bytes
If the string is longer than 8 bytes, I can sort the string vector iteratively, 8 bytes at a time. There are well-known methods for doing this in both the MSD and LSD variants of radix sort, which I shall not repeat here.
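For the LSD variant, one way to picture sorting 8 bytes at a time is the sketch below: it sorts on the last 8-byte chunk first and works towards the first chunk, using a stable sort at each step. To stay short it uses Base’s stable MergeSort on each chunk in place of a radix pass, and the names chunk and lsd_chunk_sort are hypothetical.

```julia
# Hypothetical helper: pack code units 8(k-1)+1 .. 8k of `s` into a UInt64,
# first byte most significant, zero-padded past the end of the string.
function chunk(s::String, k::Int)
    x = zero(UInt64)
    base = 8 * (k - 1)
    for i in 1:8
        j = base + i
        b = j <= ncodeunits(s) ? codeunit(s, j) : 0x00
        x = (x << 8) | UInt64(b)
    end
    return x
end

# LSD over chunks: a stable sort on chunk k keeps the order already
# established by chunks k+1, k+2, ... among strings that tie on chunk k.
function lsd_chunk_sort(v::Vector{String})
    isempty(v) && return copy(v)
    nchunks = cld(maximum(ncodeunits, v), 8)
    out = copy(v)
    for k in nchunks:-1:1
        ks = [chunk(s, k) for s in out]
        out = out[sortperm(ks; alg = MergeSort)]
    end
    return out
end
```

The number of passes is driven by the longest string, which hints at why LSD can become wasteful when a few strings are much longer than the rest.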

Problem 2: Permuting strings while radix sorting

Once I have loaded the underlying bytes into a bytes-vector, I can sort the bytes-vector using radix sort, which is quite fast. However, I also need to permute the original string-vector at the same time. To do this I have coded up a sorttwo!(bytesvec, stringvec) function, which sorts the bytes-vector bytesvec and permutes the string vector in the same way bytesvec is permuted in the sorting process. The sorttwo! function is a simple adaptation of the existing radix sort function in SortingAlgorithms.jl.

The other application of sorttwo! is in the implementation of a faster sortperm for strings. For R users, sortperm is the equivalent of R’s order.
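To make the idea concrete, here is a minimal sketch of a radix sort that moves a companion vector in lockstep with the keys. It is a hypothetical stand-in named sorttwo_sketch!, written from scratch for illustration; it is not the sorttwo! code itself nor the SortingAlgorithms.jl implementation.

```julia
# Hypothetical stand-in for sorttwo!, not the package implementation.
# LSD radix sort on 64-bit keys, one byte per pass, moving `vals` in lockstep.
function sorttwo_sketch!(ks::Vector{UInt64}, vals::Vector)
    n = length(ks)
    n == length(vals) || throw(ArgumentError("ks and vals must have equal length"))
    ks2, vals2 = similar(ks), similar(vals)      # scratch buffers
    for shift in 0:8:56
        # Histogram of the current byte.
        counts = zeros(Int, 256)
        for k in ks
            counts[Int((k >> shift) & 0xff) + 1] += 1
        end
        # offsets[b] is the number of elements that fall in buckets before b.
        offsets = cumsum(counts) .- counts
        # Stable scatter of keys and values into the scratch buffers.
        for i in 1:n
            b = Int((ks[i] >> shift) & 0xff) + 1
            offsets[b] += 1
            ks2[offsets[b]] = ks[i]
            vals2[offsets[b]] = vals[i]
        end
        ks, ks2 = ks2, ks
        vals, vals2 = vals2, vals
    end
    # After an even number of passes (8) the result is back in the original arrays.
    return ks, vals
end
```

Passing collect(1:length(ks)) as the second argument leaves the sorting permutation in that vector, which is the idea behind using the same routine for sortperm.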

Implementation of MSD and LSD algorithms

I have implemented an MSD and an LSD variant. From my research, it is often the case that MSD algorithms work better for variable-length strings and LSD algorithms work best for fixed-length strings.

Some even claim that LSD doesn’t work on vectors of variable-length strings. I think this is not true, as you can represent an empty byte with 0 (even though that’s technically a null). For example, if I load the string “abc” using the 8-byte loader it becomes, in hexadecimal form, 0x6162630000000000 (with the first character in the most significant byte, which on a little-endian machine requires a byte swap after the load), where 61, 62 and 63 are the hexadecimal representations of the ASCII codes of “a”, “b”, and “c”. I can then sort that using radix sort along with other strings. However, whether this is the most efficient approach is the real question, to which I do not have an answer.

My implementation of MSD radixsort is based on radix 3-way quicksort which is well-known and documented in 1 and 2 already.
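For readers unfamiliar with the algorithm, a compact illustrative sketch of three-way radix quicksort follows. It works one byte at a time through codeunit rather than 8 bytes at a time, and the names byte_at and radix_quicksort! are hypothetical; it is not the implementation benchmarked in this post.

```julia
# Byte at position d (1-based), or -1 when the string has fewer than d bytes,
# so that shorter strings sort before their extensions.
byte_at(s::String, d::Int) = d <= ncodeunits(s) ? Int(codeunit(s, d)) : -1

# Three-way radix quicksort: partition v[lo:hi] on byte d into <, ==, > the pivot,
# then recurse; only the == part advances to byte d + 1.
function radix_quicksort!(v::Vector{String}, lo::Int = 1, hi::Int = length(v), d::Int = 1)
    hi <= lo && return v
    pivot = byte_at(v[lo + (hi - lo) ÷ 2], d)
    lt, gt, i = lo, hi, lo
    while i <= gt
        c = byte_at(v[i], d)
        if c < pivot
            v[lt], v[i] = v[i], v[lt]
            lt += 1
            i += 1
        elseif c > pivot
            v[i], v[gt] = v[gt], v[i]
            gt -= 1
        else
            i += 1
        end
    end
    radix_quicksort!(v, lo, lt - 1, d)
    # Strings that ended at byte d - 1 (pivot == -1) are all equal: nothing left to do.
    pivot >= 0 && radix_quicksort!(v, lt, gt, d + 1)
    radix_quicksort!(v, gt + 1, hi, d)
    return v
end
```

Calling radix_quicksort!(v) sorts v in place in byte order, which for Julia’s String type coincides with the order used by sort.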

From my benchmarking, my implementation of MSD is not as performant as the LSD algorithm, even for variable-length strings. This seems odd, as most of my research points toward MSD being more performant than LSD. This could be an indication that my implementation of MSD radix sort is suboptimal.

Actually, why is R so fast? Opening windows to other approaches

A number of individuals pointed out that R uses a form of string interning to store its strings. My understanding of how it works is this: if a = c(“abcdefghi”, “abcdefghi”) is a vector of two strings with the same content, then a[1] and a[2] just point to a single storage location for “abcdefghi” instead of storing two copies of the same string.

There is a global cache for CHARSXPs created by mkChar — the cache ensures that most CHARSXPs with the same contents share storage.

If each distinct string is stored only once, this can lead to space efficiencies. Also, more importantly, one may be able to exploit that (as R has) to make more performant algorithms.

Furthermore, this has the potential of simplifying group-by operations. If the user knows that all strings with the same content have the same pointer, then we can simply group-by the pointer which is of fixed size and is numeric and hence quicker to sort and group.
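A toy model of the idea, in plain Julia, is sketched below. It hands out one integer id per distinct string and then groups on the ids instead of on the string contents. InternPool, intern! and group_counts are hypothetical names; this is neither R’s CHARSXP cache nor the InternedStrings.jl API, and it uses integer ids in place of pointers.

```julia
# Hypothetical toy intern pool: one integer id per distinct string.
struct InternPool
    ids::Dict{String,Int}       # content -> id
    strings::Vector{String}     # id -> canonical copy
end
InternPool() = InternPool(Dict{String,Int}(), String[])

# Return the id for `s`, registering it on first sight (the set-up cost).
function intern!(pool::InternPool, s::String)
    get!(pool.ids, s) do
        push!(pool.strings, s)
        length(pool.strings)
    end
end

# Group-by on fixed-size ids: count how often each distinct string occurs.
function group_counts(v::Vector{String})
    pool = InternPool()
    counts = Int[]
    for s in v
        id = intern!(pool, s)
        id > length(counts) && push!(counts, 0)
        counts[id] += 1
    end
    return pool.strings, counts
end
```

All of the hashing happens while the pool is being filled; once every string carries its id, grouping and sorting only need to touch small integers.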

However, Julia doesn’t have interned strings by default (although there is a package, InternedStrings.jl), so these types of optimization are not readily available, which is why it may be hard for Julia to match R’s string sorting performance in all cases. But this does open up an alternative way of looking at things: R will take longer to load these strings, as it also needs to load them into the global cache. This longer loading time buys faster sorting. Therefore it may be possible in Julia to create a data structure that mimics R’s behavior and results in more performant sorting. So comparing R’s sorting speeds to Julia’s is currently not the complete story, even though on the surface R appears faster and, from a user’s perspective (once the data is loaded), R is still the king of speed.

Future work

An obvious way to speed things up is to adopt parallel techniques. The paper “Engineering Parallel String Sorting for Multi-Core Systems” is the top Google result for “parallel string sort”. Its findings show that “multi-key quicksort” (a multi-pivot variant of the MSD radix (quick)sort I have ported) is the fastest sequential sorting algorithm they found implemented in C/C++. This is worth investigating. They also point towards their parallel variant called Super Scalar String Sample Sort (S^5), which is performant on multi-core systems.
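As a very rough illustration of what a parallel approach could look like in Julia, the sketch below splits the strings into buckets by first byte and sorts the buckets as separate tasks. This is a simple bucket scheme chosen for brevity, not S^5 or any algorithm from the paper, and parallel_bucket_sort is a hypothetical name.

```julia
# Hypothetical sketch: one MSD-style split on the first byte, then sort each
# bucket as its own task. Start Julia with several threads to see any benefit.
function parallel_bucket_sort(v::Vector{String})
    buckets = [String[] for _ in 0:256]          # bucket 1 holds empty strings
    for s in v
        b = isempty(s) ? 1 : Int(codeunit(s, 1)) + 2
        push!(buckets[b], s)
    end
    tasks = [Threads.@spawn sort!(b) for b in buckets]
    foreach(wait, tasks)
    return reduce(vcat, buckets)                 # buckets are already in byte order
end
```

A skewed first byte puts most of the work into one bucket, which is the kind of load-balancing problem that sample-based splitters are designed to avoid.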

As discussed, my implementation of MSD string radix sort might be sub-optimal. This paper points towards Rantala’s C/C++ implementations of string radix sort as high quality, so that could be a good benchmark to try and match.

I have experimented with converting Strings to InternedStrings via the InternedStrings.jl package, and I found the performance to be too slow. There are optimizations possible for the conversion process, so we will check back once those optimizations are in to see if we can leverage InternedStrings to create even faster string sorts.