I'm very new to Haskell, and I have a question about what performance improvements can be had by using impure (mutable) data structures. I'm trying to piece together a few different things I've heard, so please bear with me if my terminology is not entirely correct, or if there are some small errors.

To make this concrete, consider the quicksort algorithm (taken from the Haskell wiki).
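For reference, the Haskell wiki's naïve quicksort is the classic short version (reproduced here as a sketch):

```haskell
-- The naïve list quicksort from the Haskell wiki: elegant, but it
-- allocates fresh lists at every recursion step instead of sorting
-- in place.
quicksort :: Ord a => [a] -> [a]
quicksort []     = []
quicksort (p:xs) = quicksort [x | x <- xs, x < p]
                   ++ [p]
                   ++ quicksort [x | x <- xs, x >= p]
```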

This is not "true quicksort": a true quicksort algorithm sorts in place, and this one does not, which makes it very memory-inefficient.

On the other hand, it is possible to use vectors in Haskell to implement an in-place quicksort. An example is given in this stackoverflow answer.
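To give an idea of what that looks like, here is a minimal in-place quicksort on a mutable vector (a sketch using Lomuto partitioning, not the exact code from the linked answer):

```haskell
import Control.Monad.ST (runST)
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as M

-- Thaw to a mutable vector, partition and sort in place,
-- then freeze back to an immutable vector.
inPlaceSort :: Ord a => V.Vector a -> V.Vector a
inPlaceSort v = runST $ do
    mv <- V.thaw v
    go mv 0 (M.length mv - 1)
    V.freeze mv
  where
    go mv lo hi
        | lo >= hi  = return ()
        | otherwise = do
            p <- partition mv lo hi
            go mv lo (p - 1)
            go mv (p + 1) hi
    -- Lomuto partition: the pivot is the last element of the range.
    partition mv lo hi = do
        pivot <- M.read mv hi
        let loop i j
              | j >= hi   = M.swap mv i hi >> return i
              | otherwise = do
                  x <- M.read mv j
                  if x < pivot
                      then M.swap mv i j >> loop (i + 1) (j + 1)
                      else loop i (j + 1)
        loop lo lo
```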

How much faster is the second algorithm than the first? Big O notation doesn't help here, because the performance improvement comes from using memory more efficiently, not from a better algorithm (right?). I tried to construct some test cases on my own, but I had difficulty getting things running.

An ideal answer would give some idea of what makes the in-place Haskell algorithm faster theoretically, and an example comparison of running times on some test data set.

Here, Data.List.sort is just what it is; Naïve.quicksort is the algorithm you quoted; UArray_IO.quicksort and Vector_Mutable.quicksort are taken from the question you linked to, namely klapaucius's and Dan Burton's answers (which turn out to be very suboptimal performance-wise; see how much better Daniel Fischer could do it), both wrapped so as to accept lists (not sure if I got this quite right):

As you can see, the naïve algorithm is not far behind the mutable Data.Vector solution in terms of speed when sorting a list of randomly generated integers, and the IOUArray is actually much worse. The test was carried out on an Intel i5 laptop running Ubuntu 11.10 x86-64.

The following doesn't really make much sense, considering that good mutable implementations are, after all, still well ahead of all those compared here.

Note that this does not mean that a nice list-based program can always keep up with its mutably implemented equivalents, but GHC sure does a great job of bringing the performance close. Also, it depends of course on the data: these are the times when the randomly generated lists to sort contain values between 0 and 1000 rather than 0 and 1000000 as above, i.e. with many duplicates:

What's quite interesting (and becomes apparent only with really large sizes, which require rtsopts to increase the stack capacity) is how both mutable implementations become significantly slower with -fllvm -O2:

It seems kind of logical to me that the immutable implementations fare better with LLVM (doesn't it do everything immutably at some level?), though I don't understand why this shows up only as a slowdown of the mutable versions, and only at high optimisation levels and large data sizes.

Fantastic, thank you. I suspect the array implementation can be made much better. The answerer admits to directly porting code from Wikipedia, so it's probably not as efficient as it could be.
– Potato, Jul 14 '12 at 11:36


I think so too; this is really just "let's port this C code to Haskell, changing as little as possible", which can't be expected to be much good. However, Data.Array is generally said to be slower than Vector, so I doubt it could actually get much better than that. Somewhere in the future, Data Parallel Haskell's [: :] will probably beat them all, including Fortran arrays...
– leftaroundabout, Jul 14 '12 at 11:46

@Potato Indeed, one can make much, much better array versions. And much, much better vector versions. Both should beat the list versions by a huge margin.
– Daniel Fischer, Jul 14 '12 at 16:32

Hrm, I wrote the UArray_IO implementation as sort of a proof of concept that you can write C in Haskell, so I wasn't necessarily going for speed, but I'm rather surprised at how abysmally that code performed in your tests. Anyone have any clues why the IOUArray code I wrote is so slow? Needs more INLINE pragmas, perhaps?
– Dan Burton, Jul 14 '12 at 22:44

On the other hand, it is possible to use vectors in Haskell to implement an in-place quicksort.

How much faster is the second algorithm than the first?

That depends on the implementation, of course. As can be seen below, for not too short lists, a decent in-place sort on a mutable vector or array is much faster than sorting lists, even if the time for the transformation from and to lists is included (and that conversion makes up the bulk of the time).

However, the list algorithms produce incremental output, while the array/vector algorithms don't produce any result before they have completed, so sorting lists can still be preferable.
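To illustrate: with the lazy list quicksort, demanding only a prefix of the result forces only part of the sorting work, something no in-place array sort can offer (a sketch):

```haskell
-- The naïve lazy list quicksort again, for reference.
quicksort :: Ord a => [a] -> [a]
quicksort []     = []
quicksort (p:xs) = quicksort [x | x <- xs, x < p]
                   ++ [p]
                   ++ quicksort [x | x <- xs, x >= p]

-- Laziness means only the recursive calls needed for the first three
-- elements are evaluated; the rest of the result stays as thunks.
smallest3 :: Ord a => [a] -> [a]
smallest3 = take 3 . quicksort
```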

I don't know exactly what the linked mutable array/vector algorithms did wrong. But they did something quite wrong.

For the mutable vector code, it seems that it used boxed vectors and was polymorphic, both of which can have a significant performance impact, though the polymorphism shouldn't matter if the functions are {-# INLINABLE #-}.

For the IOUArray code, well, it looks fun, but slow. It uses an IORef, readArray and writeArray, and has no obvious strictness. The abysmal times it takes aren't too surprising, then.

Using a more direct translation of the (monomorphic) C code using an STUArray, with a wrapper to make it work on lists¹,
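Such a translation might look roughly like this (a sketch, not the exact benchmarked code; monomorphic on Int, with a list wrapper):

```haskell
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newListArray, getElems)
import Data.Array.Base (unsafeRead, unsafeWrite)

-- Wrapper: load the list into an unboxed mutable array,
-- quicksort it in place, and read the elements back out.
sortViaSTUArray :: [Int] -> [Int]
sortViaSTUArray xs = runST $ do
    let n = length xs
    arr <- newListArray (0, n - 1) xs :: ST s (STUArray s Int Int)
    qsort arr 0 (n - 1)
    getElems arr

-- Monomorphic in-place quicksort with Lomuto partitioning;
-- unsafeRead/unsafeWrite skip the bounds checks of readArray/writeArray.
qsort :: STUArray s Int Int -> Int -> Int -> ST s ()
qsort arr lo hi
    | lo >= hi  = return ()
    | otherwise = do
        pivot <- unsafeRead arr hi
        let go i j
              | j == hi   = swap i hi >> return i
              | otherwise = do
                  x <- unsafeRead arr j
                  if x < pivot
                      then swap i j >> go (i + 1) (j + 1)
                      else go i (j + 1)
        p <- go lo lo
        qsort arr lo (p - 1)
        qsort arr (p + 1) hi
  where
    swap i j = do
        a <- unsafeRead arr i
        b <- unsafeRead arr j
        unsafeWrite arr i b
        unsafeWrite arr j a
```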

I get times more in line with expectations. (Note: for these timings, the random list was deepseqed before calling the sorting algorithm. Without that, the conversion to an STUArray would be much slower, since it would first evaluate a long list of thunks to determine the length. The fromList conversion of the vector package doesn't suffer from this problem. Moving the deepseq to the STUArray conversion, the other sorting [and, in the vector case, conversion] algorithms take a little less time, so the difference between vector-algorithms' introsort and the STUArray quicksort becomes a little larger.):

The times without optimisation are, as expected, bad for the STUArray: unsafeRead and unsafeWrite must be inlined to be fast, and if they aren't, you get a dictionary lookup for each call. Thus, for the large dataset, I omit the unoptimised runs:

You can see that an in-place sort on a mutable unboxed array is much faster than a list-based sort if done correctly. Whether the difference between the STUArray sort and the sort on the unboxed mutable vector is due to the different algorithms, or whether vectors are indeed faster here, I don't know. Since I've never observed vectors to be faster² than STUArrays, I tend to believe the former.
The difference between the STUArray quicksort and the introsort is due partly to the better conversion from and to lists that the vector package offers, and partly to the different algorithms.

At Louis Wasserman's suggestion, I ran a quick benchmark using the other sorting algorithms from the vector-algorithms package on a not-too-large dataset. The results aren't surprising: the good general-purpose algorithms (heapsort, introsort and mergesort) all do well, with times near the quicksort on the unboxed mutable array (though, of course, the quicksort would degrade to quadratic behaviour on almost-sorted input, while these have a guaranteed O(n*log n) worst case). The special-purpose algorithms, AmericanFlag and radix sort, do badly, since the input doesn't suit their purpose (radix sort would do better on larger inputs with a larger range; as it stands, it does many more passes than the data needs). Insertion sort is by far the worst, due to its quadratic behaviour.

Conclusion: unless you have a specific reason not to, using one of the good general-purpose sorting algorithms from vector-algorithms, with a wrapper to convert from and to lists if necessary, is the recommended way to sort large lists. (These algorithms also work well with boxed vectors; in my measurements they were approximately 50% slower than unboxed.) For short lists, the conversion overhead would be so large that it doesn't pay.
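Concretely, such a wrapper can be a one-liner (assuming the vector and vector-algorithms packages; Intro.sort is the introsort discussed above):

```haskell
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Algorithms.Intro as Intro

-- Convert the list to an unboxed vector, sort it in place with
-- introsort, and convert back. U.modify runs the destructive sort
-- on a private copy, so the function stays pure.
sortInts :: [Int] -> [Int]
sortInts = U.toList . U.modify Intro.sort . U.fromList
```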

Now, at @applicative's suggestion, a look at the sorting times for vector-algorithms' introsort, a quicksort on unboxed vectors and an improved (shamelessly stealing the implementation of unstablePartition) quicksort on STUArrays.

As expected, the practically identical quicksorts on the STUArray and the unboxed vector take practically the same time. (The old quicksort implementation was about 15% slower than the introsort. Compared to the times above, about 70-75% of the time there was spent converting from/to lists.)

On the random input, the quicksorts perform significantly better than the introsort, but on almost-sorted input, their performance would degrade while introsort wouldn't.

¹ Making the code polymorphic with STUArrays is a pain at best. Doing it with IOUArrays, with both the sorting and the wrapper marked {-# INLINABLE #-}, produces the same performance with optimisations; without them, the polymorphic code is significantly slower.

² Using the same algorithms, both were always equally fast within the precision of measurement when I compared (not very often).

I sure was a bit surprised too, but then I thought: oh well, the majority of the work is done on small subarrays, so we have lots of references to different places anyway, so why shouldn't the list solution compare, provided the memory allocations are fast? Good that you corrected that.
– leftaroundabout, Jul 14 '12 at 16:44

Yeah, GHC can allocate blazingly fast. But not allocating is still faster.
– Daniel Fischer, Jul 14 '12 at 16:50