When theory meets practice

By Jim, on October 25th, 2011

A fairly common problem in programming is selecting the “best” items from a large group. For example, I have an application that makes video recommendations based on user input. The program takes the user’s input (typically the name of a song, an artist or a band), matches that up with our index of videos, and then does some calculation to come up with a relevance score for each video. The details of that calculation are interesting, but will have to be covered another time.

When all the calculation is done, I’ll have a list of perhaps two million possible video recommendations, and I need to select the top 200 (based on relevance score) to present to the user. There are several ways to do this, as described in Selecting k smallest or largest elements.

In the discussion below, n is the total number of items in the list and k is the number that I want to select. Using the numbers above, n would be equal to 2,000,000, and k would be 200.

When I first wrote the code, I used the simple heap selection algorithm. In pseudo code, it looks like this:

Create an empty min heap
Add the first k items from the array to the heap
for items k+1 to n
if (item > smallest item on the heap)
Remove smallest item from the heap
Add new item to the heap

That algorithm performs quite well with our data. It is somewhat sensitive to order. If the recommendations in the items array were sorted by relevance score, this would perform very poorly because every item would have to be added to the heap, and all but the last k items would be removed from the heap. Fortunately, it would be highly unlikely to see such an ordering in our data.

I was discussing this problem with a friend recently who suggested that my algorithm was less than optimal and that I should look into a better approach.

In big O notation, that heap selection algorithm’s complexity is O(n log k). Adding an item to a heap of size k takes up to log(k) operations, as does removing an item. And since in the worst case every item gets added to and removed from the heap, there will be n insertions and n removals. That’s 2*(n log k), but constant factors are removed in big O notation, so we’re left with O(n log k).

There are at least two algorithms listed in the Wikipedia article that have better theoretical running times than my simple heap selection technique. One technique involves turning the array of two million items into a max heap and then doing a breadth-first traversal of the first k nodes. Turning an array into a heap is an O(n) operation, and the breadth-first search is O(k log k), giving us a complexity of O(n + k log k).

The other interesting algorithm is called QuickSelect. It uses the Quicksort partitioning algorithm to partition the array in linear time. QuickSelect is based on the idea of finding the median element in an array, which you can do in linear O(n) time. Given the median element, partitioning an array such that all items less than the median are to the “left,” and all items greater than or equal to the median are to the “right” is also an O(n) operation. That’s not exactly how QuickSelect works, but that’s the idea behind it. I should note that QuickSelect has expected linear time. Its worst case, though (when data is ordered or partially ordered), is really bad.

So in theory, the fastest algorithm should be QuickSelect, followed by the breadth-first heap search (which I call BFSHeap), followed by the simple heap selection algorithm (HeapSelect). I thought I’d do some tests to see how well the theory reflects reality.

The BFSHeap turned out to be a bad idea. Although the theoretical complexity analysis tells us that it should be faster than HeapSelect, the O(n) time to build the heap has a huge constant. Whereas the time to build the heap is proportional to the number of items, that “proportion” is very large. It takes more than twice as long for BFSHeap to turn the array into a heap as it does for QuickSelect to select items. Because of that, I didn’t include BFSHeap in my detailed analysis. It’s interesting to note, though, that BFSHeap performs better than HeapSelect when k exceeds about 8% of n. But it’s still much slower than QuickSelect.

QuickSelect, like QuickSort, can be inefficient if the data is sorted or reverse-sorted, and is also sensitive to ordered (or partially ordered) subranges. A simple median of three (or, better, median of five) pivot selection reduces or eliminates those cases. I used a median of three in my tests. Median of five will give slightly better average performance, and much better worst case performance.

HeapSelect, as I noted above, also performs poorly with ordered data, although it’s not as sensitive to partially ordered subranges as is QuickSelect.

Here’s the data for those of you who don’t want to read the rest of this note. In short, HeapSelect is faster if you’re selecting up to about 1% of the items in the list. QuickSelect is faster beyond that, and hugely faster before you get to 10%. HeapSelect has the disadvantage of requiring O(k) extra space, but when k is small, that extra space won’t normally be a problem.

Time (milliseconds) to select k items from an array of 2,000,000

20

200

2,000

20,000

200,000

QuickSelect (random)

24.36

25.66

25.18

25.85

28.54

HeapSelect (random)

4.97

5.23

7.34

24.49

139.06

QuickSelect (sorted)

15.18

15.02

14.97

14.95

14.97

HeapSelect (sorted)

107.91

196.84

235.28

293.36

330.93

QuickSelect (reverse)

7.81

7.73

7.63

7.65

7.60

HeapSelect (reverse)

5.11

4.92

5.07

6.23

22.81

QuickSort (random)

356.90

357.95

357.96

358.24

358.01

QuickSelect1 (sorted)

153.58

1,520.43

n/a

n/a

n/a

QuickSelect1 (reverse)

80.56

791.03

n/a

n/a

n/a

All of my tests were written in C# and run in release mode with the 64-bit runtime in .NET 4.0. Each test was run 100 times after an initial run to eliminate JIT times, and the time reported is the average of those 100 runs.

The QuickSort is included for comparison. Obviously if you sort the data then taking the top k items is trivial. But it’s expensive. In the average case, QuickSelect was more than 10 times as fast as QuickSort.

As noted, QuickSelect is an O(n) algorithm. Given an array of size n, it will take the same time to run, regardless of how many items you want to select from the array. Selecting 200 items from an array of 2,000,000 takes just as long as selecting 200,000 items from the same array. My QuickSelect implementation running on the configuration I noted above runs at about 13 milliseconds per million items. Selecting from an array of 100,000 items takes about 1.3 milliseconds. With 2,000,000 items, it takes about 26 milliseconds. The times are very consistent across a large range.

QuickSelect1 is a modified QuickSelect without the median-of-three pivot selection. I included the timings to show how poorly a naive QuickSelect implementation will perform when presented with ordered data. I only included timings for selecting 20 and 200 items, as it’s obvious where things are heading: to select 2,000 items would take 10 times as long as it took to select 200 items. QuickSelect with a median-of-three pivot selection reduces the likelihood of running across these pathological cases, but it’s still possible to hit them in real-world data. Check out median of three killer for more information. A median-of-five pivot selection would be better.

HeapSelect is O(n log k) in the worst case, meaning that the more items you want to select, the longer it will take. Selecting 200 items will be a lot faster than selecting 200,000 items. How much longer is an interesting calculation. Based on the theoretical worst-case running times, the difference should be log(200000)/log(200), or somewhere in the neighborhood of about two and a half times as long. But that’s only true in the worst case–when the data is in sorted order. The real running time is much more skewed.

The running time of HeapSelect is dominated by the number of items added to the heap. When the data is unordered, the actual number of items added to the heap is proportional to k–the number of items to be selected. When k is small in comparison to n, the number of items added to the heap is also small. For example, here are some sample values, again run against an array of 2,000,000 items. m is the number of items actually added to the heap.

k

m

m/k

2

29

14.5

20

249

12.45

200

2,038

10.19

2,000

15,831

7.915

20,000

112,093

5.605

200,000

660,508

3.303

When selecting 200 items, only about 2,000 items are added to the heap. When selecting 200,000 items, 660,000 items are added to the heap: about 300 times more items. This explains why the actual running times are so much different from the theoretical running times. It doesn’t take 300 times as long to select 200,000 items, but it does take 25 times longer.

Note also that when k is 2, m is about 15 times k. As k increases, m also increases, but not in proportion. When k reaches 200,000, m is only about 3 times as large. And, of course, when k is the same size as the array, m == k.

In my initial tests, HeapSelect used a generic BinaryHeap collection class similar to this one. In an attempt to level the playing field (after all, QuickSelect was hand-coded), I moved the relevant portions of that BinaryHeap collection in-line so that I was operating on the items array directly. The result is between three and four times as fast as the original BinaryHeap. This resulted in HeapSelect being faster than QuickSelect when k is up to about one percent of n. With the original version QuickSelect is faster when k exceeds one-tenth of one percent of n.

The primary reason for the difference in run time is that the custom heap code uses an array rather than a generic list, and there is minimal error checking. In addition, I was able to combine the “remove lowest and add new” sequence into a single “replace” operation–something that the generic BinaryHeap implementation didn’t supply.

The ideal selection algorithm, would be a hybrid that uses the best approach based on how many items are being selected. I haven’t seen such a thing. There is a hybrid, Introselect, based on David Musser’s Introsort, which gives better performance than QuickSelect if the data is ordered, but it’s unclear to me whether it would perform better than HeapSelect when k is a very small in comparison to n.

HeapSelect, by the way, has a couple of other advantages over QuickSelect and similar algorithms. First, it doesn’t modify the array. QuickSelect moves things around, which might be a bad thing. You can of course make a copy of the array and use QuickSelect on that, but then you’re using O(n) extra space. If you can’t modify the input array, then HeapSelect becomes much more attractive.

On a similar note, HeapSelect works if you don’t know the total number of items to be examined, or if you can’t hold the entire list of items in memory. Say, for example, you want the top 200 items from a file of a several billion records–more records than will fit in memory. HeapSelect shines in this situation because it can start at the beginning and go through the file sequentially. When it reaches the end of the file, the heap will contain the top 200 records. Another example is selecting the top n things that you see over time, for example records coming in over a network stream. With HeapSelect, you don’t need to keep track of every item–just the top k that you’ve seen so far.

The lesson here is that, whereas it’s important to know the theoretical complexity of the algorithm that you’re using, actual running times vary based on the implementation and on the nature of the data. In this case, the “inferior” HeapSelect outperforms QuickSelect when the number of items to select is small in proportion to the total number of items. QuickSelect is a better general-purpose selection algorithm and the one I would select if I could only have one, but HeapSelect is clearly better for a large number of real-world situations. It’s common to want the top .01 percent of items from a large list, and HeapSelect is four or five times faster than QuickSelect in that case.

It’s also important to remember that big O notation ignores constant factors. O(n) means that the running time will be proportional to n: for every order of magnitude change in n, the running time will increase by an order of magnitude. An O(n log n) algorithm is, theoretically, less efficient than an O(n) algorithm. However, big O doesn’t say, what the “proportion” is. For example, the real running time of that O(n) algorithm might be 100,000 * n whereas the real running time of the O(n log n) algorithm is 10 * n * log(n). The O(n log n) algorithm will be faster than the O(n) algorithm until 10 * log(n) exceeds 100,000.

So know your algorithms, but also know your data. When possible, select the algorithm that will give you the best performance for your expected application rather than the algorithm that gives the best overall performance.