Saturday, May 20, 2006

Selection of the top search results

A google for "Scheme" gives "Results 1 - 100 of about 356,000,000" in 0.40 seconds. Selecting these top 100 results could be done by sorting the 356,000,000 results after rank and returning the 100 greatest. It is however very seldom that a user will ever want to examine more than the top 500 results, thus it is unnecessary to sort more than the top results.

The trick to find the n greatest elements of a list xs is to use a heap.

Begin by inserting the first n elements of the list into a heap. The elements of the heap should be thought of as the top n candidates.

For each remaining element x of the list we compare it with the minimum element of the heap. If x is smaller than the minimum element, then the n heap elements are all greater than x, and x is therefore not in the top n. If on the other hand x is larger than the minimum element, then x is inserted into the heap instead of the previous minimum element.

Now the heap contains the top-n elements, and by continually deleting the minimum element, we will get a sorted list of the top n elements.

To find the top n elements out of m elements in this way will use time O(m log(n)). This is to be compared with O(m log(m)) from the naïve solution.