Interesting search topics and random musings

Monday Apr 28, 2008

Anyways, heaps are a ferociously useful data structure, especially in a search engine. Minion uses heaps in two crucial places.

The first is in the results fetching code. The problem we need to solve is: given a set of documents with associated scores, how do we choose the n documents with the highest scores? The answer is to make a heap with n elements that's organized by the score, with the lowest score at the root of the heap. When considering a document, if the document has a score greater than the score at the root of the heap, we remove the root of the heap and add the new document, since it has a higher score than the document from the top n with the lowest score.

Doing this saves a lot of the comparisons that you would generate sorting a million document return set to find the top 10 documents.

The other crucial place that we use a heap is when merging dictionaries. To merge n dictionaries, we make a heap of iterators for the dictionaries organized by the term from the dictionary that's at the head of the iterator. For a given term, you can poll the heap to get all of the iterators (and therefore all of the dictionaries) that contain that term and combine the information into the new dictionary.

We use heaps in a lot of other places too. In Minion, we use the java.util.PriorityQueue class for our heap, and Netbeans says that we have 90 usages of the PriorityQueue class!

Interesting fact: Mark is a Googler, so I expect he knows how heaps are used in search engines, and it turns out that I knew his wife when I was a grad student.

About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.