Just something whacky to get you thinking. At least it got me pulling apart my hair, and now, I'm half bald.

Wednesday, September 15, 2010

A very fast approach to auto complete (or search suggestions)

You've seen search engines suggest queries when you begin typing the first few letters of your search string. This is done by Duck Duck Go as well as Google (to name a few). It is typically done by maintaining a list of past queries and/or important strings that the search engine thinks are worthy of being suggested to a user who is trying to find something similar. These suggestions are effective only if the search engine spits them out very fast, since they should show up on the screen before the user has finished typing the query. Hence, the speed with which these suggestions are made is critical to the usefulness of this feature.

Let us consider a situation (and a possible way of approaching this problem) in which, when a user enters the first few letters of a search query, he/she is presented with some suggestions that have the typed string as a prefix. Furthermore, these suggestions should be ordered by some score associated with each suggestion.

Approach-1:

Our first attempt at solving this would probably involve keeping the initial list of suggestions sorted in lexicographic order so that a simple binary search can give us the 2 ends of the list of strings that serve as candidate suggestions. These are all the strings that have the user's search query as a prefix. We now need to sort all these candidates by their associated score in non-increasing order and return the first 6 (say). We will always return a very small subset (say 6) of the candidates because it is not feasible to show all candidates since the user's screen is of bounded size and we don't want to overload the user with too many options. The user will get better suggestions as he/she types in more letters into the query input box.
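Concretely, approach-1 might look like this (a minimal Python sketch; the function and variable names are mine, not from any particular implementation):

```python
import bisect

def suggest(phrases, scores, prefix, k=6):
    """Approach-1: phrases is sorted lexicographically; scores[i] goes with phrases[i]."""
    # Binary search for the two ends of the candidate range.
    lo = bisect.bisect_left(phrases, prefix)
    # "\uffff" sorts after any character we expect in a phrase, so this
    # finds the end of the block of strings sharing the prefix.
    hi = bisect.bisect_left(phrases, prefix + "\uffff")
    # The expensive step: sort all candidates in the range by score.
    best = sorted(range(lo, hi), key=lambda i: -scores[i])[:k]
    return [phrases[i] for i in best]
```

The `sorted(...)` call over the whole candidate range is exactly the O(n log n) step discussed below.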

We immediately notice that if the candidate list (for short query prefixes, say of length 3) is large (a few thousand entries), then we will spend a lot of time sorting these candidates by their associated score. The cost of sorting is O(n log n) since the candidate list may be as large as the original list in the worst case. Hence, this is the total cost of the approach. Apache's Solr uses this approach. Even if we keep the scores bounded within a certain range and use bucket sort, the cost is still going to be O(n). We should definitely try to do better than this.

Approach-2:

One way of speeding things up is to use a Trie and store (pointers or references to) the top 6 suggestions at or below that node in the node itself. This idea is mentioned here. This results in O(m) query time, where m is the length of the prefix (or user's search query).

However, this results in too much wasted space because:

1. Tries are wasteful of space, and

2. you need to store (pointers or references to) 6 suggestions at each node, which results in a lot of redundant data.

We can mitigate (1) by using Radix (or Patricia) Trees instead of Tries.
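Here is a minimal sketch of approach-2 (my own toy code, assuming plain dict-based trie nodes and k = 6; a real implementation would use a Radix Tree and store references rather than copies):

```python
K = 6

class TrieNode:
    def __init__(self):
        self.children = {}
        self.top = []   # up to K (score, phrase) pairs, best first

def insert(root, phrase, score):
    node = root
    for ch in phrase:
        node = node.children.setdefault(ch, TrieNode())
        # Maintain this node's precomputed top-K list.
        node.top.append((score, phrase))
        node.top.sort(key=lambda p: -p[0])
        del node.top[K:]

def suggest(root, prefix):
    """O(m) in the prefix length: walk down, then read the stored list."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    return [phrase for _, phrase in node.top]
```

Every node along a phrase's path carries its own top-K list, which is the redundancy complained about in (2) above.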

Approach-3:

There are also other approaches to auto-completion, such as prefix expansion, that are used by systems such as redis. However, these approaches use up memory proportional to the square of the size of each suggestion (string). The easy way to get around this is to store all the suggestions in a single linear string (buffer) and represent each suggestion (or any prefix of one) as an (index, offset) pair into that buffer. For example, the string "hello" stored at index 0 of the buffer can stand in for all of its prefixes as the pairs (0, 1), (0, 2), (0, 3), (0, 4), and (0, 5), without a separate copy of each prefix.
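A toy sketch of the buffer idea (the class and method names are mine, purely for illustration):

```python
class SuggestionBuffer:
    """All suggestions live in one linear buffer; a suggestion (or any
    prefix of it) is just an (index, offset) pair into that buffer."""

    def __init__(self):
        self.buf = ""

    def add(self, s):
        idx = len(self.buf)
        self.buf += s
        return (idx, len(s))          # reference to the full string

    def prefixes(self, ref):
        idx, length = ref
        # One pair per prefix: O(1) extra space each, not a copy per prefix.
        return [(idx, plen) for plen in range(1, length + 1)]

    def resolve(self, ref):
        idx, off = ref
        return self.buf[idx:idx + off]
```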

We can do better by using Segment (or Interval) Trees. The idea is to keep the suggestions sorted (as in approach-1), but to also maintain an additional data structure called a Segment Tree, which allows us to perform range queries very quickly. Queries such as min, max, sum, etc. over a range can be answered in O(log n), where n is the number of leaf nodes in the Segment Tree. So, once we have the 2 ends of the candidate list, we perform a range-max query to get the element with the highest score among the candidates, and insert that range (keyed by its maximum score) into a priority queue. The top element in the queue is popped and split at the location of its highest-scoring element; the maximum scores of the 2 resulting sub-ranges are computed (again in O(log n) each) and the sub-ranges are pushed back into the priority queue. This continues till we have popped 6 elements from the priority queue. It is easy to see that we will never have considered more than 2k ranges (here k = 6).

Hence, the complexity of the whole process is the sum of:

The complexity for the range calculation: O(log n) (omitting prefix match cost) and

The complexity for a range search on a Segment Tree performed 2k times: O(2k log n) (since the candidate list can be at most 'n' in length)
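The whole pop-and-split procedure can be sketched as follows (my own code; the iterative segment-tree layout is one possible choice, not necessarily what the actual implementation does):

```python
import heapq

def build(scores):
    """Iterative segment tree; each node holds (max score, index of that max)."""
    n = len(scores)
    size = 1
    while size < n:
        size *= 2
    tree = [(float("-inf"), -1)] * (2 * size)
    for i, s in enumerate(scores):
        tree[size + i] = (s, i)
    for i in range(size - 1, 0, -1):
        tree[i] = max(tree[2 * i], tree[2 * i + 1])
    return tree, size

def range_max(tree, size, lo, hi):
    """(max score, its index) over positions [lo, hi), in O(log n)."""
    best = (float("-inf"), -1)
    lo += size
    hi += size
    while lo < hi:
        if lo & 1:
            best = max(best, tree[lo])
            lo += 1
        if hi & 1:
            hi -= 1
            best = max(best, tree[hi])
        lo //= 2
        hi //= 2
    return best

def top_k(tree, size, lo, hi, k=6):
    """Pop-and-split: at most 2k range-max queries, so O(k log n) overall."""
    out = []
    heap = []
    s, i = range_max(tree, size, lo, hi)
    if i >= 0:
        heap.append((-s, i, lo, hi))   # negate scores: heapq is a min-heap
    while heap and len(out) < k:
        s, i, lo, hi = heapq.heappop(heap)
        out.append(i)
        for a, b in ((lo, i), (i + 1, hi)):   # split around the popped maximum
            if a < b:
                t, j = range_max(tree, size, a, b)
                heapq.heappush(heap, (-t, j, a, b))
    return out
```

Each pop performs at most two further range-max queries, which is where the 2k bound on the number of ranges considered comes from.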

Update (29th October, 2010): I have implemented the approach described above in the form of a stand-alone auto-complete server using Python and Mongrel2. You can download it from here. Update (11th February, 2012): lib-face is now called cpp-libface and has a new home at github!

Your statement about Suggest Trees ("Though these look promising, treaps in practice can degenerate into a linear list.") refers to an old version of SuggestTree. Since 2011, the structure is not based on a treap, but on a compressed ternary search tree with precomputed top-k lists. While the additional space cost is small, this guarantees much better performance than the slow Segment Tree approach you are recommending.

@Nicolai The treap (with randomized heap keys) has a very low probability (polynomially low, in fact) of degenerating into a list, but the way treaps are used here is different: an adversary can manipulate the values and priorities to trigger the degenerate case pretty easily.

With respect to the ternary tree representation, do you know if 'k' needs to be fixed when the structure is built up (it seems so), or can we query for any 'k' once the structure is built up?

If the 'k' has to be pre-decided, then the constant cost per node (space) is at least O(k), which seems to be pretty high.

"With respect to the ternary tree representation, do you know if 'k' needs to be fixed when the structure is built up (it seems so), or can we query for any 'k' once the structure is built up?"

With a SuggestTree, 'k' has to be fixed when the structure is built.

"If the 'k' has to be pre-decided, then the constant cost per node (space) is at least O(k), which seems to be pretty high."

Since most of the nodes are at the bottom of the tree, most of them only hold a very short suggestion list. In my tests with real world data, the average list length was about 2 for k = 10 (and the number of nodes was about 1.3 n). So O(k) is only a theoretical upper bound for the space cost of a node (like, for example, O(n) is an upper bound for the time cost of searching in a hash table).

"Since most of the nodes are at the bottom of the tree, most of them only hold a very short suggestion list. In my tests with real world data, the average list length was about 2 for k = 10 (and the number of nodes was about 1.3 n). So O(k) is only a theoretical upper bound for the space cost of a node (like, for example, O(n) is an upper bound for the time cost of searching in a hash table)."

I thought that even internal nodes would hold suggestion lists since internal nodes would correspond to prefixes of strings. Isn't it so?

Also, I don't understand the bit "Since most of the nodes are at the bottom of the tree" since I would assume that if a tree has a fanout of 3 (ternary tree), then the number of leaf nodes is O(number of internal nodes) - because of a constant fanout.

"I thought that even internal nodes would hold suggestion lists since internal nodes would correspond to prefixes of strings. Isn't it so?"

Yes, even internal nodes hold a suggestion list. Did you read the SuggestTree documentation? Nodes (prefixes) with the same completions are compressed into one node. So for each suggestion inserted into the tree, at most one new node is added and at most one existing node is split into two nodes. This is why usually most of the nodes "are at the bottom of the tree" (have no middle child node) and thus hold a suggestion list of length 1.

Perhaps this is easier to understand if you do not imagine a ternary search tree, but a simpler trie data structure where the child nodes of a node are not ordered as a binary search tree. The number of nodes in such a tree and the length of the suggestion lists would be the same.

Yes, I have read the documentation, but am still unclear on a few things.

"for each suggestion inserted into the tree, at most one new node is added and at most one existing node is split into two nodes"

I understand this. In fact, a direct consequence of this is that every leaf node has just 1 suggestion in its list.

However, if the branching happens at depth 'd', then the suggestion lists of *potentially* 'd' nodes might need to be updated when such an insert happens and a new string might find itself in at most 'd' suggestion lists. This is where I don't quite understand the claim of a small number of entries in suggestion lists.

I'm not talking empirically - I'm talking about the worst case. If you can prove the average (or expected) case, then that would satisfy me.

Again, imagine a simpler trie data structure with direct edges between a node and its child nodes. Because of branching, the number of leaf nodes (suggestion lists of length 1) usually vastly exceeds the number of internal nodes with a long suggestion list. Or, to put it differently, an "average" (middle-ranking) suggestion is usually listed at only a very few nodes.

Of course, one can construct a worst-case trie that branches only near the root. The average length of the suggestion lists in such a trie would be approximately k/2. But on the other hand, the number of nodes would be reduced to almost n. And if one assumes that the probability of a non-branching node is 1/a, with a being the size of the alphabet, the probability of getting such a trie is practically zero for large n.

@Nicolai When you mention hash tables having a worst-case time complexity of O(n), do you mean space complexity or the total cost of querying O(n) elements in aggregate? If it is the former, then any hash table (even one with a single bucket) would have just O(n) space usage.

I did not find any use for the segment tree. If we know the weight of each word and have to report the top k words by weight, then I can use a min-heap of size k: insert the first k elements from the list, then for each of the remaining n-k elements check whether list[i] (where k <= i < n) has a weight greater than the top of the heap; if yes, remove the top and add list[i] to the heap. Time complexity: O(k) + O((n-k) log k).
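The scheme described in this comment, sketched in Python (the function name is mine):

```python
import heapq

def top_k_by_weight(weights, k):
    """Keep a min-heap of the k best weights seen so far."""
    heap = weights[:k]
    heapq.heapify(heap)                  # O(k)
    for w in weights[k:]:                # n - k iterations, O(log k) each
        if w > heap[0]:                  # better than the worst of the current top k
            heapq.heapreplace(heap, w)
    return sorted(heap, reverse=True)
```

Note that this still inspects every one of the n candidates on each query; the point of the segment tree in the post is to answer each query in O(k log n) without touching most of the candidates.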

> My question is how do we handle dynamic updates to the segment tree? If we want to update the system with new phrases, do we have to rebuild the segment tree entirely from scratch?

Instead of using a segment tree as described in the paper, you can augment any balanced binary tree to emulate a segment tree. Such a structure can be easily updated (i.e. entries can be added in the middle) since traversal uses the sizes of the left/right subtrees to determine which branch to take.

The simplest balanced binary tree you could use here is a treap, and it would be almost trivial to implement that with reasonably good expected bounds on running time.
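That suggestion might look like this (a sketch, all names mine): a treap keyed by implicit in-order position, augmented with subtree size and subtree max score, so that entries can be inserted in the middle and range-max queries still run in expected O(log n):

```python
import random

class Node:
    """Implicit treap node: position is the in-order index, not a stored key."""
    __slots__ = ("score", "prio", "cnt", "best", "left", "right")

    def __init__(self, score):
        self.score = score
        self.prio = random.random()   # random heap priority keeps depth O(log n) expected
        self.cnt = 1                  # subtree size (drives positional navigation)
        self.best = score             # max score in subtree (emulates the segment tree)
        self.left = self.right = None

def size(t):
    return t.cnt if t else 0

def mx(t):
    return t.best if t else float("-inf")

def pull(t):
    t.cnt = 1 + size(t.left) + size(t.right)
    t.best = max(t.score, mx(t.left), mx(t.right))
    return t

def merge(a, b):
    if not a: return b
    if not b: return a
    if a.prio > b.prio:
        a.right = merge(a.right, b)
        return pull(a)
    b.left = merge(a, b.left)
    return pull(b)

def split(t, k):
    """Split into (first k positions, the rest)."""
    if not t:
        return None, None
    if size(t.left) >= k:
        l, t.left = split(t.left, k)
        return l, pull(t)
    r_l, r = split(t.right, k - size(t.left) - 1)
    t.right = r_l
    return pull(t), r

def insert_at(t, i, score):
    """Insert a new entry at position i -- the 'add in the middle' operation."""
    l, r = split(t, i)
    return merge(merge(l, Node(score)), r)

def query_max(t, lo, hi):
    """Max score among positions [lo, hi); returns (tree, answer)."""
    l, m = split(t, lo)
    m, r = split(m, hi - lo)
    ans = mx(m)
    return merge(merge(l, m), r), ans
```

The same pop-and-split loop from the post works unchanged on top of `query_max`, with the added ability to insert new suggestions without rebuilding anything.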