Abstract

We describe a data structure that uses O(n) words of space and reports the k most relevant documents that contain a query pattern P in optimal O(|P| + k) time. Our construction supports an ample set of important relevance measures, such as the frequency of P in a document and the minimal distance between two occurrences of P in a document. We show how to reduce the space of the data structure from O(n log n) to O(n(log σ + log D + log log n)) bits, where σ is the alphabet size and D is the total number of documents.
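To make the two relevance measures named in the abstract concrete, here is a minimal brute-force sketch in Python. It is not the data structure described above: it scans every document and so takes time proportional to the collection size, not O(|P| + k). The function names (`occurrences`, `tf`, `mind`, `top_k_by_tf`) and the convention that `mind` returns `None` when P occurs fewer than twice are assumptions made for illustration.

```python
import heapq

def occurrences(pattern, text):
    """Start positions of all (possibly overlapping) occurrences of pattern in text."""
    positions, start = [], text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions

def tf(pattern, text):
    """Frequency of pattern in one document."""
    return len(occurrences(pattern, text))

def mind(pattern, text):
    """Minimal distance between two occurrences, or None if there are fewer than two (assumed convention)."""
    pos = occurrences(pattern, text)
    if len(pos) < 2:
        return None
    return min(b - a for a, b in zip(pos, pos[1:]))

def top_k_by_tf(docs, pattern, k):
    """Up to k (doc_id, tf) pairs for documents containing pattern, highest tf first."""
    scored = [(i, tf(pattern, d)) for i, d in enumerate(docs)]
    matching = [pair for pair in scored if pair[1] > 0]
    return heapq.nlargest(k, matching, key=lambda pair: pair[1])

docs = ["abracadabra", "banana", "cabbage"]
print(top_k_by_tf(docs, "a", 2))    # [(0, 5), (1, 3)]
print(mind("abra", "abracadabra"))  # 7
```

A baseline like this is mainly useful as a reference oracle when testing an indexed solution on small inputs.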

Citations

...e locus of a string P is the highest node v such that P is a prefix of path(v). Every occurrence of P corresponds to a unique leaf that descends from its locus. We refer the reader to, e.g., Gusfield [18] for an extensive description of the generalized suffix tree and related notions. We say that a leaf l is marked with document d if the suffix stored in l belongs to d. An internal node v is marked wi...

...d challenge is to move on towards the bag-of-words paradigm of information retrieval. Our model easily handles single-word searches, and also phrases (which is quite complicated with inverted indexes [39, 3], particularly if their weights have to be computed). Handling a set of words or phrases, whose weights within any document d must be combined in some form (for example using the tf × idf model) is mo...

...cause the size per point in Lemma 3.1 is O(log m), and our widths decrease doubly exponentially. As a query may span several stripes, a structure similar to the one used in the classical RMQ solution [6] is used. This gives linear space for stripes of width up to Ω(log^δ n). Smaller ones are solved with universal tables. In addition to the global array storing p.y for each p.x, we use another array s...

... the need to traverse W in order to find out the real weights, so as to compare weights from different nodes. However, those weights can be computed in time O(log^ε n) and using O(n log n) extra bits [10, 31, 9]. The operations on the priority queue can be carried out in O((log log n)^2) time [1]. Thus we have the following result. Lemma 7.1. Given a grid of n × n points, there exists a data structure that ...

...queries; their structure uses O(n) words of space and reports all docc documents that contain P in O(|P| log D + docc) time, where D is the total number of documents in the collection. Muthukrishnan [27] presented a data structure that uses O(n) words of space and answers document listing queries in O(|P| + docc) time. Muthukrishnan [27] also initiated the study of more sophisticated problems in whi...

..., but still an edge of a minitree can be labeled with a string of length Θ(n). Instead of representing the contracted tree and the minitrees separately, we use Sadakane’s compressed suffix tree (CST) [35] to represent the topology of the whole T in O(n) bits, and a recent compressed representation [5] of the global suffix array (SA) of the string collection, which takes O(n log σ) bits. This SA repres...

... insufficient. We have shown that our structure can use, instead, O(n(log σ + log D + log log n)) bits. There is a whole trend of reduced-space representations for general document retrieval problems [34, 37, 16, 21, 15, 11, 4]. Most of them make use of the so-called document array [27]. This approach has been shown to be very competitive in space and time for top-k problems, even using heuristic solutions [11, 29]. The spa...

...fferent nodes. However, those weights can be computed in time O(log^ε n) and using O(n log n) extra bits [10, 31, 9]. The operations on the priority queue can be carried out in O((log log n)^2) time [1]. Thus we have the following result. Lemma 7.1. Given a grid of n × n points, there exists a data structure that uses O(n) words of space and reports k most highly weighted points in a range Q = [a, b...

...ge increases by O(v log m) bits of space. Now we sort the v points in x-coordinate order, build the sequence Y[1..v] of their y-coordinates, and build a Range Minimum Query (RMQ) data structure on Y [13]. This structure requires only O(v) bits of space, does not need to access Y after construction (so we do not store Y), and answers in constant time the query rmq(c, d) = arg min_{c≤i≤d} Y[i] for any c...
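The query interface rmq(c, d) can be illustrated with a short Python sketch. Note that this is a swapped-in technique: it uses the textbook sparse table, which stores O(v log v) indices and keeps Y available at query time, whereas the structure cited in the excerpt needs only O(v) bits and does not access Y. The names `build_sparse_table` and `rmq`, the 0-based inclusive indexing, and the leftmost-on-ties rule are assumptions of this sketch.

```python
def build_sparse_table(Y):
    """table[j][i] holds the index of a minimum of Y[i .. i + 2**j - 1]."""
    n = len(Y)
    table = [list(range(n))]  # level 0: every position is its own argmin
    j = 1
    while (1 << j) <= n:
        prev, half = table[j - 1], 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if Y[left] <= Y[right] else right)
        table.append(row)
        j += 1
    return table

def rmq(Y, table, c, d):
    """Index of the leftmost minimum of Y[c..d] (0-based, inclusive)."""
    j = (d - c + 1).bit_length() - 1  # largest power of two fitting in the range
    left, right = table[j][c], table[j][d - (1 << j) + 1]
    return left if Y[left] <= Y[right] else right

Y = [5, 2, 7, 2, 9, 1, 4]
table = build_sparse_table(Y)
print(rmq(Y, table, 0, 4))  # 1
print(rmq(Y, table, 0, 6))  # 5
```

Each query combines two overlapping power-of-two windows, so it runs in constant time after O(v log v) preprocessing.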

... data structures. For example, the height of our grids was bounded by O(n), but it corresponds to the height of the suffix tree. This is O(log n) on average for any text generated from a Markov model [36], and indeed small in most practical cases. A common pitfall to practicality is space usage. Even achieving linear space (i.e., O(n log n) bits) can be insufficient. We have shown that our structure c...

...e mind(P, d), the minimum distance between two occurrences of P in d, and docrank(d), an arbitrary static rank assigned to a document d. Some more complex measures have also been proposed. Hon et al. [19] presented a solution for the top-k document retrieval problem for the case when the relevance measure is tf(P, d). Their data structure uses O(n log n) words of space and answers queries in O(|P| +...

...phrases, whose weights within any document d must be combined in some form (for example using the tf × idf model) is more challenging. We are only aware of some very preliminary results for this case [28, 20]. It is interesting to note that our online result allows simulating the left-to-right traversal, in decreasing weight order, of the virtual list of occurrences of any string pattern P. Therefore, fo...

...in Theorem 1.1. For instance, we might be interested in reporting all documents d with tf(P, d) × idf(P) ≥ τ, where idf(P) = log(N/df(P)) and df(P) is the number of documents where P appears [3]. Using the O(n)-bit structure of Sadakane [34], we can compute idf(P) in O(|P|) time. To answer the query, we use our data structure of Theorem 1.1 in online mode on measure tf: For every reporte...
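The excerpt is cut off before it finishes describing the procedure, so the following Python sketch is only one plausible reading, under stated assumptions. Because idf(P) is fixed for a given pattern, tf(P, d) × idf(P) cannot increase as tf decreases, so a stream of documents in decreasing tf order can be cut at the first document that falls below τ. The iterator `docs_by_tf_desc` is a hypothetical stand-in for the online mode of the structure of Theorem 1.1, the function names are invented for illustration, and the natural logarithm is an assumed choice of base.

```python
import math

def idf(num_docs, df):
    """idf(P) = log(N / df(P)); natural logarithm assumed here."""
    return math.log(num_docs / df)

def report_above_threshold(docs_by_tf_desc, num_docs, df, tau):
    """Collect doc ids with tf * idf >= tau from a stream sorted by decreasing tf.

    docs_by_tf_desc yields (doc_id, tf) pairs; it stands in for an online
    top-k structure and is consumed lazily, stopping at the first failure.
    """
    weight = idf(num_docs, df)
    reported = []
    for doc_id, tf in docs_by_tf_desc:
        if tf * weight < tau:
            break  # every later document has tf no larger than this one
        reported.append(doc_id)
    return reported

# Hypothetical stream: P occurs in 2 of 8 documents, with tf 5 and tf 2.
stream = iter([(3, 5), (7, 2)])
print(report_above_threshold(stream, num_docs=8, df=2, tau=4.0))  # [3]
```

The early exit means the number of documents pulled from the stream is at most one more than the number reported.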
