Abstract

One of the most relevant succinct suffix array proposals in the literature is the Compressed Suffix Array (CSA) of Sadakane [ISAAC 2000]. The CSA needs n(H0 + O(log log σ)) bits of space, where n is the text size, σ is the alphabet size, and H0 is the zero-order entropy of the text. The number of occurrences of a pattern of length m can be computed in O(m log n) time. Most notably, the CSA does not need the text to be separately available to operate. The CSA simulates a binary search over the suffix array, where the query is compared against text substrings. These are extracted from the same CSA by following irregular access patterns over the structure. Sadakane [SODA 2002] has proposed using backward searching on the CSA in a fashion similar to the FM-index of Ferragina and Manzini [FOCS 2000]. He has shown that the CSA can be searched in O(m) time whenever σ = O(polylog(n)). In this paper we consider some other consequences of backward searching applied to the CSA. The most remarkable one is that, unlike all previous proposals, we do not need any complicated sub-linear structures based on the four-Russians technique (such as constant-time rank and select queries on bit arrays). We show that sampling and compression are enough to achieve O(m log n) query time using less space than the original structure. It is also possible to trade structure space for search time. Furthermore, the regular access pattern of backward searching permits an efficient secondary memory implementation, so that the search can be done with O(m log_B n) disk accesses, where B is the disk block size. Finally, it permits a distributed implementation with optimal speedup and negligible communication effort.

Citations

...munication need. Upon completing the search, it sends R(p_{m−1} p_m) to the processor responsible for p_{m−2}, and so on. After m communication steps exchanging O(1) data, we have the answer. In the BSP model [24], we need m supersteps of O(log n) CPU work and O(1) communication each. In comparison, the CSA needs O(m log n) supersteps of O(1) CPU and communication each, and the basic suffix array needs O(log n...

... the index. The suffix tree takes much more memory than the text. In general, it takes O(n log n) bits, while the text takes n log σ bits. A smaller constant factor is achieved by the suffix array [10]. Still, the space complexity does not change. Moreover, the searches take O(m log n) time with the suffix array (this can be improved to O(m + log n) using twice the original amount of space [10]). T...
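The O(m log n) suffix array search mentioned in this excerpt can be illustrated with a minimal sketch. It assumes a naively built, uncompressed suffix array; the function names `build_suffix_array` and `count_occurrences` are hypothetical and do not come from the paper. Each of the O(log n) binary search steps compares at most m characters of the pattern against a text suffix.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically.
    Fine for illustration only; real indexes use faster construction."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count occurrences of pattern with two binary searches over sa.
    Each step compares an m-character prefix of a suffix with the pattern,
    giving O(m log n) time overall."""
    m = len(pattern)

    # Leftmost suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Leftmost suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


text = "abracadabra"
sa = build_suffix_array(text)
print(count_occurrences(text, sa, "abra"))
```

Note that this search needs the text itself alongside the array, which is the dependency a self-index such as the CSA removes.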

...ffix of T is t_i t_{i+1} ... t_n). These kinds of indexes are called full-text indexes. Optimal query time, which is O(m) as every character of P must be examined, can be achieved by using the suffix tree [25,12,23] as the index. The suffix tree takes much more memory than the text. In general, it takes O(n log n) bits, while the text takes n log σ bits. A smaller constant factor is achieved by the suffix arr...

...e index. Existence and counting queries on the CSA take O(m log n) time. There are also other so-called succinct full-text indexes that achieve good tradeoffs between search time and space complexity [3,9,7,22,5,14,18,16,4]. Most of these are opportunistic as they take less space than the text itself, and also self-indexes as they contain enough information to reproduce the text: A self-index does not need the text to o...

...n of the CSA is possible: All previous proposals heavily rely on sublinear structures based on the so-called four-Russians technique [1] to support constant-time rank and select queries on bit arrays [8,13,2] (rank(i) to find out how many bits are set before position i, and select(j) to find out the position of the jth bit from the beginning). We show that these structures are not needed for an efficient ...
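To make the two queries concrete, the following is a minimal sketch of their semantics only. It uses plain linear scans rather than the constant-time four-Russians structures the excerpt refers to. The 0-based positions, the 1-based j, and the reading of "the jth bit" as the jth set bit are assumptions made for illustration.

```python
def rank(bits, i):
    """Number of set bits strictly before position i (0-based).
    Linear scan; the succinct structures answer this in O(1)."""
    return sum(bits[:i])


def select(bits, j):
    """Position (0-based) of the j-th set bit, counting j from 1.
    Returns -1 if fewer than j bits are set."""
    seen = 0
    for pos, bit in enumerate(bits):
        seen += bit
        if bit and seen == j:
            return pos
    return -1


bits = [1, 0, 1, 1, 0, 1]
print(rank(bits, 3), select(bits, 3))
```

Under these conventions the two queries are inverses in the sense that rank(bits, select(bits, j)) equals j − 1 whenever the jth set bit exists.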

... most important consequence of this is that a simpler implementation of the CSA is possible: All previous proposals heavily rely on sublinear structures based on the so-called four-Russians technique [1] to support constant-time rank and select queries on bit arrays [8,13,2] (rank(i) to find out how many bits are set before position i, and select(j) to find out the position of the jth bit from the be...

...e requirement of full-text indexes has raised interest in indexes that occupy the same amount of space as the text itself, or even less. For example, the Compressed Suffix Array (CSA) of Sadakane [19] takes in practice the same amount of space as the text compressed with a zero-order model. Moreover, the CSA does not need the text at all, since the text is included in the index. Existence and coun...

...red suffix. The extraction ends at letter G, and hence the suffix does not correspond to an occurrence, and the search is continued to the left of the current point.
4 Backward Search on CSA
Sadakane [20] has proposed using backward search on the CSA. Let us review how this search proceeds. We use the notation R(X), for a string X, to denote the range of suffix array positions corresponding to suffixe...
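Since the excerpt is cut off before the search is described, the following is only a sketch of the commonly described formulation of backward search over the Ψ array, not necessarily the exact procedure in the paper. It assumes an uncompressed Ψ stored as a plain list, a terminator character `"\0"` appended to the text, and hypothetical names `build_psi` and `backward_count`. The range R(p_i ... p_m) is obtained from R(p_{i+1} ... p_m) by a binary search over Ψ restricted to the suffixes that start with p_i, where Ψ is increasing.

```python
from bisect import bisect_left


def build_psi(text):
    """Build Psi and per-character suffix array ranges from a naive suffix array.
    Psi[r] is the rank of the suffix that starts one position after suffix r.
    Assumption: '\0' does not occur in text and sorts before every character."""
    text += "\0"
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    inv = [0] * n
    for r, p in enumerate(sa):
        inv[p] = r
    psi = [inv[(sa[r] + 1) % n] for r in range(n)]

    # ranges[c] = half-open interval of ranks whose suffixes start with c.
    ranges = {}
    for r, p in enumerate(sa):
        c = text[p]
        start, _ = ranges.get(c, (r, r))
        ranges[c] = (start, r + 1)
    return psi, ranges


def backward_count(psi, ranges, pattern):
    """Count occurrences, processing the pattern from last character to first.
    Invariant: [lo, hi) is R(suffix of pattern handled so far)."""
    lo, hi = 0, len(psi)  # R(empty string): every suffix
    for c in reversed(pattern):
        if c not in ranges:
            return 0
        s, e = ranges[c]
        # Psi is increasing on [s, e); keep ranks j with lo <= psi[j] < hi.
        lo, hi = bisect_left(psi, lo, s, e), bisect_left(psi, hi, s, e)
        if lo >= hi:
            return 0
    return hi - lo


psi, ranges = build_psi("abracadabra")
print(backward_count(psi, ranges, "abra"))
```

Each pattern character costs one pair of binary searches, so the sketch runs in O(m log n) time, and it never consults the text once Ψ and the character ranges are built.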

...itself, and also self-indexes as they contain enough information to reproduce the text: A self-index does not need the text to operate. Recently, several space-optimal self-indexes have been proposed [5,6,4], whose space requirement depends on the k-th order empirical entropy with constant factor one (except for the sub-linear parts). These indexes achieve good query performance in theory, but they are ...

...kes still O(log_B n) time, but we perform only ⌈m/ℓ⌉ of them. One obstacle to a secondary memory CSA implementation might be in building such a large CSA. This issue has been addressed satisfactorily [21].
8 A Distributed Implementation
Distributed implementations of suffix arrays face the problem that not only the suffix array, but also the text, are distributed. Hence, even if we distribute suffix a...

...ext, are distributed. Hence, even if we distribute suffix array A according to lexicographical intervals, the processor performing the local binary search will require access to remote text positions [17]. Although some heuristics have been proposed, log n remote requests for m characters each are necessary in the worst case. The original CSA does not help solve this. If array Ψ is distributed, we wil...