Abstract

We present space-economical algorithms for finding maximal unique matches (MUMs) between two strings, a key step in large-scale genome sequence alignment. Our algorithms require only O(n) bits (O(n/log n) words), where n is the total length of the strings. We propose three algorithms, one for each kind of input: the strings alone, their compressed suffix array, or their compressed suffix tree. Their time complexities are O(n log n), O(n log^ε n) and O(n) respectively, where ε is any constant between 0 and 1. We also show an algorithm that constructs the compressed suffix tree from the compressed suffix array in O(n log^ε n) time and O(n) bits of space.
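To make the problem statement concrete, the following is a minimal brute-force sketch of what a MUM is: a substring that occurs exactly once in each string and cannot be extended to the left or right while still matching. It is not one of the O(n)-bit algorithms described in the abstract, and it takes roughly cubic time in the worst case. The function names `count_occurrences` and `naive_mums` are illustrative assumptions, not names from the paper.

```python
def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def naive_mums(a, b):
    """Return (pos_in_a, pos_in_b, length) for every maximal unique match.

    Brute force for illustration only; hypothetical helper, not the
    paper's space-economical method.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could be extended one character to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend the match to the right as far as possible.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            candidate = a[i:i + k]
            # Keep it only if it is unique in both strings.
            if count_occurrences(a, candidate) == 1 and count_occurrences(b, candidate) == 1:
                mums.append((i, j, k))
    return mums


if __name__ == "__main__":
    print(naive_mums("abcxyz", "xyzabc"))
```

For the two strings in the usage line, the matches reported are "abc" and "xyz", each unique in both strings. A match such as "a" between "aa" and "a" is rejected because it occurs twice in the first string.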

Citations

...ay using O(n log^ε n) time and O(n) bits space. 1 Introduction The suffix tree is a quite useful data structure for solving string problems. Many problems can be efficiently solved by using the suffix tree =-=[5]-=-. However the problem of using the suffix tree is its size. It is said that the suffix tree occupies about 17n bytes for a string of length n. Although a space-efficient representation of the suffix tree [7] ...

...to develop space-economical alternatives to the suffix tree. Recently many such data structures were proposed, for example space-efficient suffix trees [10], the compressed suffix array [4, 11], the FM-index =-=[3]-=-, data structures for bottom-up traversal of the suffix tree [6], and data structures for longest common prefixes [12]. However none of them has the same functions as the suffix tree. In this paper we con...

...es (MUM's) between two strings A and B. An MUM is a substring that appear once in both A and B and is not contained in any longer such substring. The MUM's are used in the algorithm of Delcher et al. =-=[1]-=- for aligning two long genome sequences. Details are described in Section 3. Although this problem can be solved in linear time by using the suffix tree of the strings A and B, it is not space-efficient. ...

...by a factor of O(log n) because the suffix tree requires O(n) pointers, or equivalently O(n log n) bits. Our data structure include the compressed suffix array (CSA), parentheses representation of a tree =-=[9]-=- and the data structures for longest common prefixes (Hgt array) [12]. Note that these data structures do not store suffix links in a suffix tree. Therefore we may not be able to solve the problem efficient...

...ently many such data structures were proposed, for example space-efficient suffix trees [10], the compressed suffix array [4, 11], the FM-index [3], data structures for bottom-up traversal of the suffix tree =-=[6]-=-, and data structures for longest common prefixes [12]. However none of them has the same functions as the suffix tree. In this paper we consider the problem of finding maximal unique matches (MUM's) be...

...ample space-efficient suffix trees [10], the compressed suffix array [4, 11], the FM-index [3], data structures for bottom-up traversal of the suffix tree [6], and data structures for longest common prefixes =-=[12]-=-. However none of them has the same functions as the suffix tree. In this paper we consider the problem of finding maximal unique matches (MUM's) between two strings A and B. An MUM is a substring that ...

...efore it is important to develop space-economical alternatives to the suffix tree. Recently many such data structures were proposed, for example space-efficient suffix trees [10], the compressed suffix array =-=[4, 11]-=-, the FM-index [3], data structures for bottom-up traversal of the suffix tree [6], and data structures for longest common prefixes [12]. However none of them has the same functions as the suffix tree. In...

... [5]. However the problem of using the suffix tree is its size. It is said that the suffix tree occupies about 17n bytes for a string of length n. Although a space-efficient representation of the suffix tree =-=[7]-=- has been proposed, it still occupies more than 10n bytes. Therefore it is difficult to apply the suffix tree to solve large scale problems. The problem is severe in treating genome scale strings. For exa...

...es, which is not realistic. Therefore it is important to develop space-economical alternatives to the suffix tree. Recently many such data structures were proposed, for example space-efficient suffix trees =-=[10]-=-, the compressed suffix array [4, 11], the FM-index [3], data structures for bottom-up traversal of the suffix tree [6], and data structures for longest common prefixes [12]. However none of them has the ...