In this project, we construct a biological sequence database that supports approximate pattern matching. Approximate pattern matching is a fundamental problem in biological research, where it is used to search for patterns related to a query in biological sequences such as DNA and protein sequences. Two sequences are said to be similar if their difference is less than a given threshold. The classical solution to this problem is dynamic programming, and several algorithms with good computation time are available for matching a pattern against a sequence. However, most of them address the online version of the problem: they assume that neither the pattern nor the text is known in advance, so they must scan the text sequentially. Moreover, those algorithms do not consider the I/O requirements or the secondary-storage management of the text. As biological sequence data is growing rapidly, a database designed for large data sets is required. The goal of this project is to construct a database that supports approximate sequence matching with good memory management, low I/O cost and short overall runtime.
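To make the classical dynamic programming approach concrete, the following is a minimal sketch of the standard online algorithm, in which every text position may start a match. It assumes that the "difference" between sequences is the unit-cost edit distance and that a match is reported when the distance is at most `k`; the function name `approx_find` and its parameters are illustrative and not taken from the project.

```python
def approx_find(pattern, text, k):
    """Return the 1-based end positions in `text` where some substring
    matches `pattern` with edit distance <= k (assumed similarity measure)."""
    m = len(pattern)
    # prev[i]: best edit distance between pattern[:i] and any substring
    # of the text ending at the previous text position.
    prev = list(range(m + 1))
    hits = []
    for j, c in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0 lets a match start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra character in the text
                         cur[i - 1] + 1)      # pattern character left unmatched
        if cur[m] <= k:
            hits.append(j)
        prev = cur
    return hits
```

The sketch reads the whole text once per query and takes time proportional to the product of the pattern and text lengths, which illustrates why sequential scanning becomes costly for large sequence collections.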
We studied several algorithms and data structures for pattern matching and developed a system that supports approximate pattern matching using a q-gram index database together with the q-gram filter proposed in a previous study. We first implemented the original q-gram filter, which has several problems with memory requirements and accuracy. We then made some improvements to the filter and compared the improved versions with the original. The experimental results show that our improved filters are more efficient than the original one.
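As an illustration of the general idea behind a q-gram index with a counting filter, here is a minimal in-memory sketch. It is not the filter implemented in the project, whose details are not given here. It relies on the well-known q-gram lemma: an occurrence of a pattern of length m with at most k edit errors shares at least m - q + 1 - kq q-grams with the pattern. The names `build_qgram_index` and `candidate_starts` are hypothetical, and any candidate the filter returns would still need to be verified, for example by dynamic programming.

```python
from collections import defaultdict

def build_qgram_index(text, q):
    """Map every q-gram of `text` to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(text) - q + 1):
        index[text[pos:pos + q]].append(pos)
    return index

def candidate_starts(index, pattern, q, k):
    """Return approximate start positions that pass the q-gram lemma filter.
    Candidates are unverified and may include false positives."""
    m = len(pattern)
    threshold = m - q + 1 - k * q
    if threshold <= 0:
        raise ValueError("filter cannot prune: choose a smaller q or k")
    # Count q-gram hits per diagonal (text position minus pattern offset).
    diag = defaultdict(int)
    for i in range(m - q + 1):
        for pos in index.get(pattern[i:i + q], ()):
            diag[pos - i] += 1
    # Up to k insertions or deletions can shift a hit by at most k diagonals,
    # so votes are summed over a window of 2k + 1 neighbouring diagonals.
    window = range(-k, k + 1)
    centres = sorted({d + s for d in diag for s in window})
    return [d for d in centres
            if sum(diag.get(d + s, 0) for s in window) >= threshold]
```

For example, with `index = build_qgram_index("TTACGTACGGTT", 3)`, calling `candidate_starts(index, "ACGTACGG", 3, 0)` examines only the positions listed in the index rather than scanning the text. In a database setting the index would reside on secondary storage, which this in-memory sketch does not model.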