A Wavelet Tree Based FM-Index for Biological Sequences in SeqAn

Abstract

Technological advances in genome research have led to the generation of massive amounts of data that must be stored and analyzed. The sheer volume of information demands specialized data structures and algorithms for efficient analysis. Such analysis often requires identifying sequences of interest in genomes, which can be achieved using full-text indices. Until recently, the major drawback of this approach was its memory consumption, which can now be overcome using the well-known FM-index. In this thesis, we therefore extended SeqAn, a software library that provides data structures and algorithms for analyzing biological sequences, with sophisticated FM-index variants designed for fast and memory-efficient pattern search. We show that our variants are not only competitive with existing FM-index implementations but also outperform them.