PELICAN : a PipELIne, including a novel redundancy-eliminating algorithm, to Create and maintain a topicAl family-specific Non-redundant protein database

by Andersson, Christoffer

Abstract (Summary)

The increasing number of biological databases today requires that users are able to search more efficiently among as well as in individual databases. One of the most widespread problems is redundancy, i.e. the problem of duplicated information in sets of data. This thesis aims at implementing an algorithm that distinguishes from other related attempts by using the genomic positions of sequences, instead of similarity based sequence comparisons, when making a sequence data set non-redundant. In an automatic updating procedure the algorithm drastically increases the possibility to update and to maintain the topicality of a non-redundant database. The procedure creates a biologically sound non-redundant data set with accuracy comparable to other algorithms focusing on making data sets non-redundant