Citation

Metadata

Abstract

Background: The Generalized Hidden Markov Model (GHMM) has proven a useful framework
for the task of computational gene prediction in eukaryotic genomes, due to its flexibility and
probabilistic underpinnings. As the focus of the gene finding community shifts toward the use of
homology information to improve prediction accuracy, extensions to the basic GHMM model are
being explored as possible ways to integrate this homology information into the prediction process.
Particularly prominent among these extensions are those techniques which call for the
simultaneous prediction of genes in two or more genomes at once, thereby increasing significantly
the computational cost of prediction and highlighting the importance of speed and memory
efficiency in the implementation of the underlying GHMM algorithms. Unfortunately, the task of
implementing an efficient GHMM-based gene finder is already a nontrivial one, and it can be
expected that this task will only grow more onerous as our models increase in complexity.
Results: As a first step toward addressing the implementation challenges of these next-generation
systems, we describe in detail two software architectures for GHMM-based gene finders, one
comprising the common array-based approach, and the other a highly optimized algorithm which
requires significantly less memory while achieving virtually identical speed. We then show how both
of these architectures can be accelerated by a factor of two by optimizing their content sensors.
We finish with a brief illustration of the impact these optimizations have had on the feasibility of
our new homology-based gene finder, TWAIN.
Conclusions: In describing a number of optimizations for GHMM-based gene finders and making
available two complete open-source software systems embodying these methods, it is our hope
that others will be more enabled to explore promising extensions to the GHMM framework,
thereby improving the state-of-the-art in gene prediction techniques.