Citation and License

Genome Biology 2006, 7(Suppl 1):S9
doi:10.1186/gb-2006-7-s1-s9

Published: 7 August 2006

Abstract

Background

Predicting complete protein-coding genes in human DNA remains a significant challenge.
Though a number of promising approaches have been investigated, an ideal suite of
tools has yet to emerge that can provide near-perfect levels of sensitivity and specificity
at the level of whole genes. As an incremental step in this direction, it is hoped
that controlled gene finding experiments in the ENCODE regions will provide a more
accurate view of the relative benefits of different strategies for modeling and predicting
gene structures.

Results

Here we describe our general-purpose eukaryotic gene finding pipeline and its major
components, as well as the methodological adaptations that we found necessary to accommodate
human DNA in that pipeline. We note that a similar level of effort may be required
of ourselves and others with similar pipelines whenever a new class of genomes is
presented to the community for analysis. We also describe a number of controlled experiments
in which various types of evidence and feature states were differentially included
in our models, and we report the impact of these variations on predictive accuracy.

Conclusion

For the non-comparative gene finders, we found that adding model states to represent
specific biological features did little to enhance predictive accuracy. For our
evidence-based 'combiner' program, by contrast, incorporating additional evidence
tracks tended to produce significant gains in accuracy for most evidence types. Together,
these results suggest that improved modeling efforts at the hidden Markov model level are
of relatively little value. We relate these findings to our current plans for future research.