Abstract

Background

S. cerevisiae, A. thaliana and M. musculus are well-studied model organisms whose genomes
were sequenced many years ago. It remains a challenge, however, to develop methods that
automatically assign biological functions to the ORFs in these genomes. Several machine
learning methods have been proposed to this end, but it is unclear which of them is
preferable in terms of predictive performance, efficiency and usability.

Results

We study the use of decision tree based models for predicting the multiple functions
of ORFs. First, we describe an algorithm for learning hierarchical multi-label decision
trees. These can simultaneously predict all the functions of an ORF, while respecting
a given hierarchy of gene functions (such as FunCat or GO). We present new results
obtained with this algorithm, showing that the trees it finds have clearly better
predictive performance than the trees found by previously described methods. Nevertheless,
the predictive performance of individual trees is lower than that of some recently
proposed statistical learning methods. We show that ensembles of such trees are more
accurate than single trees and are competitive with state-of-the-art statistical learning
and functional linkage methods. Moreover, the ensemble method is computationally efficient
and easy to use.
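
As an illustration only, the sketch below shows one way a tree leaf can predict all classes at once while respecting a hierarchy, and how an ensemble can combine such predictions. It is not the algorithm evaluated here: the toy hierarchy, the class names and all function names (`close_upward`, `leaf_vector`, `ensemble_vector`, `predict`) are assumptions made for the example. Each leaf stores, per class, the proportion of its training examples that carry that class; because labels are first extended with their ancestors, a parent's proportion can never be lower than a child's, and averaging over trees preserves this property.

```python
from statistics import mean

# Hypothetical toy hierarchy (child -> parent); the class names are made up.
PARENT = {"1": None, "1/1": "1", "1/2": "1", "2": None}
CLASSES = sorted(PARENT)

def close_upward(labels):
    """Add every ancestor of each label so the label set respects the hierarchy."""
    closed = set()
    for c in labels:
        while c is not None:
            closed.add(c)
            c = PARENT[c]
    return closed

def leaf_vector(examples):
    """Per-class proportion among the training examples that reach one leaf."""
    closed = [close_upward(e) for e in examples]
    return {c: mean(1.0 if c in e else 0.0 for e in closed) for c in CLASSES}

def ensemble_vector(leaf_vectors):
    """Average the leaf vectors that the individual trees return for one ORF."""
    return {c: mean(v[c] for v in leaf_vectors) for c in CLASSES}

def predict(vector, threshold):
    """Predict every class whose (averaged) proportion reaches the threshold."""
    return {c for c in CLASSES if vector[c] >= threshold}

# Two hypothetical trees: the leaves that a single ORF falls into.
tree_a = leaf_vector([{"1/1"}, {"1/2"}, {"2"}])
tree_b = leaf_vector([{"1/1"}, {"1/1", "2"}])
avg = ensemble_vector([tree_a, tree_b])
```

With a single threshold applied to every class, any predicted class has all of its ancestors predicted as well, so the output is consistent with the hierarchy by construction.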

Conclusions

Our results suggest that decision tree based methods are a state-of-the-art, efficient
and easy-to-use approach to ORF function prediction.