Phylogenetic Tree Reconciliation: the Species/Gene Tree Problem

by Jean-François Dufayard, Laurent Duret and François Rechenmann

An algorithm to find gene duplications in phylogenetic trees in order to improve gene function inferences has been developed in a collaboration between the the Helix team from INRIA Rhône-Alpes and the biométrie moléculaire, évolution et structure des génomes team from the UMR biométrie, biologie évolutive de Lyon. The algorithm and its software is applicable to realistic data, especially n-ary species tree and unrooted phylogenetic tree. The algorithm also takes branch lengths into account.

With appropriate algorithms, it is possible to deduce species history studying genes sequences. Genes are indeed subject to mutations during the evolution process, and hence the corresponding (homologous) sequences in different species differ from each other. A tree can be built from the sequences comparison, relating genes and species history: a phylogenetic tree. Sometimes a phylogenetic tree disagrees with the species tree (constructed for example from anatomical and paleontological considerations). These differences can be explained by a gene being duplicated in a genome, and each copy having its own history. Consequently, a node in a phylogenetic tree can be the division of an ancestral species into two others, as well as a gene duplication. More precisely, in a family of homologous genes, paralogous genes have to be distinguished from orthologous genes. Two genes are orthologous if the divergence from their last common ancestor results from a speciation event, while they are paralogous if the divergence results from a duplication event.

It is essential to make the distinction because two paralogous genes are less likely to have preserved the same function than two orthologues. Therefore, if one wants to predict gene function by homology between different species, it is necessary to check whether genes are orthologous or paralogous to increase the accuracy of the prediction.

An algorithm has been developed which can deduce this information by comparing gene trees with the taxonomy of different species. Currently, the algorithm is applicable to gene families issued of vertebrates. It can be applied to realistic data: species trees may not necessarily be binary, and the tree structures are compared as well as their branch lengths. Finally, phylogenetic trees can be unrooted: the number of duplications is a good criterion to make a choice, and with this method the algorithm is able to root phylogenetic trees.

Software has been developed to use the algorithm. It has been written in JAVA 1.2, and the graphical interface permits an easy application to realistic data. An exhaustive species tree can be easily seen and edited (tested with more than 10,000 leaves). Results can be modified and saved.