AN ARCHIVE OF EVOLGEN, WHICH CAN NOW BE FOUND AT SCIENCE BLOGS (http://scienceblogs.com/evolgen)

Friday, May 27, 2005

What about distance based methods?

DNA and protein sequence data from multiple species allow us to reconstruct the evolutionary relationships between those sequences -- a field known as phylogenetics. One of the earliest methods for constructing phylogenetic trees was borrowed from morphologists. This algorithm (known as the unweighted pair group method with arithmetic mean, or UPGMA) relies on a distance matrix to determine the relationships between sequences.

     A    B    C    D    E
B    2
C    4    4
D    6    6    6
E    6    6    6    4
F    8    8    8    8    8
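As a rough illustration (my own sketch, not part of the original post), here is a minimal UPGMA implementation in Python that repeatedly merges the closest pair of clusters, averaging distances weighted by cluster size, using the post's example distances as input:

```python
def upgma(labels, dist):
    """UPGMA sketch: repeatedly merge the closest pair of clusters.
    `dist` maps frozenset({x, y}) -> distance between clusters x and y."""
    sizes = {lab: 1 for lab in labels}          # cluster label -> leaf count
    while len(sizes) > 1:
        a, b = min(dist, key=dist.get)          # closest pair of clusters
        merged = "(%s,%s)" % (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        dist.pop(frozenset({a, b}))
        # the new cluster's distance to each remaining cluster is the
        # arithmetic mean over all leaf pairs, hence the sizes act as weights
        for c in list(sizes):
            d = (na * dist.pop(frozenset({a, c})) +
                 nb * dist.pop(frozenset({b, c}))) / (na + nb)
            dist[frozenset({merged, c})] = d
        sizes[merged] = na + nb
    return next(iter(sizes))                    # Newick-like string

# the example distance matrix from above
pairs = {("A","B"): 2, ("A","C"): 4, ("B","C"): 4,
         ("A","D"): 6, ("B","D"): 6, ("C","D"): 6,
         ("A","E"): 6, ("B","E"): 6, ("C","E"): 6, ("D","E"): 4,
         ("A","F"): 8, ("B","F"): 8, ("C","F"): 8, ("D","F"): 8, ("E","F"): 8}
dist = {frozenset(p): v for p, v in pairs.items()}
tree = upgma(list("ABCDEF"), dist)
print(tree)   # F, the most distant taxon, joins last
```

On these (ultrametric) distances the algorithm groups A with B, then C with that pair, groups D with E, and attaches F last.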

The UPGMA algorithm assumes that all lineages are evolving at the same rate (oftentimes unrealistic), and it produces what is known as a "rooted tree." This algorithm is rarely used with molecular data due to its limitations. UPGMA has been replaced by the neighbor joining algorithm as the most favored distance based method (i.e., a method using a distance matrix). The neighbor joining algorithm allows for unequal rates of evolution between sequences and is designed specifically for molecular data.
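To make the contrast concrete, here is a small sketch (mine, not from the post) of the neighbor joining selection criterion: at each step the algorithm joins the pair minimizing a Q value that discounts the raw pairwise distance by each taxon's total divergence, which is what lets it tolerate unequal rates:

```python
import itertools

def q_matrix(labels, d):
    """Neighbor-joining Q criterion. `d` maps frozenset({i, j}) -> distance.
    The pair (i, j) with the smallest Q value is joined first."""
    n = len(labels)
    # total distance from each taxon to all others
    r = {i: sum(d[frozenset({i, k})] for k in labels if k != i) for i in labels}
    return {(i, j): (n - 2) * d[frozenset({i, j})] - r[i] - r[j]
            for i, j in itertools.combinations(labels, 2)}

# the example distance matrix from above
pairs = {("A","B"): 2, ("A","C"): 4, ("B","C"): 4,
         ("A","D"): 6, ("B","D"): 6, ("C","D"): 6,
         ("A","E"): 6, ("B","E"): 6, ("C","E"): 6, ("D","E"): 4,
         ("A","F"): 8, ("B","F"): 8, ("C","F"): 8, ("D","F"): 8, ("E","F"): 8}
d = {frozenset(p): v for p, v in pairs.items()}
q = q_matrix(list("ABCDEF"), d)
print(min(q, key=q.get))   # the first pair neighbor joining would join
```

On these distances the minimum Q value (-44) is shared by the pairs (A, B) and (D, E), so either is a valid first join.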

Distance based methods produce one tree based on their algorithm. This is in contrast to maximum parsimony methods, which explore all possible trees for the one requiring the smallest number of evolutionary steps. (NOTE: With large data sets it becomes infeasible to explore every tree, so search algorithms have been developed to expedite the process.) It is unclear whether or not the most parsimonious tree is the most probable, so some people question the biological meaning of parsimony methods.
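To see why exhaustive search becomes infeasible, note the standard result (not from the post) that the number of distinct unrooted binary trees for n taxa is the double factorial (2n-5)!! = 1 x 3 x 5 x ... x (2n-5), which explodes quickly:

```python
def num_unrooted_trees(n):
    """Number of distinct unrooted binary trees for n taxa: (2n-5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):   # multiply the odd numbers 3..(2n-5)
        count *= k
    return count

print(num_unrooted_trees(10))   # 2,027,025 trees for just 10 taxa
print(num_unrooted_trees(20))   # over 2 x 10**20 -- exhaustive search is hopeless
```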

Recently, with the advances in computing, it has become possible to construct phylogenies using maximum likelihood methods. Tree building using maximum likelihood requires one to choose a model of molecular evolution, and an entire field of research has developed concerning which models are appropriate for different scenarios.

A paper by Kolaczkowski and Thornton in Nature last October on the performance of parsimony and likelihood methods has inspired some discussion in the most recent edition of Trends in Genetics. Previous research showed that parsimony methods were subject to inaccurate results with certain branch length combinations. Kolaczkowski and Thornton, however, found that likelihood methods perform much more poorly than parsimony when evolutionary rates vary substantially over time. Interestingly, this paper does not include a comparison with any distance based methods.

Mike Steel has written a review of the Kolaczkowski and Thornton paper for Trends in Genetics. In his review, Steel points out that the accuracy of likelihood approaches depends on choosing the appropriate model of molecular evolution, and that Kolaczkowski and Thornton's scenario must be considered a plausible model. More importantly, he questions the wisdom of over-parameterizing one's models. Real life is extremely complex and could be described by an infinite number of parameters; in order to model this complexity, however, one must determine which parameters are the most important and include only those that are necessary. Steel concludes in his review,

"In summary, ‘better, more realistic models’ should not mean ‘more parameter-rich models’ – these might ‘capture’ more of reality, but only when the numerous parameters that are required are close to their correct values. However, the power of a given amount of data to estimate several parameters accurately is generally low. Modest parameter models that capture the main features of the sequence data are more useful – learning how DNA evolves is crucial to this task and a challenge for the future."

If one is concerned with simplifying one's models, one should shy away from likelihood and rely on simple distance methods. I am not advocating UPGMA, but neighbor joining should always be considered a viable option. It is a shame that these papers and reviews only consider parsimony and likelihood methods without even acknowledging the existence of distance based approaches.

I am not arguing that one should only use distance based approaches, only that they should still be considered a viable method. Distance methods do rely on models of nucleotide or amino acid substitution to construct the distance matrices, but they are nowhere near as parameter dependent as likelihood approaches. It has always been explained to me (and I try to echo this whenever necessary) that if your data are robust, you should recover the same tree topology regardless of which approach you use. Therefore, you should try multiple methods (distance based, maximum parsimony, maximum likelihood) and determine the similarities between the different phylogenies. Regions of the tree that disagree using different methods should not be regarded with high confidence. Ignoring a well established approach only weakens your conclusions. Are distance methods dead in the eyes of tree builders or just in the minds of modelers and parsimony and likelihood advocates?
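As an example of that (comparatively light) model dependence, the classic Jukes-Cantor one-parameter model corrects an observed proportion of differing sites p for unobserved multiple substitutions at the same site; a sketch of the standard formula:

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor corrected distance from the observed proportion of
    differing sites p: d = -(3/4) ln(1 - 4p/3), valid for p < 0.75."""
    return -0.75 * math.log(1 - 4.0 * p / 3.0)

print(jukes_cantor(0.10))   # ~0.107: slightly more than the raw 10% difference
```

A single substitution-rate parameter is the entire model; contrast that with the parameter-rich models Steel warns about.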

-----------------NOTE: I may be biased in my opinion due to spending too much time around Masatoshi Nei's group. For those unfamiliar with the field, Nei developed the neighbor joining algorithm and has been a staunch advocate of distance based methods. He has always argued, "Why use a complex method when a much simpler one works just as well?" In other words, keep it simple, stupid.