Wednesday, May 2, 2012

What should a database of datasets look like?

In the previous post, Steven made the very good point that we need a "database" of datasets that can be used to evaluate algorithms for phylogenetic networks. In biological terms, we currently lack a "gold standard" against which to compare the results of our data analyses. It is worth adding a few biological notes to this important point.

Validation (or evaluation) is a common analytical problem, not just in biology, and it has been addressed in many different circumstances. For example, another area in which I work is multiple DNA sequence alignment. Between 1999 and 2005 several different databases of empirical alignments were developed: BAliBase, OXBench, PREFAB, SABmark, and BRAliBase. These were created independently of each other, and were ostensibly designed for somewhat different purposes. Evaluations of computer algorithms since then have tended to use several of these databases as their gold standard; and it is quite clear that success on one of the databases does not imply success on any of the others.

This background has led me to the conclusion that a database needs a structured set of data, with the structure addressing all of the different biological issues that are likely to be important. For example, phylogenetic networks are used to analyze datasets that contain "reticulation" events such as hybridization and introgression, lateral gene transfer, recombination, and genome fusion, but such events can be confounded by other events such as deep coalescence and hidden paralogy. So, a truly valuable database would have datasets that encompass not only all of these possibilities but their combinations as well. Chris Whidden's comment on Steven's post discusses the same important issue, but from the point of view of the mathematical requirements for finding the optimal network.
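To make the idea of a "structured" database concrete, here is a minimal sketch of what one record might look like, with each dataset tagged by the reticulation events it contains and by any confounding events. All names here (DatasetRecord, select, the category strings) are hypothetical illustrations, not an actual schema:

```python
from dataclasses import dataclass

# Hypothetical controlled vocabularies, taken from the events listed above.
RETICULATION_EVENTS = {"hybridization", "introgression", "lateral_gene_transfer",
                       "recombination", "genome_fusion"}
CONFOUNDING_EVENTS = {"deep_coalescence", "hidden_paralogy"}

@dataclass
class DatasetRecord:
    """One entry in the proposed database, tagged by the events it contains."""
    name: str
    reticulations: frozenset = frozenset()
    confounders: frozenset = frozenset()

    def __post_init__(self):
        # Every tag must come from the controlled vocabulary above.
        assert self.reticulations <= RETICULATION_EVENTS
        assert self.confounders <= CONFOUNDING_EVENTS

def select(records, reticulation, confounder):
    """Structured query: datasets combining a given reticulation with a confounder."""
    return [r for r in records
            if reticulation in r.reticulations and confounder in r.confounders]
```

The point of the tags is that evaluations can then be reported per combination of events, rather than over an undifferentiated pool of datasets.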

Such a database is a very ambitious goal. More to the point, I doubt that such a database could be created without widespread collaboration, as both Steven and Chris have emphasized.

BAliBase (referred to above) attempted to have a structured set of multiple alignments, but it is interesting to note that this structure was mostly ignored by subsequent users of the database. The users simply pooled all the different groups of alignments together and came up with an "average" success for the alignment algorithms, rather than discovering (as they would have done; see Morrison 2006) that different algorithms have different degrees of success depending on the particular characteristics of the dataset being analyzed. We should not make the same mistake when evaluating network algorithms.
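The difference between pooling and respecting the database structure can be shown in a few lines. This sketch uses made-up scores for a hypothetical algorithm; the category names and numbers are illustrative only:

```python
from collections import defaultdict

def pooled_score(results):
    """Average success over all datasets, ignoring structure (the BAliBase mistake)."""
    scores = [score for _, score in results]
    return sum(scores) / len(scores)

def stratified_scores(results):
    """Average success per dataset category, preserving the database structure."""
    by_category = defaultdict(list)
    for category, score in results:
        by_category[category].append(score)
    return {cat: sum(v) / len(v) for cat, v in by_category.items()}

# Illustrative numbers: an algorithm that excels on hybridization datasets but
# fails on recombination ones looks merely "average" when the results are pooled.
results = [("hybridization", 0.9), ("hybridization", 0.8),
           ("recombination", 0.3), ("recombination", 0.4)]
```

Here the pooled average is 0.6, which hides the fact that the per-category averages are 0.85 and 0.35; only the stratified summary reveals that the algorithm's success depends on the characteristics of the dataset.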

I think that there have been four suggested approaches to acquiring datasets for evaluating tree/network algorithms (in order of increasing realism):

(1) simulate the data under an explicit model of reticulate evolution;

(2) construct artificial datasets by combining parts of pre-existing datasets that lack reticulation signals;

(3) use datasets where the postulated events have been independently confirmed;

(4) experimentally create taxa with a known evolutionary history.

Option (1) has been used by many workers to evaluate tree-building algorithms, and the simulation models have been readily adapted for phylogenetic networks (e.g. Bandelt et al. 2000; Morin 2007; Woolley et al. 2008). Indeed, this has been the most common strategy for evaluating network algorithms, although there seems to be little consensus so far on which data-generation model(s) to use. The basic limitation here is that simulations (a) show the success of the algorithms only relative to how well they fit the model used to simulate the data, and (b) have an unknown relationship to the "real world".

Option (2) has rarely been used for networks (e.g. Vriesendorp 2007). The basic idea is to create "known" reticulations by combining parts of pre-existing datasets that lack reticulation signals. One can either combine whole datasets that contain mutually incompatible signals, or one can create individual taxa that have parts of their data taken from different reticulation-free datasets. This is a promising approach to "experimental" phylogenetics, although lack of prior experience means that we do not yet know how to use this strategy most effectively.
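The second variant of this idea (a taxon whose data come from two reticulation-free sources) can be sketched very simply. This is a toy illustration, assuming pre-aligned sequences of equal length; the function name and sequences are invented for the example:

```python
def make_artificial_hybrid(parent_a, parent_b, breakpoint):
    """Create a 'known' reticulate taxon: the first part of its sequence comes
    from one reticulation-free dataset, the remainder from another.
    Sequences are assumed to be pre-aligned and of equal length."""
    assert len(parent_a) == len(parent_b)
    return parent_a[:breakpoint] + parent_b[breakpoint:]

# Toy aligned sequences with a breakpoint at position 4.
hybrid = make_artificial_hybrid("AAAAAAAA", "TTTTTTTT", 4)
# hybrid == "AAAATTTT": the reticulation event is known by construction,
# so a network algorithm's inference can be checked against the truth.
```

Because the reticulation is introduced deliberately, the "gold standard" here is exact, which is precisely what empirical datasets under option (3) cannot guarantee.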

Option (3) is an obvious approach to collating data (McDade 1990), and has been used for evaluating tree-building algorithms (e.g. McDade 1992; Leitner et al. 1996; Lemey et al. 2005). It has been applied, for example, to fast-evolving organisms such as viruses, where the transmission history can sometimes be independently checked. Also, hybrids can often be experimentally verified; and Vriesendorp (2007) lists several such datasets for plants. The problem here is the degree to which the postulated reticulation events really have been independently confirmed. A network reticulation may look like good evidence in favor of a "suspected" hybrid, for example, but it is not independent evidence of anything in particular. I suspect that this weak sort of reasoning lies behind far too many of the datasets that have been used to evaluate network algorithms.

Option (4) has occasionally been used for evaluating tree-building algorithms (e.g. Hillis et al. 1992; Cunningham et al. 1998; Sanson et al. 2002) but not, as far as I know, for network algorithms. The idea is to experimentally manipulate some biological organisms in the laboratory to create a known evolutionary history, against which subsequent data analyses can be compared. Realistically, this restricts the datasets to viruses and phages, as these can be manipulated within a reasonable timeframe.

We need to think about which of these options we wish to adopt. Perhaps all of them?