Wednesday, September 18, 2013

Checking data errors with phylogenetic networks

Data-display networks can be used for a number of purposes, for example: exploratory data analysis, displaying data patterns, displaying data conflicts, summarizing analysis results, and testing phylogenetic hypotheses. One of the more important, but currently under-valued, purposes is detecting data errors.

For instance, networks can help you detect data-sampling errors or outliers (e.g. wrong specimen identification, diseased specimens), as well as data-collection errors (e.g. extracting the wrong DNA, amplifying the wrong gene, sequencing artifacts) and data-processing errors (e.g. data-entry mistakes, incorrect alignment). These types of errors will likely show up as reticulations in a network, especially a splits graph.

Perhaps the most powerful use of such networks is in conjunction with a database of gold-standard or benchmark sequences. Comparison of all new sequences with the database would allow for a systematic quality check, because the network structure of the database is already known, and any deviation from this structure highlights potential problems ("identifying idiosyncrasies that cannot be attributed to natural evolutionary processes") or indicates novel sequence variation. Much of this process can be effectively automated by computer scripts.
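As a rough illustration of what such a script might look like, here is a minimal Python sketch. It does not build a network at all; instead it uses a simple proxy for reticulation, namely the number of pairs of two-state alignment columns that fail the four-gamete test (that is, pairs that cannot both fit on a single tree). Each new sequence is added to the benchmark alignment on its own, and the script reports how many extra conflicting column pairs it introduces. The function names (`count_conflicts`, `screen`), the toy sequences and the choice of proxy are all my own assumptions for the sake of the example, not a method taken from any of the publications listed below.

```python
from itertools import combinations


def count_conflicts(sequences):
    """Count pairs of two-state columns that fail the four-gamete test.

    Assumes the sequences are already aligned and of equal length.
    Columns with one state, or with more than two states, are ignored.
    """
    columns = [col for col in zip(*sequences) if len(set(col)) == 2]
    conflicts = 0
    for col_a, col_b in combinations(columns, 2):
        # Two binary columns are incompatible when all four
        # combinations of their states occur among the sequences.
        if len(set(zip(col_a, col_b))) == 4:
            conflicts += 1
    return conflicts


def screen(benchmark, new_sequences):
    """Return, for each new sequence, the number of extra conflicting
    column pairs it adds to the benchmark alignment (hypothetical helper)."""
    reference = list(benchmark.values())
    baseline = count_conflicts(reference)
    return {
        name: count_conflicts(reference + [seq]) - baseline
        for name, seq in new_sequences.items()
    }


# Toy benchmark alignment (invented data) with no conflicting columns.
benchmark = {
    "ref1": "AAGTA",
    "ref2": "AAGCA",
    "ref3": "CCGCA",
    "ref4": "CCTCA",
}

# "novel" carries one private substitution; "suspect" looks like ref1
# but has the ref4 state at the third position.
new_sequences = {
    "novel": "CCGCG",
    "suspect": "AATTA",
}

if __name__ == "__main__":
    for name, extra in screen(benchmark, new_sequences).items():
        print(name, extra)
```

In this toy case the sequence with a private substitution adds no conflicts, whereas the one that mixes states from two different benchmark lineages adds three. A real pipeline would still need a human to decide whether a flagged sequence is an error or genuine novel variation, and a count of zero does not show that a sequence is error-free.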

To date, the champion of this use of networks has been Hans-Jürgen Bandelt, who has presented a number of interesting practical examples over the past dozen years. Below, I have included an annotated list of some of the more interesting publications in this area.

Bandelt H-J, Quintana-Murci L, Salas A, Macaulay V (2002) The fingerprint of phantom mutations in mitochondrial DNA data. American Journal of Human Genetics 71: 1150-1160. — The first to explicitly suggest using networks, and then use median and quasi-median networks to detect errors in published human mtDNA control-region datasets