Wednesday, September 26, 2012

The interpretation of an evolutionary network is confounded by the fact that descendants of reticulation nodes have complex ancestry. Therefore, the concept of a Most Recent Common Ancestor (MRCA) is not as straightforward as it is for a tree, as there may be multiple paths from any one descendant back to its ancestors. This creates several possible interpretations of what we might mean by a MRCA.

Figure 1 illustrates the calculation of the MRCA in a tree of five taxa (A-E), showing the MRCA of taxa C and D. We simply trace each of the descendant taxa backward along the branches towards the root, and the ancestral node where all of these traces first intersect is the MRCA of those taxa.
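
The tracing procedure can be sketched in a few lines of Python. This is only a minimal illustration: the child-to-parent map, the internal node names (n1, n2, n3, root) and the tree shape are hypothetical, and are not claimed to match the topology of Figure 1.

```python
def mrca_in_tree(parent, taxa):
    """Return the MRCA of `taxa` in a rooted tree given as a child -> parent map."""
    def path_to_root(node):
        path = [node]
        while node in parent:          # the root has no parent entry
            node = parent[node]
            path.append(node)
        return path

    paths = [path_to_root(t) for t in taxa]
    # Nodes that lie on the trace of every taxon.
    shared = set(paths[0]).intersection(*map(set, paths[1:]))
    # Walking up from any one taxon, the first shared node is the MRCA.
    return next(node for node in paths[0] if node in shared)


# Hypothetical five-taxon tree (assumed for illustration only).
parent = {"A": "n1", "B": "n1", "C": "n2", "D": "n2",
          "n1": "n3", "n2": "n3", "n3": "root", "E": "root"}

print(mrca_in_tree(parent, ["C", "D"]))   # n2
print(mrca_in_tree(parent, ["A", "D"]))   # n3
```

Because every node in a tree has exactly one parent, each taxon has a single trace, and so the first intersection is always unique.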

Figure 1.

Figure 2 illustrates a more complex history, involving two hybridization events. The incoming branches to the reticulation nodes have arrows, for emphasis. The figure also recognizes several possible interpretations of the MRCA of taxa C and D (see Huson and Rupp 2008; Fischer and Huson 2010).

A conservative definition of the MRCA (or a stable MRCA) is the intersection of all paths from the descendants to the root, so that any reticulation pushes the MRCA back towards the root. In this example it pushes the MRCA all the way to the root. Alternatively, we could define the Lowest Common Ancestor (or the minimal common ancestor) as the shared ancestor that is furthest from the root along any path. That is, the LCA is not an ancestor of any other common ancestor of the taxa concerned.
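
The two definitions can be compared on a small example. The following is a sketch under stated assumptions: the network is stored as a map from each node to a list of its parents, and the toy network (with node names p, q, m, h) is hypothetical; it has a single hybridization and is not the network of Figure 2.

```python
def ancestors(parents, node):
    """Return `node` plus all of its ancestors in a rooted network.

    `parents` maps each node to a list of its parents; reticulation
    nodes have two parents, and the root has no entry.
    """
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def paths_to_root(parents, node):
    """List every path from `node` back to the root."""
    ups = parents.get(node, [])
    if not ups:
        return [[node]]
    return [[node] + path for up in ups for path in paths_to_root(parents, up)]


def lowest_common_ancestors(parents, taxa):
    """Common ancestors that are not an ancestor of any other common ancestor."""
    common = set.intersection(*(ancestors(parents, t) for t in taxa))
    return {c for c in common
            if not any(c != d and c in ancestors(parents, d) for d in common)}


def conservative_mrca(parents, taxa):
    """Lowest node that lies on every path from every taxon to the root."""
    paths = [p for t in taxa for p in paths_to_root(parents, t)]
    on_every_path = set(paths[0]).intersection(*map(set, paths[1:]))
    return next(node for node in paths[0] if node in on_every_path)


# Hypothetical network: C descends from hybrid node h, whose parents are p and m.
network = {"A": ["p"], "h": ["p", "m"], "C": ["h"], "D": ["m"],
           "m": ["q"], "B": ["q"], "p": ["root"], "q": ["root"]}

print(lowest_common_ancestors(network, ["C", "D"]))   # {'m'}
print(conservative_mrca(network, ["C", "D"]))         # root
```

In this toy case the single reticulation is enough to push the Conservative MRCA of C and D to the root, while the LCA stays at the immediate ancestor of D.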

Figure 2.

In the mathematical terminology of lattices, which can have an algebraic or order theoretic definition, the Conservative MRCA is called the Least Lower Bound (LLB) and the LCA is called the Greatest Lower Bound (GLB).

We could also have a biological compromise between these two mathematical concepts and recognize a Fuzzy MRCA, in which only a specified proportion of the paths (representing some proportion of the genomes) needs to be accommodated by the MRCA, thus keeping the MRCA close to the main collection of descendants (Fischer and Huson 2010). In this example, the Fuzzy MRCA represents 75% of the genome of taxon C and 100% of the genome of taxon D. (The Conservative MRCA represents 100% for both taxa, by definition; and in this example the LCA represents 50% of the genome of taxon C and 100% of the genome of taxon D.)
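
The percentages can be reproduced with a short calculation. This is a sketch only, and it rests on two assumptions that are mine: each reticulation node receives an equal share of its genome from each parent, and the toy network below (node names h1, k, m, f, o) is hypothetical, chosen merely so that the numbers come out as 50%, 75% and 100%.

```python
def weighted_paths(parents, node, weight=1.0):
    """Yield (path, weight) for every path from `node` to the root.

    The weight is split equally among the parents of each node (an assumption).
    """
    ups = parents.get(node, [])
    if not ups:
        yield [node], weight
        return
    for up in ups:
        for path, w in weighted_paths(parents, up, weight / len(ups)):
            yield [node] + path, w


def genome_fraction(parents, taxon):
    """Proportion of the taxon's genome that traces back through each ancestor."""
    fraction = {}
    for path, w in weighted_paths(parents, taxon):
        for node in path:
            fraction[node] = fraction.get(node, 0.0) + w
    return fraction


def fuzzy_mrca(parents, taxa, threshold):
    """Lowest nodes that account for at least `threshold` of every taxon's genome."""
    fractions = [genome_fraction(parents, t) for t in taxa]
    candidates = {n for n in fractions[0]
                  if all(f.get(n, 0.0) >= threshold for f in fractions)}
    # Node c is an ancestor of d exactly when c appears on a path from d to the root.
    return {c for c in candidates
            if not any(c != d and c in genome_fraction(parents, d)
                       for d in candidates)}


# Hypothetical network with two hybridizations (h1 and k).
network = {"C": ["h1"], "h1": ["m", "k"], "D": ["m"], "m": ["f"],
           "k": ["f", "o"], "f": ["root"], "o": ["root"]}

print(genome_fraction(network, "C")["f"])      # 0.75
print(fuzzy_mrca(network, ["C", "D"], 0.75))   # {'f'}
```

Setting the threshold to 0.5 returns the LCA (node m) and setting it to 1.0 returns the Conservative MRCA (the root), so in this toy network the Fuzzy MRCA interpolates between the two.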

Figure 3.

However, neither the Fuzzy MRCA nor the LCA is necessarily unique, although the Conservative MRCA will always be unique. Figure 3 shows an example where there are two independent LCAs of taxa C and D. Neither of these LCAs is an ancestor of the other, as required by the definition, and so they are both equal candidates as LCA. Each one represents 50% of the genome for both taxa C and D.

In terms of a lattice, Figure 2 is called a lower semi-lattice (or meet semi-lattice), because every pair of nodes has only one GLB, whereas Figure 3 is not a semi-lattice, because at least one node pair has more than one GLB.
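
One way to make this check concrete is to list every pair of nodes that lacks a unique lowest common ancestor. The sketch below assumes a node-to-parents map, and both toy networks are hypothetical (they are not Figures 2 and 3); in the second, C and D are each hybrids of the same two lineages, x and y.

```python
from itertools import combinations


def ancestors(parents, node):
    """Return `node` plus all of its ancestors (`parents`: node -> list of parents)."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def lowest_common_ancestors(parents, a, b):
    """Common ancestors of a and b that are not above any other common ancestor."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {c for c in common
            if not any(c != d and c in ancestors(parents, d) for d in common)}


def pairs_without_unique_lca(parents):
    """Node pairs with more than one lowest common ancestor (or none)."""
    nodes = set(parents) | {p for ups in parents.values() for p in ups}
    return [(a, b) for a, b in combinations(sorted(nodes), 2)
            if len(lowest_common_ancestors(parents, a, b)) != 1]


# Hypothetical network with one hybridization: every pair has a unique LCA.
one_hybrid = {"A": ["p"], "h": ["p", "m"], "C": ["h"], "D": ["m"],
              "m": ["q"], "B": ["q"], "p": ["root"], "q": ["root"]}

# Hypothetical network in which C and D are both hybrids of lineages x and y.
two_hybrids = {"C": ["x", "y"], "D": ["x", "y"], "x": ["root"], "y": ["root"]}

print(pairs_without_unique_lca(one_hybrid))    # []
print(pairs_without_unique_lca(two_hybrids))   # [('C', 'D')]
```

An empty list corresponds to the semi-lattice property as described above; any pair that is returned is one for which the LCA concept is ambiguous.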

This leads to the biological question of how we are best to interpret the MRCA in situations such as that represented by Figure 3. This is a question that does not yet seem to have been addressed by biologists. Figure 3 does not represent an impossible evolutionary history, although it may be an unusual one because one lineage hybridizes with another lineage twice, presumably at different times.

The lack of a unique LCA is clearly problematic, as it almost defeats the purpose of the concept of a MRCA. It would certainly make life easier if we could restrict evolutionary networks to the class of lower semi-lattices.

An alternative is to restrict the MRCA concept to the Conservative MRCA. However, it is easy to imagine situations where this pushes the MRCA so far towards the root of the network as to be uninformative, especially in cases involving horizontal gene transfer, which can occur between widely separated evolutionary groups. If we insist that a eukaryote MRCA represent 100% of the genome, and we include non-nuclear genomes in the calculation, then the Conservative MRCA creates an extreme theoretical problem.

A Fuzzy MRCA may be the best compromise between these two extremes, although there are obvious practical issues for obtaining agreement on how much of the genome history is to be discounted from the MRCA.

Monday, September 24, 2012

The spate of phylogenetic tree tattoos continues. As before, this is dominated by Charles Darwin's best-known sketch from his Notebooks (the "I think" tree) (see also Tattoo Monday III, V and IX). However, the "molecular tree" re-appears here (see Tattoo Monday IV); and finally there is a real cladogram tattoo for the purists.

Wednesday, September 19, 2012

I noted in an earlier post that studies of the dog genealogy seem to follow historical precedent, with trees being used for the analysis of whole-genome data and networks for the analysis of mitochondrial DNA data. However, domestic dog breeds do not have a simple tree-like ancestry, due to the cross-breeding involved in creating new breeds, and so the use of a tree model is inadequate. This was known long before the advent of molecular data, from comparative studies of phenotypes rather than genotypes, but genetic data have allowed us to attack this issue in a more directly quantitative way.

Anthropologists have traditionally used phylogenetic trees, especially when assessing the historical development of human "races", which have been assumed to maintain a strong degree of separation (see this earlier post). Clearly, networks would be more appropriate representations of history in many cases, especially where there is gene flow within a species or set of closely related species. This particularly applies to those fossils most closely related to humans, such as those of the Neandertals, a group of archaic hominins from the Middle Pleistocene who ranged right across Europe into western Siberia, but whose fossil record stops about 30,000 years ago (during the Late Pleistocene).

There have been a number of recent blog comments about the desirability of network analyses in historical anthropology (e.g. Dalton Luther, Jonathan Marks, PZ Myers, Dienekes Pontikos). As noted by Jason Antrosio, there "is a need to better understand and portray evolutionary complexity. With all the reports of Neandertal and Denisovan admixture, with all the emphasis on multispecies ethnography, with new looks at hybridization, we really must get away from the overly simplistic tree diagrams and taxonomies that have so long dominated evolutionary imagery". (Denisovans consist of a hominin fossil finger bone and some teeth from the Denisova Cave, in Siberia, which have yielded nucleotide sequences strikingly different from those of both Neandertals and modern humans. As Todd Wood has noted: "they're a genome in search of a fossil record.")

Here, I use networks to evaluate some of the available genotype data for the relationships between humans, Neandertals and Denisovans.

Nuclear genome

There is currently very little whole-genome data for ancient hominins, but what there is clearly shows "that Neandertals, Denisovans, and others labelled archaic are in fact an interbreeding part of the modern human lineage ... There has been extensive admixture across modern humans for tens of thousands of years, and at least some admixture across several archaic groups" (from Jason Antrosio again). Clearly, this is a situation for which networks were especially designed.

As an example, we can take the "pairwise autosomal DNA sequence divergences" provided by Reich et al. (2010) for five of the genomes for which they collected SNP data. We cannot derive an evolutionary network directly from these data, but a data-display network will allow us to assess how tree-like the presented data are. Figure 1 shows a NeighborNet analysis of the data. This indicates that the data are strongly tree-like, mainly because of the authors' concerted attempts to "clean up" the data from sequencing and analysis artifacts that would otherwise obscure the tree signal in ancient DNA. Nevertheless, there are still two detectable non-tree signals: one linking the Denisovan to the Neandertal from Mezmaiskaya (both fossil locations are in Russia), and a larger one linking the Denisovan to the Yoruba human (from a West African ethnic group). The first signal may represent non-tree gene flow, although the second signal is harder to explain (ancestral polymorphism, perhaps?).
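
A simpler numerical check of how tree-like a distance matrix is can be based on the four-point condition. This is a swapped-in technique and not the NeighborNet algorithm itself: for any four taxa on a tree, the two largest of the three pairwise sums of distances are equal, so a quartet "delta" score of 0 means perfectly tree-like. The sketch below uses made-up distances for taxa A-D, not the divergences of Reich et al. (2010).

```python
from itertools import combinations


def symmetric(pairs):
    """Build a nested distance lookup from a {(x, y): distance} dictionary."""
    dist = {}
    for (x, y), value in pairs.items():
        dist.setdefault(x, {})[y] = value
        dist.setdefault(y, {})[x] = value
    return dist


def delta_scores(dist, taxa):
    """Quartet delta scores: 0 is tree-like, larger values (up to 1) are less so."""
    scores = {}
    for a, b, c, d in combinations(taxa, 4):
        sums = sorted([dist[a][b] + dist[c][d],
                       dist[a][c] + dist[b][d],
                       dist[a][d] + dist[b][c]])
        spread = sums[2] - sums[0]
        scores[(a, b, c, d)] = 0.0 if spread == 0 else (sums[2] - sums[1]) / spread
    return scores


taxa = ["A", "B", "C", "D"]

# Made-up distances that fit the tree ((A,B),(C,D)) exactly.
tree_like = symmetric({("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 4,
                       ("A", "D"): 4, ("B", "C"): 4, ("B", "D"): 4})

# The same distances, except that A and C are closer than the tree allows.
conflicting = symmetric({("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 3,
                         ("A", "D"): 4, ("B", "C"): 4, ("B", "D"): 4})

print(delta_scores(tree_like, taxa))     # {('A', 'B', 'C', 'D'): 0.0}
print(delta_scores(conflicting, taxa))   # {('A', 'B', 'C', 'D'): 0.25}
```

Scores like these summarize the amount of non-tree signal, but unlike a splits graph they do not show which groupings are in conflict.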

Figure 1. NeighborNet analysis of the autosomal DNA sequence divergences for two modern humans (San, Yoruba), two fossil Neandertals (Mezmaiskaya, Vindija), and a fossil Denisovan.

Mitochondrial genome

Mitochondrial DNA (mtDNA) is the most commonly collected source of genetic data, especially sequences of the so-called control region (including the D-loop). Moreover, it is now quite commonplace to sequence the >16,500 nt of the mtDNA genome, as indicated by the contents of the mtDB (Ingman and Gyllensten 2006) and MitoTool (Fan and Yao 2011) databases. Mitochondrial DNA has also been successfully extracted from ancient hominins. Indeed, there are now sequences for the entire mtDNA genome of Denisovans (Krause et al. 2010a), Neandertals (Green et al. 2008, Briggs et al. 2009), and early modern humans (Ermini et al. 2008, Gilbert et al. 2008, Krause et al. 2010b). Compared to nuclear DNA, ancient mtDNA has a greater survival rate and greater degree of sequencing coverage, which leads to a markedly reduced influence of post-mortem damage and contamination (see Ho and Gilbert 2010).

The major assumed advantages of using mtDNA are (i) the high copy number, (ii) the maternal mode of inheritance, (iii) the high substitution rate (resulting in variation even at the intraspecific level), (iv) the lack of recombination (so that historical relationships can be modelled by a phylogenetic tree), and (v) a molecular clock that is considered to be relatively reliable (so that the dates of historical events can be estimated). Both of the latter two assumptions have been disputed, however, as discussed by McVean (2001) for recombination and Endicott et al. (2009, 2010) for the clock.

The available data indicate that recombination in mtDNA is rare, if it occurs at all. Furthermore, gene flow is unlikely to complicate the historical relationships, because the mitochondrion is almost always inherited maternally and there is little evidence of historical movement by single females between populations, as opposed to movement by males. So, a phylogenetic tree is a reasonable model of evolutionary history for mtDNA, unlike the situation for the nuclear genome.

On the other hand, there are a number of issues that will make any attempt to reconstruct a tree problematic. That is, the data will not be tree-like, even if the genealogical history was tree-like. First, the genes in mtDNA are completely linked as a single locus, which will lead to deep coalescence (incomplete lineage sorting), thus disconnecting gene history and organism history. Second, mtDNA exhibits considerable heterogeneity in nucleotide-substitution rates along the genome, with the control region having very high rates (up to 10x that of the rest of the mtDNA) and codon second positions having very low rates. Indeed, it is likely that substitutional saturation occurs in the control region, and that purifying selection occurs at first and second codon positions. There will be an enormous amount of homoplasy under these circumstances (eg. parallel substitutions). Third, there is evidence of different nucleotide-substitution rates in different lineages, even when those lineages are closely related. This will also cause homoplasy.

There have been three responses to these problems by those who study human mtDNA. First, trimming of the sequence data occurs. For example, there are well-known nucleotide positions that are usually deleted because their variation seems random, and others whose excessive variation leads them to be down-weighted. Second, a network is used to assess how non-tree-like are the data. People have developed several network methods explicitly for mtDNA data, such as Median-Joining and Reduced-Median networks; and the literature is replete with papers using these methods to analyze mtDNA sequences. Third, a partitioned model is needed in order to build a phylogenetic tree. Notably, the different codon positions need separate substitution models, as do the control region and the RNA-coding regions. Furthermore, rate heterogeneity needs to be modelled, and a relaxed molecular clock is needed.

These problems are bad enough for the study of within-human phylogenies, but they are even more problematic for the study of ancient DNA. For example, substitutional saturation means that the control region, and especially the three hypervariable regions (HVR1, HVR2, HVR3) that are the most frequently sequenced parts of it, is almost useless for reconstructing ancient history. This can be seen, for example, in the data of Dalén et al. (2012), who analyzed the mtDNA control regions of 13 Neandertals and 1 Denisovan. Dalén et al. produced a Bayesian tree from these data, but in Figure 2 I show a Median Network instead. (This displays all of the maximum-parsimony trees simultaneously.) There may well be an evolutionary tree in these data, but if so then it is pretty deeply buried, and it is unlikely to be recovered reliably without a lot of work.

Unfortunately, for the study of ancient DNA very little seems to be done about the problems of homoplasy, in terms of any of the three suggested solutions. Indeed, most of the concern seems to be about potential post-mortem damage to the DNA (eg. extra substitutions in the terminal branches), instead. For example, I have checked 21 empirical phylogenetic studies involving Neandertal mtDNA (published since 1997). First, only 6 of them noted that they had either down-weighted or excluded particular hyper-variable nucleotide positions: Krings et al. (1999, 2000), Caramelli et al. (2006), Ermini et al. (2008), Moradi and Schuster (2008) and Endicott et al. (2010). Second, only three of the papers presented an empirical network analysis: Ermini et al. (2008) (a Reduced-Median network), Caramelli et al. (2006) (a Median-Joining network) and Caramelli et al. (2008) (a TCS network); the rest presented either a tree, an ordination, or no empirical diagram at all. Third, only two of the analyses performed a partitioned tree-building analysis: Green et al. (2008) and Endicott et al. (2010). Finally, 14 of the 21 papers were based on sequences of the control region only, which makes their phylogenetic inferences questionable.

If I concentrate here on the production of a phylogenetic network, as I should be doing in this blog, then it will become obvious why tree-building analyses are rather difficult for Neandertal sequence data. Figure 3 uses a data-display network to show the non-tree features of the available Neandertal mtDNA genomes. Note that there is very little common variation at all, meaning that Neandertal mtDNA has very limited genetic variation. Moreover, there are no tree-like parts to the diagram, with every parsimony-informative nucleotide position being contradicted by at least one other. Analyzing these data with a simple tree-building analysis seems to be inappropriate, to say the least.
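
What it means for one position to contradict another can be shown with a pairwise compatibility check. For binary characters, two positions can fit on the same tree without homoplasy unless all four combinations of states (00, 01, 10, 11) occur among the sequences. The sketch below applies this test to five made-up binary sequences; they are not the Neandertal data, and this is not the Median Network algorithm itself.

```python
from itertools import combinations


def compatible(column_i, column_j):
    """Two binary characters conflict only if all four state pairs are observed."""
    return len(set(zip(column_i, column_j))) < 4


def conflicting_positions(sequences):
    """Return the index pairs of alignment columns that contradict each other."""
    columns = list(zip(*sequences))
    return [(i, j) for i, j in combinations(range(len(columns)), 2)
            if not compatible(columns[i], columns[j])]


# Made-up binary alignment: one sequence per row, one character per column.
sequences = ["000",
             "100",
             "110",
             "011",
             "001"]

print(conflicting_positions(sequences))   # [(0, 1), (1, 2)]
```

In this toy alignment positions 0 and 2 agree with each other, but position 1 contradicts both, and so no single tree can explain all three positions without homoplasy.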

Figure 3. Median Network analysis of the six full-length mtDNA genomes currently available for Neandertals. The numbers on the branches indicate the number of characters that change along each branch.

To assess the relationship between Neandertals and humans (which seems to be the most common ancient-DNA question addressed in the literature), we can add the Denisovan mtDNA sequence, plus the 3 available sequences for early modern humans, and also some sequences from a range of modern humans (ie. the revised Cambridge Reference Sequence, plus 53 sequences from Ingman et al. 2000). However, we then cannot plot the Median Network because several of the aligned positions are no longer binary (ie. they are not SNPs). So, I will use a NeighborNet analysis for the data display instead, as shown in Figure 4. The first thing to note is that the genetic variation in the Neandertal mtDNA is much less than that in the human mtDNA, and probably less than can be accounted for solely by the smaller sample size (6 genomes versus 54).

Second, there is clearly an underlying tree-like structure to the data, as expected, which I have emphasized by plotting the related Neighbor-Joining tree for comparison in Figure 5 (the NeighborNet analysis is a generalization of the Neighbor-Joining tree). However, there is just as clearly considerable non-tree structure to the data, notably in the relationship of the Denisovan sequence to the other sequences, but also in the relationship between the Neandertals and the humans. It is this non-tree structure that complicates any attempt to reconstruct the evolutionary relationship of the Neandertals to humans; and it appears to result, at least partly, from the homoplasy caused by saturation of nucleotide substitutions.

Figure 5. Neighbor-Joining tree of the same data used for Figure 4.

However, even the NeighborNet analysis cannot summarize all of the non-tree patterns in the data, but presents instead a selective summary of them. To get further insight into the extent of the problem, I have deleted the 53 human sequences, and then plotted the Pruned Quasi-median network in Figure 6. This network is the equivalent of the Median Network while allowing for non-binary sequence positions. It is difficult to believe that these data were created by a simple tree-like evolutionary process or, if they were, that the process will be easy to reconstruct.

Figure 6. Pruned Quasi-median network analysis of the mtDNA genomes from 6 Neandertals, 1 Denisovan, 3 early modern humans and 1 contemporary human (the revised Cambridge Reference Sequence). The branch lengths are not drawn to scale.

Anyway, the most-common network approach to trying to untangle this sort of mess in mtDNA sequence data is to use either a Reduced-Median network or a Median-Joining network, which are simplifications of the full Median Network. I have produced a Median-Joining network in Figure 7, as an example. The interesting thing to note here is that the Denisovan sequence does not connect to the rest of the network between the Neandertal cluster and the human cluster of sequences, which it does do in all of the published phylogenetic trees. This pattern is not unexpected, given the pattern shown in the Pruned Quasi-median network (Figure 6), but it does suggest that the tree-building analyses performed to date are somewhat naïve in the face of considerable sequence complexity, by not explicitly dealing with that complexity.

Figure 7. Median-Joining network analysis of the same data used for Figure 4. Only the sequences from Figure 6 are labelled; the other dots are the remaining 53 contemporary humans, plus some inferred ancestors. The branch lengths are not drawn to scale.

Conclusion

The phylogenetic analysis of Neandertal mtDNA has been critiqued a number of times before (eg. Gutiérrez et al. 2002, Hebsgaard et al. 2007, Moradi and Schuster 2008, Endicott et al. 2010). However, this has always been in the context of "providing a better tree-building analysis", rather than in the context of evaluating and displaying the conflicting information that complicates the tree-building analysis, as I have done here. In this context, it is important to note that none of the diagrams that I have produced here are evolutionary networks, and so they do not represent a reconstruction of evolutionary history. They are intended merely to display the convoluted nature of the ancient mtDNA sequence data, and to emphasize the valuable role that phylogenetic networks can play in evaluating such data.

One further point worth noting is that these diagrams are all unrooted, which neatly avoids the problems associated with adding a chimpanzee sequence in order to locate the root of the evolutionary history. Adding this sequence dramatically increases the sequence complexity, of course. In particular, the nuclear genome apparently places the Denisovan as the sister to the Neandertals whereas the mtDNA places it as the sister to Neandertals+humans (eg. note that the mid-point rooting of Figure 5 would be on the branch leading to the Denisovan).

Monday, September 17, 2012

There are no new problems in science, only old ones that obtrude themselves on you in new ways. One biological example of this truism is the discovery by molecular biologists that different genetic datasets produce different phylogenetic trees. This has been known at the phenotypic level for centuries, so that systematists have used reticulating structures such as maps, networks, webs etc to try to display the many different ways in which biological organisms appear to be related to each other, with different relationships shown by different parts of the phenotype.

This multi-faceted nature of relationships became problematic after 1859, when biologists more-or-less settled on trees as a means of displaying genealogy, because trees show only one set of relationships. Indeed, Darwin himself (1851, 1859) noted that different organ systems suggest different relationships. However, he believed that a taxonomic classification should be based on the structure of the entire body, not just one organ system, so that all relationships are considered. He thus concluded that it is necessary to give each kind of feature a different "weight" in contributing to the overall scheme, both for working out phylogenies and for erecting classifications based on them.

Darwin did no explicit phylogenetic work of his own, and so it was St George Mivart who first had to deal with this situation empirically (as discussed in this previous blog post). His early work was principally on the comparative anatomy of primates, for which he provided very detailed comparisons of the skeletons of a large number of species, notably in Mivart (1865), based on the axial skeleton (or spinal column), and Mivart (1867), based on the appendicular skeleton (or limbs).

In the 1865 paper Mivart noted that the data for the spinal column "lead to an arrangement of groups and an interpretation of affinities somewhat differing from, yet in part agreeing with, the classification founded on cranial and dental characters". Moreover, the 1865 and 1867 studies did not produce the same phylogenetic tree. Mivart explicitly noted in a letter to Darwin (1870): "The diagram in the Pro. Z. Soc. [1865] expresses what I believe to be the degree of resemblance as regards the spinal column only. The diagram in the Phil. Trans. [1867] expresses what I believe to be the degree of resemblance as regards the appendicular skeleton only" (Darwin Correspondence Project, letter 7170).

In the modern world, one way to deal with this sort of data conflict is to use a phylogenetic network rather than a phylogenetic tree. That is, we do not need to produce a consensus classification based on weighting different datasets (as suggested by Darwin), and we do not need to produce a series of conflicting trees (as done by Mivart). We do not even need to produce a consensus tree, which would be a combination of these two approaches. We can, instead, display the conflict among the different data sources in a single diagram, as a network.


Here, I have produced a network based on Mivart's 1865 and 1867 trees. There are 24 taxa of Primates, although the Gorilla was not in the 1865 dataset, and Inuus, Cynocephalus and Chrysothrix were not in the 1867 dataset. From these two trees I have produced a SuperNetwork, using the SplitsTree program. This is thus a splits graph, interpreted in the usual manner, so that the reticulated parts represent conflict between the two trees and the non-reticulated parts represent agreement.
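
The idea of conflict between trees can be illustrated by comparing the splits (bipartitions of the taxa) that each tree implies. Two splits can sit on the same tree only if at least one of the four pairwise intersections of their sides is empty; incompatible splits are what produce the netted areas of a splits graph. The sketch below uses two toy trees written as nested tuples on the same five taxa. They are assumptions made for illustration and are not Mivart's trees, although they mimic a disagreement about the sister of Simia; the SuperNetwork method additionally copes with trees that have different taxon sets, which this sketch does not attempt.

```python
def leaves(tree):
    """Set of leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*map(leaves, tree))


def splits(tree):
    """Non-trivial splits of the tree, each stored as a pair of taxon sets."""
    taxa = frozenset(leaves(tree))
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        for child in node:
            side = frozenset(leaves(child))
            # Skip splits that separate a single taxon from the rest.
            if 1 < len(side) < len(taxa) - 1:
                found.add(frozenset([side, taxa - side]))
            visit(child)

    visit(tree)
    return found


def compatible(split_a, split_b):
    """Splits are compatible if some side of one misses some side of the other."""
    return any(not (x & y) for x in split_a for y in split_b)


# Toy trees (hypothetical); Outgroup1 and Outgroup2 are placeholder taxa.
tree_one = ((("Simia", "Troglodytes"), "Hylobates"), ("Outgroup1", "Outgroup2"))
tree_two = ((("Simia", "Hylobates"), "Troglodytes"), ("Outgroup1", "Outgroup2"))

shared = splits(tree_one) & splits(tree_two)
conflicts = [(a, b) for a in splits(tree_one) for b in splits(tree_two)
             if not compatible(a, b)]

print(len(shared), len(conflicts))   # 1 1
```

Here the two trees agree on the split that separates the placeholder outgroups from the rest, and disagree in exactly one pair of splits, which a splits graph would draw as a single box.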

The main conflict between the two trees is in the relationship between the Nycticebinae and Cebidae, shown as the centre reticulated area in the network. Basically, the root of Mivart's trees is between these two groups, and they swap sides of the root between 1865 and 1867! Another cause of this netted area is whether Homo is within the Simiinae (as in the 1865 tree) or sister to the Apes (as in the 1867 tree). These two sources of conflict are quite major, from the biological point of view.

The bottom netted area of the network is caused by conflicts about relationships within the Lemuroidea, which are of less consequence. The top netted area refers to whether Simia is sister to either Hylobates (in 1867) or Troglodytes (in 1865), which is a relatively minor point.

It is tempting to see Mivart's change in the position of Homo (from 1865 to 1867) as psychological rather than empirical. Mivart came to reject the idea that humans should be placed in a phylogenetic tree, and expressed this strongly in Mivart (1871). He then became a strong opponent of Darwinian evolution for a time. Nevertheless, he returned to the fold at least once, in Mivart (1881), where he presented a phylogenetic tree of much of the Mammalia, based on dentition. However, he conspicuously excluded the Primates from this tree, thus dodging the theological problem entirely.

Wednesday, September 12, 2012

Current methods for evolutionary networks include: (i) combining trees, clusters or triplets into what is usually called a hybridization network (but could also be a horizontal gene transfer, or HGT, network), and (ii) decomposing ordered character data into what is called a recombination network (or ancestral recombination graph). Much work on these two approaches has been carried out recently within the bioinformatics community, and this is continuing.

However, the biology community has sometimes taken a different approach. Notably, work has concentrated on constructing models for detecting reticulation events in various types of molecular data, such as comparative genome analysis for HGT, or quantifying inter-population gene flow (eg. due to migration). A network is then manually constructed by adding reticulation branches to a phylogenetic tree of the organisms concerned. Indeed, in many cases the network diagram is not presented explicitly in the publications, but is merely implied from a list of the sources and sinks of the gene flows detected.

The network model for this latter type is thus essentially "a tree obscured by vines", although the network can actually become rather complicated. The basic idea has a long history (Lathrop 1982), although it has only recently become popular. In this blog post I highlight one line of recent work that takes this approach, which involves admixture graphs in population genetics.

Introduction

Historically, population genetics has concentrated on estimating various population parameters from quantitative models of gene history, notably rates of population expansion/contraction, rates of migration, timing of divergence, and presence/absence of bottlenecks. This is rarely done in any graphical way, relying instead on summary statistics. Alternatively, graphical methods such as principal components analysis and agglomerative clustering have been used to summarize the genetic data, and from this summary various scenarios can be deduced post hoc about possible population history (e.g. Skoglund and Jakobsson 2011; Hodoglugil and Mahley 2012).

However, more recently, explicit models of historical gene flow between populations have been developed, usually within the context of generalizing a phylogenetic tree. A tree can be used to represent historical relationships in the absence of significant amounts of gene flow, but not otherwise. So, the general approach has been to use a tree as the null model (representing absence of gene flow), and then to test how many reticulation events are needed to significantly improve the fit of the data to an increasingly complex network. The resulting diagram is called an admixture graph, which thus models both population divergence and gene flow. The reticulations represent the different proportions of genetic mixing between pairs of populations.

A model of population separation and admixture, from Reich et al. (2011) p. 522.

Methods

There are several computer programs that quantify population structure in the presence of admixture between populations, such as the models used in the older programs Structure, BAPS and TESS (see François and Durand 2010), as well as in more recent programs like Admixture (Alexander et al. 2009). However, the most recent programs have been developed specifically to deal with network analysis of genome-wide single nucleotide polymorphism (SNP) data. The populations studied will usually be within a single species, but this need not be so.

The TreeMix program (Pickrell and Pritchard 2012) is described by the authors as follows: "Our goal is to provide a statistical framework for inferring population networks that is motivated by an explicit population genetic model, but sufficiently abstract to be computationally feasible for genome-wide data from many populations (say, 10-100) ... Our approach to this problem is to first build a maximum likelihood tree of populations. We then identify populations that are poor fits to the tree model, and model migration events involving these populations." This process proceeds as for the standard tree-based approach except that the likelihood model also includes migration weights: "Estimation involves two major steps. First, for a given graph topology, we need to find the maximum likelihood branch lengths and migration weights. Second, we need to search the space of possible graphs. [For] a given graph topology, we iterate between optimizing the branch lengths and weights ... [Then,] to search the space of possible graphs, we take a hill-climbing approach."

The AdmixTools program (Patterson et al. 2012), as claimed by the authors, "has some similarities to the TreeMix method but differs in that TreeMix allows users to automatically explore the space of possible models and find the one that best fits the data (while our method does not), while our method provides a rigorous test for whether a proposed model fits the data (while TreeMix does not)." The explicit testing of the fit of data and model is "based on studying patterns of allele frequency correlations across populations. The 3-population test is a formal test of admixture and can provide clear evidence of admixture, even if the gene flow events occurred hundreds of generations ago. The 4-population test ... is also a formal test for admixture, which can not only provide evidence for admixture but also provide some information about the directionality of the gene flow. The F4 ratio estimation allows inference of the mixing proportions of an admixture event".
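
The logic of the 3-population test can be shown with a toy calculation. The statistic f3(C; A, B) is the average over SNPs of (c - a)(c - b), where a, b and c are the allele frequencies in the three populations; a clearly negative value is evidence that C is a mixture of populations related to A and B. The sketch below is a naive version computed directly from made-up population frequencies. It omits the finite-sample corrections and the significance testing that a real analysis would need, and it is not the AdmixTools code.

```python
def f3(c, a, b):
    """Naive f3(C; A, B): mean over SNPs of (c - a) * (c - b)."""
    return sum((ci - ai) * (ci - bi) for ci, ai, bi in zip(c, a, b)) / len(c)


# Made-up allele frequencies at four SNPs in two source populations.
freq_a = [0.1, 0.4, 0.8, 0.3]
freq_b = [0.5, 0.2, 0.6, 0.9]

# Population C is modelled as an exact 50:50 mixture of A and B (an assumption).
alpha = 0.5
freq_c = [alpha * x + (1 - alpha) * y for x, y in zip(freq_a, freq_b)]

print(f3(freq_c, freq_a, freq_b) < 0)   # True: C looks admixed
print(f3(freq_a, freq_b, freq_c) > 0)   # True: A does not
```

With noiseless mixing the statistic equals the negative of alpha(1 - alpha) times the mean squared frequency difference between A and B, which is why it can only be negative for the admixed population in this toy case. In real data, genetic drift in C after the admixture adds a positive term, so a non-negative value does not rule out admixture.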

These methods have not yet been subjected to any critical evaluation independently of their developers, although various blog authors have been actively investigating them (e.g. these posts by Dienekes Pontikos: 1, 2, 3). The general approach, of adding reticulations to an initial tree, is reminiscent of that taken by the T-Rex program to produce reticulograms, which has been subject to criticisms (Gauthier and Lapointe 2002, 2007; Huson et al. 2011), some of which may apply to the admixture methods as well.

Monday, September 10, 2012

In previous posts I have illustrated several of the evocative metaphors that have been used to describe reticulating evolutionary relationships. Today, I thought that I might produce a list of the ones that I know about. I have included the first source that I am aware of, along with a picture. Metaphors are a special case of analogies (together with similes) and we should, of course, be wary of taking them literally.

Warp and weft

Roland B. Dixon (1928) The Building of Cultures. Charles Scribner's Sons, New York.
(the warp and weft to the tapestry of culture history) "of these two elements the fabric of a people's culture is woven. The foundation or warp comes from within [heritage], the exotic elements or weft, from without [diffusion from other groups]"

Intermingled blood streams

Earnest A. Hooton (1931) Up From the Ape. Macmillan, New York.
"the various ways in which human blood streams have intermingled to form the principal races ... a sort of arterial trunk with offshoots and connecting vessels"

Banyan tree

Ralph Linton (1955) The Tree of Culture. Alfred A. Knopf, New York.
"the branches of the banyan tree cross and fuse and send down adventitious roots, which turn into supporting trunks"

Braided river

John H. Moore (1994) Putting anthropology back together again: the ethnogenetic critique of cladistic theory. American Anthropologist 96: 925–948.
"the channels of a river separate and recombine in a complex fashion, just as the component populations of the human species separate and recombine" (this contrasts with the "forked river" metaphor for dichotomous evolution)
Update: This metaphor goes back much further; see Rivers of Life, instead of trees.

Rhizome

Gilles Deleuze and Félix Guattari (1976) Rhizome; first published translation by Paul Foss and Paul Patton in 1981. This article appeared in revised form as the Introduction to Mille Plateaux (1980) Les Editions de Minuit, Paris.
"The evolutionary scheme would be made not only based on tree-like models of descent, but along a rhizome, built directly within heterogeneous populations and jumping from one already differentiated line to another one." (a rhizome is an underground stem that sends out roots and shoots that develop into new plants)

Cobweb

"HGT events, even when relatively common, still leave the treelike history of phylogenies intact, much like cobwebs hanging from tree branches."

Ring

Maria C. Rivera and James A. Lake (2004) The ring of life provides evidence for a genome fusion origin of eukaryotes. Nature 431: 182–185.

Anastomosing plexus

I have used this expression a couple of times, having learned it in my youth, but it has not yet caught on. A plexus is a combination of interlaced parts, and the term is most commonly used for nerves, blood vessels or lymphatics. The picture shown here is of the cervical and brachial nerve plexuses.

Net / Network

This is a tricky one. Donati used the Italian word "rete" (net or network) for biological relationships, but he was not explicitly referring to evolution: "the natural progressions should have to be compared more to a net than to a chain, that net being, so to speak, woven with various threads". Buffon used the French word "arbre" (tree) for his seminal reticulating diagram that clearly referred to genealogy, describing it as "a kind of family tree". Duchesne also used the French expression "arbre généalogique" (family tree) for his later network, while Pax used the German word "verwandtschaftlichen" (family relationship) for his. So, it is not clear who first actually used the word "network" in an explicitly evolutionary context. The first reference to a "phylogenetic net" was probably by Grant, and the first reference to a "phylogenetic network" by Holmquist.

Web

This is just as tricky as "network". The metaphor has a long history but not necessarily in an evolutionary context; and it may be best left as a metaphor for ecosystem relationships. It has recently been revived by evolutionary bacteriologists dealing with horizontal gene transfer.

Note: many other metaphors have been used for biological relationships (such as scale, map, crystal, tangled bank), but these have not been applied to explicitly evolutionary relationships, as far as I know.

Saturday, September 8, 2012

There have been a number of recent posts in the blogosphere about what is perceived to be the rather poor quality of many computer programs in bioinformatics. Basically, many bioinformaticians aren't taking seriously the need to properly engineer software, with full documentation, standard development practices and version control.

I thought that I might draw your attention to a few of the posts here, for those of you who write code. Most of the posts have a long series of comments, which are themselves worth reading, along with the original post.

At the Byte Size Biology blog, Iddo Friedberg discusses the nature of disposable programs in research, which are written for one specific purpose and then effectively thrown away: Can we make accountable research software?
Such programs are "not done with the purpose of being robust, or reusable, or long-lived in development and versioning repositories." I have much sympathy for this point of view, since all of my own programs are of this throw-away sort.

However, Deepak Singh, at the Business|Bytes|Genes|Molecules blog, fails to see much point to this sort of programming: Research code
He argues that disposable code creates a "technical debt" from which the programmer will not recover.

Titus Brown, at the Living in an Ivory Basement blog, extends the discussion by considering the consequences of publishing (or not publishing) this sort of code: Anecdotal science
He considers that failure to properly document and release computer code makes the work anecdotal bioinformatics rather than computational science. He laments the pressure to publish code that is not yet ready for prime-time, and the fact that computational work is treated as secondary to the experimental work. Having myself encountered this latter attitude from experimental biologists (the experiment gets two pages of description and the data analysis gets two lines), I entirely agree. Titus concludes with this telling comment: "I would never recommend a bioinformatics analysis position to anyone — it leads to computational science driven by biologists, which is often something we call 'bad science'." Indeed, indeed.

Back at the Business|Bytes|Genes|Molecules blog, Deepak Singh also agrees that "a lot of computational science, at least in the life sciences, is very anecdoctal and suffers from a lack of computational rigor, and there is an opaqueness that makes science difficult to reproduce": Titus has a point

This leads to Iddo Friedberg's post (the Byte Size Biology blog) that mentions this concept: The Bioinformatics Testing Consortium
This group intends to act as testers for bioinformatics software, providing a means to validate the quality of the code. This is a good, if somewhat ambitious, idea.

Finally, Dave Lunt, at the EvoPhylo blog, takes this to the next step, by considering the direct effect on the reproducibility of scientific research: Reproducible Research in Phylogenetics
He notes that bioinformatics workflows are often complex, using pipelines to tie together a wide range of programs. This makes the data analysis difficult to reproduce if it needs to be done manually. Hence, he champions "pipelines, workflows and/or script-based automation", with the code made available as part of the Methods section of publications.

Monday, September 3, 2012

One concern with the current move from phylogenetic trees to phylogenetic networks is the increased complexity of a reticulating network versus a dichotomous tree. People fundamentally have trouble with interlinked and overlapping structures, and a network is more complex than a tree, just as a tree is more complex than a chain (see this previous blog post).

However, if we restrict ourselves to a two-dimensional representation, then there is a limit to how complex a network can be and yet still be interpretable. The network shown here, published by the anthropologist Franz Weidenreich, comes close to that limit.

Pedigree of the Hominidae, from Weidenreich (1947) p. 201.

This is usually referred to as a "trellis" or "lattice", for obvious reasons. It first appeared in Weidenreich (1946) and then again in Weidenreich (1947); and it has recently been re-published several times (e.g. by Brace 1981; Templeton 2007; Caspari 2008). It is "an attempt to present graphically the relation between the different hominid forms in time and space", expressing Weidenreich's idea that evolution is "transformation, in close connection with inter-breeding".

The labelled circles refer to named fossil species of the Hominidae. According to Weidenreich, the vertical lines represent different stages of human evolution through time, the horizontal lines represent the morphological differentiation between different geographical regions, and the diagonal lines represent patterns of gene flow ("crossing") between the populations. Thus, the trellis emphasizes continuity of descent (and ancestry) through time within geographical regions (vertically), while also emphasizing gene flow between the regional lineages (horizontally and diagonally). In particular, note that in the figure the horizontal and diagonal lines are just as important as the vertical lines — this is not a tree obscured by vines!

Weidenreich viewed humans as being a single polytypic species throughout the Middle and Late Pleistocene, with nearly continuous gene flow during that time. This gene flow was seen as an integral part of the evolution of modern humans, dispersing genes throughout the species, so that any one recent human is likely to have had Pleistocene ancestors from different parts of the planet. This has been called a "polycentric model" of human evolution, also known as the "multi-regional model".

From Howells (1959).

However, racial thinking (as discussed in this previous blog post) has led to tree-like models of human evolution, and so Weidenreich's network model of inter-connected groups was either ignored or mis-interpreted (see Brace 1981; Caspari 2003; Templeton 2007). In particular, the trellis was repeatedly re-drawn as a tree, usually referred to as a candelabrum. This mis-representation started with the work of William White Howells (e.g. Howells 1959), as shown above, which then became the source for most subsequent discussions of the multi-regional model, rather than Weidenreich's original. (Actually, Howells' mis-interpretation of Weidenreich's multi-regional model dates back to 1942; see Hawks & Wolpoff 2003.)

Interestingly, the trellis metaphor has been revived as a model for recent human evolution, notably by Alan Templeton, as shown in the final two figures.