Wednesday, August 29, 2012

Evolutionary networks have recently become a hot topic of discussion. However, although networks have a rather long history in some parts of biology (see this previous post), it is phylogenetic trees, rather than phylogenetic networks, that have historically dominated. Interestingly, one research area where networks were somewhat more common during the first half of the 20th century was anthropology.

Humans have long been considered to have a reticulate evolutionary history, both genetically and culturally (Moore 1994), and anthropologists have, on occasion, therefore used networks as one of their representations of that intra-species history (Brace 1981). This does not mean that trees have not dominated in anthropology also (Caspari 2003), as elsewhere; and the consequences of reticulation for anthropological studies form an ongoing debate (Holliday 2003; Arnold 2009). Interestingly, modern anthropologists are still coming to terms with the genetics of reticulation (see Jolly 2009), having previously been distracted by the Evolutionary Synthesis as well as by fossils (Hawks and Wolpoff 2003).

Some Anthropological Trees and Networks

We can start this brief survey with a tree from Arthur Keith (1915). There is no indication of reticulation at this early stage of the century, and thus the genealogy seems to owe much in principle to Ernst Haeckel's (1868) tree from the previous century.

Keith (1915) Figure 187. Genealogical tree of man's ancestry.

Carleton Stevens Coon (1939) had a polyphyletic view of human origins but also believed in a degree of reticulation, as shown in the next diagram. Note, however, that the most common European race, Mediterranean, does not take part in the reticulation events.

Earnest Albert Hooton (1931) was an even stronger believer in the reticulate nature of human microevolution. He commented that the following figure: "represents my idea of the various ways in which human blood streams have intermingled to form the principal races. It is not a family tree, but a sort of arterial trunk with offshoots and connecting vessels."

Hooton (1931) Figure 58. The blood streams of human races.

He modified this figure for the revised edition of his book (Hooton 1946), making it even more complex.

Hooton (1946) Figure 68. Blood streams of human races.

Elsewhere in the same book, Hooton (1946) produced this next diagram, which expresses a more phylogenetic idea. Indeed, it comes very close to the modern idea of a tree obscured by vines.

Hooton (1946) page 413.

Finally, we can consider a modern anthropological network, based on polymorphic genetic markers. This one is from Campbell and Tishkoff (2010), in which they note: "decreasing intensity of color represents the concomitant loss of genetic diversity as populations migrated in an eastward direction from Africa. Solid horizontal lines indicate gene-flow between ancestral human populations and the dashed horizontal line indicates recent gene-flow between Asian and Australian/Melanesian populations."

Campbell and Tishkoff (2010) Figure 2. The Recent African Origin model of modern humans and population substructure in Africa.

Discussion

This whole approach to the analysis of human history presupposes that races exist as more-or-less distinct lineages, which is an idea that not all anthropologists support. Genomically, humans seem to form what might be called fuzzy clusters, rather than discrete groups with sharp boundaries (Novembre et al. 2008; Lao et al. 2008). Inter-breeding is predominantly within the clusters, due to geographical and social isolation, with relatively little inter-breeding between the clusters. This creates a situation where gene-based distinctions between "races" seem to be obvious to casual observers but where more detailed analysis reveals considerable complexity. This results from the evolutionary history being a network, not a tree.

So, this raises a point that anthropologists have been struggling with for some time, and which all network biologists need to address at some stage: Are distinct evolutionary lineages worth recognizing when there is extensive reticulation in a network? From the analysis point of view, the recognition of races is a model, and all models are wrong (because they are simplifications of the real world). However, some models are more useful than others. So, the question can be re-phrased as: Is the recognition of distinct evolutionary lineages a worthwhile model for interpreting a reticulated network? After all, the lineages may not form nested phylogenetic clusters, which is historically the basic criterion for recognizing them.

Domesticated organisms provide other classic examples of genealogical reticulation. We recognize dog breeds, for instance, and we even have an official register of breeds at the Fédération Cynologique Internationale. However, dog breeds form fuzzy clusters rather than discrete groups, with many individual dogs being cross-breeds. In spite of this, a model of fuzzy clusters formed by a reticulate evolutionary history is still considered to be useful by dog breeders and owners. A similar thing can be said about the breeds of horses, cats and cows; and, indeed, also for almost all human-associated species (see Arnold 2009).

In the non-domesticated part of biodiversity, systematists recognize subspecies, which often refer to morphologically distinguishable populations occupying geographically separated areas, but which are not otherwise genetically isolated. These subspecies can also form fuzzy clusters as a result of a reticulate evolutionary history, especially for plants. Once again, this is apparently a useful model, although there is no universal criterion for how much morphological difference it takes to delimit a subspecies.

I have noted before (see this blog post) that using a tree model for the evolutionary history of dog breeds is inappropriate, because of the reticulate inter-breeding. However, the question here goes further than this, and asks about what should be the units of analysis in the first place. If it is the dog breeds, then we are effectively excluding cross-bred dogs from the evolutionary history, unless they themselves form a new breed that is subsequently recognized.

This issue has profound consequences for our view of possible human races. Most of the networks shown above use races as the units of analysis. Modern evolutionary diagrams of human ancestry, on the other hand, are more likely to be based on genetic data from individual people (as shown in the last figure), which does not pre-suppose the existence of races. Races (if they exist) are then an outcome of the analysis, rather than an input. This distinction has been of particular importance for anthropology, where for most of the past century it has been assumed that discrete races exist and can be fitted into a non-reticulating phylogenetic tree (Caspari 2003; Arnold 2009). Even the very language of naming races creates a supposition that those races are "real", and so care is needed.

Historically, studies of race and human evolution have been inextricably linked. One problem with the current discussions about race is the confusion over whether races are sociological constructions or biological ones (Tattersall and DeSalle 2011; Krimsky and Sloan 2011). My point here is that, either way, they are a model of fuzzy clusters formed by a reticulate evolutionary history, at best, rather than being discrete groups. They have clearly been misused in sociology (racism), but are they a useful model in biology (racialism)?

Monday, August 27, 2012

In an earlier post I reported on the creation of an Online Primer of Phylogenetic Networks, which is intended as a simple introduction to networks for those people who already know something about phylogenetic trees. The primer can be read online, or downloaded as a PDF file (for printing) or as an ePub file (for reading on small screens).

Here, I note that the online version of the primer has now been updated with three animations (animated GIF files). These illustrate:

Wednesday, August 22, 2012

Splits graphs are produced by distance-based network methods such as NeighborNet and Split Decomposition, by character-based methods such as Median Networks and Parsimony Splits, and by tree-based methods such as Consensus Networks and SuperNetworks. They are all interpreted in the same way, which is discussed here.

An essential point to understand is that splits graphs are separation networks. That is, the edges in the graph represent separation between two clusters of nodes in the network; or, they split the graph in two. Formally, each edge represents a bipartition (or split) of the taxa based on one or more characteristics.

If there is no conflict in the data then each bipartition is represented by a single edge, and if there are contradictory patterns then each bipartition is represented by a set of parallel edges. The edge lengths represent the relative amount of support in the whole dataset for each of the splits.
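To make the idea of conflict a little more concrete: two splits can both appear in a single tree only if they are compatible, and the usual condition for this is that at least one of the four pairwise intersections of their sides is empty. The following is a minimal sketch of that check. The taxa a to e and the function names are hypothetical, chosen only for illustration, and the code is not taken from any particular network program.

```python
def make_split(side, taxa):
    """Return a split as a pair of frozensets: the given side and its complement."""
    side = frozenset(side)
    return (side, frozenset(taxa) - side)


def is_compatible(split1, split2):
    """Two splits are compatible if at least one of the four
    intersections between their sides is empty."""
    return any(not (x & y) for x in split1 for y in split2)


# Hypothetical taxa, for illustration only.
taxa = {"a", "b", "c", "d", "e"}
s1 = make_split({"a", "b"}, taxa)       # ab | cde
s2 = make_split({"a", "b", "c"}, taxa)  # abc | de
s3 = make_split({"b", "c"}, taxa)       # bc | ade

print(is_compatible(s1, s2))  # both splits fit on one tree
print(is_compatible(s1, s3))  # these two contradict each other
```

Incompatible pairs such as s1 and s3 are the ones that cannot share a tree, which is why a splits graph has to draw them as sets of parallel edges.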

Example

As a simple example, I will use some data about opinion polls prior to a few Australian elections. There are data for nine election years: 1972, 1974, 1975, 1977, 1980, 1983, 1984, 1987, 1990. The data are for the actual winning margin as a result of the election, as well as data for various opinion polls predicting the outcome prior to the election: (i) McNair Survey, (ii) Roy Morgan Research, (iii) Saulwick Poll, and (iv) Other = pooling of Australian National Opinion Polls (data for 6 years), Spectrum (3 years), Newspoll (2 years), Levita (1 year).

I have calculated the Euclidean Distances between the results for the different opinion polls. So, the original data have been reduced to a set of distances between pairs of opinion polls; and it is these distances that are to be displayed by the network.
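As a rough sketch of this step, the distance between two sources is the square root of the sum of squared differences between their values across the elections. The numbers below are hypothetical (three made-up elections and two made-up polls), not the Australian data, and the sketch assumes that every source has a value for every election.

```python
import math
from itertools import combinations

# HYPOTHETICAL winning margins (percentage points) for three elections.
# These are illustrative values only, not the data used in this post.
margins = {
    "Actual": [2.0, -1.0, 4.0],
    "PollA": [3.0, 1.0, 4.0],
    "PollB": [2.0, -4.0, 0.0],
}


def euclidean(u, v):
    """Euclidean distance between two equal-length lists of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# One distance for every pair of sources; this is the input to the network.
distances = {
    (p, q): euclidean(margins[p], margins[q])
    for p, q in combinations(margins, 2)
}

for (p, q), d in distances.items():
    print(f"{p} - {q}: {d:.3f}")
```

The resulting table of pairwise distances, rather than the original margins, is what a distance-based method then takes as its input.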

This is a simple dataset, and so the analyses based on Split Decomposition and NeighborNet turn out to be identical. The resulting network looks like this:

In this case, the network manages to represent all of the distances perfectly; that is, Fit=100%. This is an improvement over trying to represent the data as a tree. For example, the Neighbor Joining tree for these data has Fit=92%, so that 8% of the information cannot be represented in the tree.

The network has five informative splits (bipartitions), each represented by a different set of parallel edges. The remaining five splits are simply shown as the single edges leading to each of the five sources of data. The informative bipartitions are (in order of decreasing support):

Actual Morgan Other | McNair Saulwick

Actual Morgan Saulwick | McNair Other

Actual Morgan | McNair Other Saulwick

Actual Other | McNair Morgan Saulwick

Actual McNair Other | Morgan Saulwick

These bipartitions are each highlighted below, with red representing one partition and blue the other. The weight of each split is also shown, which represents the amount of support there is in the data. This also determines the relative lengths of the edges (greater weight = more support = longer edge).

We can now start to reach some conclusions about the relative success of the opinion polls. For example, note that the three best-supported partitions (bipartitions 1, 2 and 3) associate Actual (the election result) with Morgan (the outcome predicted by Roy Morgan Research). We can thus conclude that this opinion poll has most in common with the election outcome, and thus that it was the most "successful" of the four opinion polls (over the elections from 1972 to 1990).

As noted, the edge lengths in the network represent the relative amount of support in the whole dataset for each of the splits. In this example, because Fit=100% the edge lengths along the shortest paths sum to exactly the original Euclidean Distances in the dataset, which will not always be so for other datasets. For example, the shortest distance from Actual to each of the four opinion polls is the sum of these edge lengths:

Morgan 2.89 = 1.5833+0.4861+0.1875+0.6319

Other 3.89 = 1.5833+0.4931+0.6493+1.1632

McNair 4.25 = 1.5833+0.4931+0.6493+0.4861+0.8125+0.2257

Saulwick 4.88 = 1.5833+0.4861+0.1875+0.4931+0.8125+1.3125

The calculation of the shortest distance from Actual to Saulwick is highlighted in this figure:

Note that there are several shortest paths from Actual to Saulwick — we can take the edges in any order we like so long as we cross each split only once. To go from Actual to Saulwick we have to cross four of the five informative splits, plus two of the other five splits.
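The rule of crossing each split only once means that the network distance between two taxa is simply the sum of the weights of all splits that separate them. The sketch below applies this to the edge lengths quoted above. The assignment of each weight to a particular split is my own inference from those path sums, so treat it as an assumption, and the variable and function names are purely illustrative.

```python
# Lengths of the single edges leading to each source (the non-informative splits).
trivial = {
    "Actual": 1.5833,
    "Morgan": 0.6319,
    "Other": 1.1632,
    "McNair": 0.2257,
    "Saulwick": 1.3125,
}

# Informative splits, given as (one side, weight); the other side is the complement.
# ASSUMPTION: the pairing of weights with splits is inferred from the path sums.
informative = [
    ({"McNair", "Saulwick"}, 0.8125),
    ({"McNair", "Other"}, 0.6493),
    ({"Actual", "Morgan"}, 0.4931),
    ({"Actual", "Other"}, 0.4861),
    ({"Morgan", "Saulwick"}, 0.1875),
]


def network_distance(x, y):
    """Sum the weights of every split that puts x and y on opposite sides."""
    total = trivial[x] + trivial[y]  # the two terminal edges are always crossed
    for side, weight in informative:
        if (x in side) != (y in side):  # the split separates x from y
            total += weight
    return total


polls = ["Morgan", "Other", "McNair", "Saulwick"]
ranking = sorted(polls, key=lambda p: network_distance("Actual", p))
for poll in ranking:
    print(poll, network_distance("Actual", poll))
```

The printed values can be compared with the four sums listed above. Because the order in which the splits are crossed does not affect the total, every shortest path between the same two taxa has the same length.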

Also worth noting is that the pathlengths in the Neighbor Joining tree do not sum to the Euclidean Distances. This is because the Fit<100%. For example, the pathlength from Actual to Saulwick is 4.74 = 1.8733+0.5278+0.6840+1.6539, so that 4.88–4.74 = 0.14 of the distance has been left out.

The pathlengths can also be used to evaluate the relative success of the opinion polls. That is, the network pathlength distance from Actual to Morgan is the shortest, which we can interpret as indicating that Roy Morgan Research was the most "successful" of the four opinion polls. That is, its predictions were the "least different" from the actual election results, across all of the elections.

Finally, there are features of the data that cannot be displayed in the network. The network is a summary only, and not all of the information can be summarized in a line graph! Perhaps the most notable missing information is that the McNair Survey was the only opinion poll to predict any of the election results exactly correctly, which it managed to do twice (in 1974 and 1983).

Monday, August 20, 2012

In a previous blog post I noted that Ernst Haeckel's first phylogenetic trees (published in Generelle Morphologie der Organismen, 1866) were distinctive in that many, if not most, of the labels occur in the spaces between the terminal twigs, like seeds enclosed within a fruit. In contrast, most other phylogenetic trees, both then and now, distinctly label the individual leaves/twigs.

I have, however, come across the Haeckel-like design in at least one recent place: the Darwin's Library blog for 19 April 2010. Note that the figure, reproduced here, is actually a network, as the branches fuse in at least two places, indicating hybridization of ideas.

Mathematically, I guess that we could best interpret the labels as referring to the nodes rather than to either of the two terminal twigs.

This phylogeny contains one other intriguing feature — the branch leading to both "email" and "blog" diverges from the branch leading to "scroll" and "codex", whereas I would have thought that blogs would be related to "newspaper" and "magazine", for example, which are descendants of "codex". Emails are related to letters, which do not appear in the phylogeny, but could, I guess, be on the long branch leading to "email".

Wednesday, August 15, 2012

Evolutionary networks are used in both biology and the social sciences. Furthermore, you will also occasionally find them elsewhere, as a means of displaying historical relationships among objects or concepts. Here are several successful examples of what I mean. In all cases the networks describe known relationships, rather than being inferred.

History of rose cultivars

This first one is a biological example, taken from the web page of Sparlösa Trädgård, a well-known Swedish garden. I interpret this as a hybridization network. It shows part of the history of rose cultivars (in the genus Rosa), with the arrows indicating the origin of most of the different types that one encounters in gardens today. The history is quite accurate, although no specific dates of origin are given; and a more detailed word description can be found at the web page of Kew Botanic Gardens.

Development of ancient helmets

This second one is an example of manufactured design, taken from the book by Peter Connolly (1981) Greece and Rome at War (Macdonald Phoebus Ltd, pp 60-61). The Kegel-Illyrian group of helmets are on the left, and the Corinthian-Chalcidian-Attic group are on the right. Near the top centre is a "cross-breed between the early Illyrian and Corinthian helmets, having more than one characteristic of each." The hybrid type at the bottom is the Attic type, while the one to the right is the Italo-Corinthian. The unconnected helmet is the Thracian type. Note also the parallel evolution of the cutaway for the ears.

History of the U209 family of submarines

This third one is another example of manufactured design. In this case the U214 submarine was developed "on the basis of the proven design principle of the Class 209 family with additional incorporation of innovative features of Class 212A" (quoted from the Howaldtswerke-Deutsche Werft GmbH website). The 214 thus incorporates successful design features from the 209-1400mod submarine, which is the most recent version of the Class 209 family of diesel-electric patrol submarines, and the 212A, which has a hybrid diesel-electric/Air Independent Propulsion system based on fuel cell technology.

History of Linux distributions

The fourth one is an example of ideas rather than objects, taken from the Futurist blog, although it is also available at Wikipedia. It allegedly shows the GNU/Linux Distribution Timeline, but it is actually drawn as a set of genealogies. Note that the networks are linearized to have a central axis, like a tree. The connections of the "side branches" to the "main axis" are somewhat meaningless — the time of origin of the distributions is represented by a dot, and the curved nature of the lines connecting those dots is nothing more than artistic fancy (they should all be vertical lines not s-shaped). These are networks because there is horizontal transfer (ideas added) and recombination (ideas replaced) among the distributions.

Monday, August 13, 2012

It is usually acknowledged that Jean-Baptiste Pierre Antoine de Monet, Chevalier de Lamarck (1744-1829), published an early evolutionary tree (Lamarck 1809). However, his published trees differ from our modern phylogenetic diagrams in having contemporary higher-level taxonomic groups at both the internal and external nodes, so that each tree represents a transformation series among the taxonomic groups. Thus, while his trees were based on the idea of transmutation, they do not match our current type of tree.

The later published trees of, for example, Charles-Hélion de Barbançois, Hugh Edwin Strickland, and Alfred Russel Wallace, followed the style of Lamarck. Other trees published in the first half of the 19th century, such as those of Jean Louis Rodolphe Agassiz, Augustin Augier, Heinrich Georg Bronn, and Edward Hitchcock, were not intended to be evolutionary diagrams, because their authors did not believe in evolution (Ragan 2009; Tassy 2011).

Charles Darwin (1859) is usually credited as being the originator of modern phylogenetic trees, with contemporary taxa at the leaves and ancestors at the internal nodes. Therefore, an answer to the question posed in the title must involve a post-Darwinian person. There appear to be four candidates for who published the first empirical Darwinian tree, in the period 1865-1866, two of them palaeontologists and two comparative morphologists, two with strong religious beliefs and two apparently without, and including one Englishman, one Frenchman and two Germans. I list them here in the probable order of publication.

St George Jackson Mivart (1827-1900)

St George Mivart was a comparative morphologist who was an early British convert to Darwinism, although he later fell out with Thomas Henry Huxley and therefore with Darwin. His work was principally on the comparative anatomy of primates, for which he provided very detailed comparisons of the skeletons of a large number of species, notably in Mivart (1865) and Mivart (1867).

The paper by Mivart (1865) therefore seems to be the first publication to contain an explicitly Darwinian tree. This is ironic, given the fact that Mivart later became one of Darwin's strongest critics. That it took 6 years (from 1859) for a biologist to produce such a tree may reflect the fact that Darwin himself published only a single theoretical sketch, thus leaving others to work out how to apply his ideas to empirical data.

The 1865 paper was read before the Zoological Society of London on 27 June 1865, and then appeared as a regular part of the Society's journal later that year (see Dickinson 2005). It was based on a detailed osteological analysis of the spinal columns of 29 primate genera. As noted by Bigoni & Barsanti (2011): "Not only does he use taxonomic names still largely in use today, but, surprisingly, Homo is not the apex or culmination of evolution ..., in fact it is placed on a lateral diverging branch. This position of humans provides his tree with a particularly modern appearance and is perfectly consistent with the trees or bushes that Darwin drew." Mivart's paper is available from the publishers John Wiley & Sons and also the Biodiversity Heritage Library.

Unfortunately, Mivart's tree does deviate from Darwin's ideas in that the leaves and many of the branches refer to higher taxonomic groups, rather than to species. In this sense his trees look similar to Ernst Haeckel's (see below), although it is doubtful that they were constructed in the same way. Note that Mivart's labels occur along the terminal twigs, rather than at their end, as his contemporaries chose to present them (and as we do today).

Also, being based on different data sets, the 1867 tree (based on the appendicular skeleton, or limbs) does differ in topology from the 1865 one (based on the axial skeleton, or spinal column), thus foreshadowing a problem with phylogeny reconstruction from different data sources that continues to this day (see this later blog post). Mivart explicitly noted in a letter to Darwin (1870): "The diagram in the Pro. Z. Soc. expresses what I believe to be the degree of resemblance as regards the spinal column only. The diagram in the Phil. Trans. expresses what I believe to be the degree of resemblance as regards the appendicular skeleton only" (Darwin Correspondence Project letter 7170). Indeed, in the 1865 paper Mivart also noted that the data for the spinal column "lead to an arrangement of groups and an interpretation of affinities somewhat differing from, yet in part agreeing with, the classification founded on cranial and dental characters".

Mivart's work is discussed in detail by Bigoni & Barsanti (2011). Mivart's views on evolution and theology are presented in Mivart (1871).

From Mivart (1865) p. 592.

From Mivart (1867) p. 425.

Franz Martin Hilgendorf (1839-1904)

Franz Hilgendorf was a palaeontologist, among other zoological pursuits, although he is relatively unknown today. He was one of the first Germans to accept Darwin's ideas (Reif 1986), and he is also credited with being the first to introduce evolutionary theory into Japan (c. 1873) (Yajima 2007). He could also have been the first to publish a Darwinian tree, but he did not do so.

Hilgendorf's PhD work was on the fossil gastropods of the middle Miocene basin at Steinheim, in southern Germany, which he visited in 1862. He studied the morphological variation, in the different stratigraphic layers, of the various fossil forms of what he referred to as Planorbis multiformis. The resulting thesis (Hilgendorf 1863) was passed in April 1863 but was otherwise unpublished, and it apparently contained no images. Nevertheless, Hilgendorf discussed in detail the relationship between a complete stratigraphic series of fossils and Darwin's evolutionary ideas, concluding that the Planorbis fossils could be arranged in a phyletic tree; and Reif (1983) found that Hilgendorf's notes did, indeed, contain a preliminary phylogenetic diagram. Reif (1983) presents a version of this phylogeny based on Hilgendorf's notes, which is also reproduced by Janz (1999).

Hilgendorf may thus have been the first to produce a Darwinian tree, even though he did not publish it. His ideas were Darwinian by including ancestral and descendant forms, splitting of lineages, and gradual transition between forms, with the ancestral taxa being varieties, not higher taxa. Interestingly, in the thesis Hilgendorf also raised the possibility that two of the lineages may have fused. He noted: "This does not fit the nice picture of a tree with many branches that Darwin presented to illustrate the descent of species — the branches of a tree never fuse again" [translation taken from Janz 1999].

Hilgendorf then made another excursion to the Steinheim basin in 1865, and wrote up the results for publication, this time with an explicit tree showing the relationship between the 19 different fossil forms that he recognized. This was read as a paper before the Royal Prussian Academy of Sciences on 19 July 1866, and was apparently published simultaneously as an offprint (Hilgendorf 1866). The paper then appeared as a regular part of the Academy's journal (Hilgendorf 1867) — these two versions are evidently identical save only the absence of the subtitle in the latter [which translates as "an example of morphological change through time"]. This is thus Hilgendorf's first published Darwinian tree.

There are actually two versions of the tree in the paper, as shown here, taken from the Biodiversity Heritage Library. The first tree emphasizes which stratigraphic layers each morphological form occupies (there are ten layers), whereas the second tree emphasizes the forms themselves. There is no suggestion of lineage fusion in either tree.

Hilgendorf's work is discussed in detail by Reif (1983), and more generally by Janz (1999).

From Hilgendorf (1867) p. 479.

From Hilgendorf (1867) after p. 502.

Jean Albert Gaudry (1827-1908)

Albert Gaudry was a palaeontologist who was one of the very few French scientists to promote Darwinian evolution. Indeed, Darwin noted in a letter to Jean Louis Armand de Quatrefages de Bréau (1870): "It is curious how nationality influences opinion: a week hardly passes without my hearing of some naturalist in Germany who supports my views, & often puts an exaggerated value on my works; whilst in France I have not heard of a single zoologist except M[onsieur] Gaudry (and he only partially) who supports my views" (APS 379; An Annotated Calendar of the Letters of Charles Darwin in the Library of the American Philosophical Society 1799-1882, p. 212).

The paper by Gaudry (1866) was a separately paginated offprint of the second chapter of part 1 (pp. 325-370) of a larger work about the fossil mammals from the late Miocene locality of Pikermi in Attica, in Greece, which was completed in 1867 (Animaux Fossiles et Géologie de l’Attique d’après les recherches faites en 1855–56 et en 1860 sous les auspices de l’Académie des sciences). In this offprint Gaudry expressed his views on palaeontology and evolution. He noted that the Pikermi fossils showed characteristics of two or more groups of animals, so that he could see the passage from order to order, family to family, genus to genus, and species to species in these intermediate forms.

He included five trees showing the relationships among different groups of extant and extinct fossil mammals, within a stratigraphic framework. The pictures shown here are taken from Google Books. I do not know exactly when this offprint was published, but Darwin acknowledged on 17 September 1866 that he received it "some time ago", so that it might pre-date Hilgendorf's own offprint.

As emphasized by Tassy (2006, 2011), Gaudry's trees were Darwinian by including ancestral and descendant species, splitting, gradualism and extinction, with the ancestral taxa being species or sub-species, not higher taxa. However, Gaudry did not fully embrace Darwinism, for religious reasons. As Darwin noted in a letter to Gaudry thanking him for a copy of the 1866 offprint: "I will venture to make one little criticism, namely that you do not fully understand what I mean by 'the struggle for existence', or concurrence vitale; but this is of little importance as you do not at all accept my views on the means by which species have been modified." (Darwin Correspondence Project letter 5213). Gaudry attributed evolutionary change to God, rather than to natural selection, as indicated in the closing sentence of his 1866 work: "Mais, nous n'en douterons pas, l'artiste qui pétrissait était le Créateur lui-même, car chaque transformation a porté un reflet de sa beauté infinie." [which translates roughly as "But, we shall not doubt it, the artist who did the moulding was the Creator himself, for each transformation has borne a reflection of his infinite beauty"].

Gaudry's work is discussed in detail by Tassy (2006), if you read French, and more briefly by Tassy (2011), if you do not.

From Gaudry (1866) p. 36.

From Gaudry (1866) p. 38.

From Gaudry (1866) p. 41.

From Gaudry (1866) p. 44.

From Gaudry (1866) p. 46.

Ernst Heinrich Philipp August Haeckel (1834-1919)

Ernst Haeckel is best known today as a comparative morphologist, but he was also an important popularizer of science, as well as a brilliant artist. He was an early German convert to Darwinism, and it has been noted that "more people by the turn of the century had learned of evolutionary theory through Haeckel's depictions than even from Darwin's own writings" (Richards 2011).

Haeckel actually coined the word "phylogeny" (along with many others, including "ontogeny" and "ecology"), and his first phylogenetic trees were published in the second volume of his two-volume opus about animal morphology (Haeckel 1866). Haeckel had the ambitious plan to reform the study of morphology, by synthesizing Darwin's ideas on genealogical descent with the transformational evolutionism of Lamarck, along with the German tradition of Naturphilosophie (represented by Johann Wolfgang von Goethe). As noted by Hopwood (2006), for Haeckel: "evolution was the organizing principle of a cosmic synthesis that would unify science, religion, and art on a biological foundation."

There were eight trees in the book, showing the relationships between animals, plants and (for the first time) protists, and within plants and different animal groups. Haeckel used morphology to reconstruct the phylogenetic history of animals, and in the absence of fossils used embryology as evidence of ancestors. The pictures here are taken from the Biodiversity Heritage Library. These are frequently credited as being the first phylogenetic trees published, although Mivart, at least, published earlier. Haeckel claimed to have started the book "several years" before 1864, which is when he apparently started work on the phylogenetic trees (as he mentions in a letter to Darwin), but the Foreword is dated 14 September 1866.

Unfortunately, Haeckel's tree-construction method seems to have owed more to Lamarck than to Darwin (see Dayrat 2003), with the branches indicating morphological transformation among the named groups rather than strictly representing genealogy. Moreover, the trees show higher taxonomic groups at the internal branches, while Darwin treated them as representing extinct species. Thus, it is not clear just how Darwinian Haeckel really was. Indeed, Di Gregorio (2005) has noted: "Haeckel's view of evolution (or rather evolutionism) ..., from the very beginning, reminds one more of Jean-Baptiste Lamarck than Darwin."

One intriguing detail about Haeckel's early trees is that many, if not most, of the labels occur in the spaces between the terminal twigs, like seeds enclosed within a fruit. All of the trees shown above distinctly label the leaves, even though it is likely that their presentation formats were derived independently of each other. Haeckel, on the other hand, appears to be much more vague about exactly what is being labelled. Perhaps this is a by-product of the fact that his images are distinctly tree-like in form, rather than being stick figures; or perhaps it comes from the rather speculative nature of many of the relationships proposed (the trees of Mivart, Hilgendorf and Gaudry were based on detailed empirical data, whereas Haeckel's were much more ambitiously hypothetical).

It is perhaps also worth noting that Haeckel first publicly endorsed Darwin's theory in his work on the Radiolaria (Haeckel 1862). On page 234 of that work (see the Biodiversity Heritage Library) he produced what he called a "Verwandtschaftstabelle der Familien, Subfamilien und Gattungen der Radiolarien", which is thus his first attempt at a genealogical diagram. It was not drawn as a tree, and is thus somewhat hard to interpret, but in the same chapter he discussed ancestral and transitional forms, and on pages 231–232 he made clear that he was attempting to implement Darwin's ideas.

Haeckel's life and work are discussed in detail by Di Gregorio (2005) and Richards (2008).

From Haeckel (1866) Taf. I.

From Haeckel (1866) Taf. II.

From Haeckel (1866) Taf. III.

From Haeckel (1866) Taf. IV.

From Haeckel (1866) Taf. V.

From Haeckel (1866) Taf. VI.

From Haeckel (1866) Taf. VII.

From Haeckel (1866) Taf. VIII.

References

Bigoni, F., Barsanti, G. (2011) Evolutionary trees and the rise of modern primatology: the forgotten contribution of St. George Mivart. Journal of Anthropological Sciences 89: 93–107.

Darwin, C. (1859) On the Origin of Species by Means of Natural Selection, or the preservation of favoured races in the struggle for life. John Murray, London.

Reif, W.-E. (1983) Hilgendorf's (1863) dissertation on the Steinheim planorbids (Gastropoda; Miocene): the development of a phylogenetic research program for paleontology. Paläontologische Zeitschrift 57: 7–20.

Reif, W.-E. (1986) The search for a macroevolutionary theory in German paleontology. Journal of the History of Biology 19: 79–130.

Richards, R.J. (2008) The Tragic Sense of Life: Ernst Haeckel and the struggle over evolutionary thought. University of Chicago Press, Chicago.

Wednesday, August 8, 2012

In an earlier post I presented an evolutionary network showing the history of the various software and hardware components of the revolutionary Xerox 8010 "Star" computer. This network is shown below.

This network summarizes how various systems related to the Star have influenced one another over the years. Time progresses downwards (as indicated); double arrows indicate direct successors (i.e. follow-on versions), while single arrows indicate "influence" on the subsequent ideas. It is thus interpretable as a network showing hybridization and possibly horizontal transfer. This makes it a valid evolutionary diagram, one that neatly and succinctly displays the historical patterns of "descent with modification".

Unfortunately, a recent attempt to update this diagram is not successful at all.

Note that this version loses most of the important components of an evolutionary diagram:
(i) there is no indication of the time direction
(ii) the edges are not directed
(iii) the bold lines are not explained as being direct descent.
Furthermore, in the accompanying text the (internal) nodes are referred to as "leaves". The authors have thus turned a directed graph into an undirected one, and have thereby created something with only ambiguous interpretation. This network no longer makes evolutionary sense, although it intends to display evolutionary information.
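The loss of information can be illustrated with a minimal sketch. The node names and edges below are hypothetical placeholders, not taken from the Star diagram; the point is only that two different histories become indistinguishable once direction and edge type are discarded.

```python
# Hypothetical toy network: names and edges are illustrative only,
# NOT the actual contents of the Star diagram.
DIRECTED_EDGES = [
    ("A", "B", "successor"),  # double arrow: B is the follow-on version of A
    ("A", "C", "influence"),  # single arrow: A influenced C
    ("B", "C", "influence"),
]


def ancestors(node, edges):
    """Return every node that has a directed path leading to `node`."""
    parents = {}
    for src, dst, _kind in edges:
        parents.setdefault(dst, set()).add(src)
    found, stack = set(), [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def to_undirected(edges):
    """Drop direction and edge type, keeping only which nodes are linked."""
    return {frozenset((src, dst)) for src, dst, _kind in edges}


# The same links with every arrow reversed: a completely different history.
reversed_edges = [(dst, src, kind) for src, dst, kind in DIRECTED_EDGES]

print(ancestors("C", DIRECTED_EDGES))   # C descends from both A and B
print(ancestors("C", reversed_edges))   # here C has no ancestors at all
print(to_undirected(DIRECTED_EDGES) == to_undirected(reversed_edges))
```

In the directed version the question "what came before C?" has a definite answer; in the undirected version the two opposite histories collapse into the same picture.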

The authors are apparently well-intentioned, and they are trying to use an appropriate form of information display, but they have failed to appreciate the importance of using a directed versus an undirected graph when displaying historical relationships.

Monday, August 6, 2012

For the past two weeks I have been moonlighting as the Guest Blogger over at the Scientopia blog.

Most of the blog posts have been about ambiguity in the way that phylogeneticists present phylogenies (visually and verbally), both to each other and to the general public. This seems to me to be an important topic. The penultimate post in that series lists most of the research papers covering this issue, for anyone who would like to read up on it.

The other four posts cover diverse topics that I have thought about lately, which may be of more general interest:

Wednesday, August 1, 2012

I noted in an earlier blog post that phylogenetic analysis is used outside of biology, notably to study language evolution and cultural evolution. What is perhaps less well known is that it has also been suggested as applicable in the physical sciences, specifically to the "evolution" of galaxies (Keel 2002; Fraix-Burnet et al. 2003), which is called "astrocladistics". As noted by Fraix-Burnet (2004): "Assuming branching evolution of galaxies as a 'descent with modification', the concepts and tools of phylogenetic systematics widely used in biology can be heuristically transposed to the case of galaxies."

That is, a galaxy is a collection of stars, gas and dust, and galaxies change through time as a result of changing proportions of these components with different characteristics. This can be seen as analogous to variational evolution in biology, where changing proportions of individuals through time lead to evolutionary change within species. Since galaxy diversity can be expected to organize itself in a hierarchy (Fraix-Burnet et al. 2006b), a hierarchical diagram such as a tree would be appropriate for displaying galaxy "morphology" and history.

I am not sure that this isn't perhaps taking the analogy a bit too far, in the sense that in biology, language and culture there is inheritance of derived character states from generation to generation, whereas for galaxies the stars etc. undergo continuous physical change. Therefore, the logic of phylogenetic analysis, which ensures that there is biological meaning to the mathematical summary produced by a phylogenetic analysis, does not directly apply in the case of galaxies. One can claim that any change through time is "evolution", but that does not make it the same as "biological evolution".

Nevertheless, one can certainly apply a phylogenetic analysis to data for galaxies, as demonstrated by Fraix-Burnet et al. (2006a, 2006b, 2006c). All one has to do is break up the continuous astrophysical measurements (e.g. electromagnetic spectra, such as broadband magnitudes) into discrete character states, which are treated as ordered in the analysis (i.e. a change between two adjacent states costs less than a change between distant states), and then feed the resulting matrix into a tree-building program.
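The difference between ordered and unordered character states can be sketched as follows. This assumes the usual convention that the cost of a change between ordered states is the number of adjacent states crossed; the function names are illustrative and do not come from any tree-building program.

```python
def ordered_cost(state_a: int, state_b: int) -> int:
    """Ordered character: cost is the number of adjacent states crossed."""
    return abs(state_a - state_b)


def unordered_cost(state_a: int, state_b: int) -> int:
    """Unordered character: any change at all costs a single step."""
    return 0 if state_a == state_b else 1


# Compare a change between neighbouring states with one between distant states.
for a, b in [(2, 3), (0, 7)]:
    print(a, b, "ordered:", ordered_cost(a, b), "unordered:", unordered_cost(a, b))
```

Under the ordered scheme a jump from state 0 to state 7 is penalized more heavily than a jump from 2 to 3, which is what makes it suitable for binned continuous measurements.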

Fraix-Burnet et al. (2006a) did this for some data simulated by GALICS (Galaxies In Cosmological Simulations), which they say "is a hybrid model for hierarchical galaxy formation studies, combining the outputs of large cosmological N-body simulations with simple, semi-analytic recipes to describe the fate of the baryons within dark matter haloes". From the simulation they chose 10 galaxies (labelled A–J) at 5 different epochs (i.e. steps in the simulation, corresponding to a redshift of 3, 1.9, 1.0, 0.4 and 0), for a total of 50 "taxa". They used 91 "characters", each broken into eight states "by regularly binning the corresponding range of values among all galaxies". These characters included mass, component radius, rotation speed, dynamical time, and star formation rates, but most of them referred to magnitudes for different broadband filters at different wavelengths. This matrix was subjected to a maximum-parsimony analysis (using PAUP*).
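As a rough illustration of the discretization step, here is a minimal sketch that assumes "regularly binning" means equal-width bins spanning the observed range of each character. The function name and the sample values are made up for the example; they are not from the paper.

```python
def bin_regular(values, n_states=8):
    """Map continuous values to integer states 0..n_states-1 using
    equal-width bins over the observed range (an assumed reading of
    "regularly binning")."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant character carries no information: one state for all taxa.
        return [0 for _ in values]
    width = (hi - lo) / n_states
    # min() keeps the largest value inside the top bin instead of overflowing.
    return [min(int((v - lo) / width), n_states - 1) for v in values]


# Made-up measurements of one character across six taxa.
values = [0.0, 1.0, 3.5, 4.0, 7.9, 8.0]
states = bin_regular(values)
print(states)
```

Repeating this for each of the 91 characters would give a 50 by 91 matrix of states 0 to 7, which is the kind of input the parsimony analysis works on.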

Sadly, the authors found it difficult to perform a credible phylogenetic analysis. In order to get a fully resolved tree (Analysis 1), the authors had to exclude 11 galaxies out of the 50. Even then, the interpretation of the cladograms was problematic, as the five epochs sampled did not show consistent patterns within the tree (they should follow the same time direction for each of the 10 galaxies).

The authors then noted: "At this stage of the analysis, two options are possible. The first one is to assume that, because galaxies ADEFI and BCGHJ are born with different burst components, they could have two different ancestors ... The second option is to remove the burst characters".

The authors tried both of these suggested analyses. In the first case (Analysis 2), they created two trees, one from each of the two taxon subsets, but they still had to exclude galaxy B2 to get resolved trees. In the second case (Analysis 3), the reduced set of 60 characters yielded a cladogram that required the exclusion of 20 galaxies in order to get a fully resolved tree. These 20 galaxies were then used to produce a second tree. So, in both cases they ended up with two trees, each based on different galaxies.

The authors' conclusion was:
"Among the different results presented in this paper, those shown in [Analysis 3] are clearly the most satisfactory because they are less affected by a priori subjective choice, and the evolutionary scenario represented on the cladograms is astrophysically plausible. On the contrary, the analysis using all characters [Analysis 1] is plagued by doubt on burst characters as galaxy evolution indicators. The other results [Analysis 2] heavily depend on our a priori knowledge of lineages available thanks to the simulations. They thus seem very artificial and cannot be representative of a real data set."

To me, this all seems overly complex. There are clearly multiple patterns in the data, and the first thing to do is find out what they look like. The issues raised here could easily be dealt with using a network as a tool for exploratory data analysis.

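So, I took the dataset as presented in the paper, and performed a NeighborNet analysis. To do this I had to re-code the characters, because the SplitsTree program does not directly deal with ordered character states. So, each character becomes three binary characters, with the states coded as: 0 = 000, 1 = 001, 2 = 010, 3 = 011, 4 = 100, 5 = 101, 6 = 110, 7 = 111. The Hamming distance on the recoded characters then serves as the distance between taxa. With this coding it approximates the ordered distance rather than reproducing it exactly: it equals the number of ordered steps for some pairs of states (e.g. 0 and 1) but not for others (e.g. 3 and 4, which differ in all three bits).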
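The recoding described here can be sketched in a few lines. This is an illustrative reconstruction under the coding listed above, not the script actually used for the analysis, and the helper names are hypothetical.

```python
def to_bits(state: int, width: int = 3) -> str:
    """Binary code for one eight-state character, e.g. 5 -> "101"."""
    return format(state, f"0{width}b")


def recode_row(states):
    """Recode one taxon: every character becomes three binary characters."""
    return "".join(to_bits(s) for s in states)


def hamming(bits_a: str, bits_b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(bits_a, bits_b))


# One hypothetical taxon with three characters in states 0, 5 and 7.
print(recode_row([0, 5, 7]))

# Hamming distance on the codes next to the ordered step count |a - b|.
for a, b in [(0, 1), (0, 7), (3, 4)]:
    print(a, b, "hamming:", hamming(to_bits(a), to_bits(b)), "ordered:", abs(a - b))
```

Printing both columns makes it easy to check, pair by pair, how closely the recoded distance tracks the ordered step count.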

The resulting network reveals the situation that the authors struggled to deal with. One of the largest splits creates two well-defined partitions of the galaxies, with galaxies A,C,D,E,F,I on the right and galaxies B,G,H,J on the left.

As noted by the authors, galaxy B2 is in the "wrong" partition (it is highlighted in the figure). However, we can also clearly see that galaxies B3, B4 and B5 are themselves unusual compared to the other galaxies, as they have a very large split of their own. Galaxy B is thus highlighted as having a very unusual history, which needs to be investigated separately from the other galaxies.

More importantly, the behaviour of galaxy C does not match the authors' a priori subdivision based on different burst components. The authors placed C with the BGHJ group, whereas the network places C2, C3, C4 and C5 with the ADEFI group (as highlighted in the figure). This explains why the authors found Analysis 2 unsatisfactory: their a priori subdivision of the taxa does not quite match the data.

The authors also considered Analysis 3 to be unsatisfactory because a large amount of the data were deleted, making "the total number of significantly discriminant characters somewhat too low to hope to obtain a very robust cladogram for the 50 galaxies." However, the network shows that this is not the real problem. The problem is that the data are not very tree-like, no matter which characters are considered. This suggests that these galaxies have not organized themselves into a hierarchy at all.

Indeed, to me the data make it clear that constructing a phylogeny for galaxy data is not a very useful exercise, at least if this is the sort of dataset that can be expected. Moreover, as the authors note: "The sample used in this paper is made of galaxies that are too simple as compared to the real world." In that case, as a proof of concept this analysis is not very convincing.