Wednesday, January 28, 2015

We don't normally discuss individual papers in this blog (except as example datasets), but today I am simply drawing your attention to what appears to be a little-known paper on phylogenetic networks.

Naruya Saitou has not contributed much to the theory of networks, being instead best known for the development of the neighbor-joining method for phylogenetic trees. (This is the 20th most cited paper ever; see Massive citations of bioinformatics in biology papers.) However, this recent paper is of interest:

The paper presents a new method for detecting ancient recombinations through phylogenetic network analysis. Recent recombinations are easily detectable using alternative methods, although splits graphs can also be used, but older recombinations are more tricky.

I particularly like the opening paragraph of the paper:

The good old days of constructing phylogenetic trees from relatively short sequences are over. Reticulated or "non-tree" structures are omnipresent in genome sequences, and the construction of phylogenetic networks is now the default for describing these complex realities. Recombinations, gene conversions, and gene fusions are biological mechanisms to produce non-tree structures to gene phylogenies, while gene flow is a well known factor for creating reticulations within population phylogenies.

These are heart-warming words from the developer of the most commonly used tree-building method!

Monday, January 26, 2015

It might be nice to live in a world where the mere fact that you are male or female does not attract attention to you within your profession. But while we are waiting for that day, you might like to ask yourself about women in systematics. David Archibald suggests that the tree produced by Anna Maria Redfield is "the first tree – creationist or evolutionary – by a woman and may well be the only such tree by a woman until well into the twentieth century."

Born at the dawn of the 19th century, Anna Maria Redfield earned the equivalent of a master's degree from the first U.S. institution of higher learning devoted to female students: Ingham University, and became perhaps the first woman to design a tree-like diagram of animal life. Although tree-like, her diagram didn't show common ancestry but instead showed the "embranchements" established by Georges Cuvier: vertebrates, arthropods, mollusks, and "radiata" (today classified as cnidarian and echinoderm phyla). To be fair, this diagram was published before Darwin's Origin of Species but later editions of her work made no mention of evolution either. Instead, she wrote about our simian cousins, "The teeth, bones and muscles of the monkey decisively forbid the conclusion that he could by any ordinary natural process, ever be expanded into a Man." Still, her elegant work is great fun to behold even now.

The tree-like diagram (shown in miniature above) was a wall chart (1.56 x 1.56 m) called A General View of the Animal Kingdom, published in 1857 by E.B. and E.C. Kellogg, New York. It is heavily illustrated with images of the taxa, their names, and brief notes: eg. "Man alone can articulate sounds, and is capable of improving his faculties or advancing his condition". Only three lithograph copies of the original tree are now known, one of which was sold at auction by Christie's in 2005 for £7,200.

The following year the same publishers produced a companion volume to the chart, called Zoölogical Science, or Nature in Living Forms: Adapted to Elucidate the Chart of the Animal Kingdom, and designed for the higher seminaries, common schools, libraries, and the family circle (1858, reprinted 1860, 1865, 1874). A copy is available in the Biodiversity Heritage Library. Only 57 original copies of the book are now known.

This book of 743 pages is richly illustrated, the artist being unacknowledged in the first edition but credited as E.D. Maltbie from then on. (He is presumably responsible for the chart as well.) The book has the frontispiece shown below, which is an edited version of the base of the tree.

The wall chart is a masterpiece, with intricate and accurate illustrations of representatives of the animal kingdom portrayed as a Tree of Life, which illuminates the relationships of the major groups of organisms. It is an important document in the study of biology and in the pioneering work of women in science. The wall chart has eloquent phrases, which express a Victorian humanistic view of nature (often intermingled with anthropomorphism, biblical overtones, and the biological superiority of humans).

Redfield's views on evolution are clear from her book, indicating that the relationships shown represent affinity not evolution:

There is no evidence whatever that one species has succeeded, or been the result of transmutation of a former species.

Butts notes that unfortunately Redfield "remains a relatively minor and poorly recorded figure in the history of women in science, let alone biological and evolution studies in general."

Wednesday, January 21, 2015

Charles Darwin's metaphor of the Tree of Life was not a tree, even in The Origin of Species. As noted by Franz Hilgendorf (see The dilemma of evolutionary networks and Darwinian trees) "the branches of a tree do not fuse again", and yet in his book Darwin discusses at least one circumstance when they do precisely that — hybridization.

Darwin's discussion of hybridization occupies all of chapter 8 of the Origin. His stated motivation is to address what many people might see as a fatal objection to his theory of species origins by means of natural selection. One of Darwin's main arguments in the book is that "descent with modification" is continuous, and therefore the distinction between species and varieties (and subspecies, etc) is an arbitrary cut in a continuum of biodiversity. However, it was conventionally accepted that varieties within the same species could cross-breed freely, but any attempt to hybridize distinct species would always fail. Darwin opposes this view by citing extensive evidence showing that varying degrees of sterility are encountered in efforts to cross-breed different species of plants (and a few birds) — if the species are closely related then often there will be a small degree of fertility in the hybrid offspring. So, as two related forms diverge from one another in the course of evolution, their ability to inter-breed gradually diminishes and eventually falls to zero (absolute sterility).

It is important to note that his motivation for writing about hybridization was independent of his ideas about phylogeny. So, he seems not to have noticed the consequence of hybridization for phylogenetic patterns.

This is similar to the situation regarding his so-called "tree diagram", in chapter 4. His motivation for the diagram (the only figure in his book) was a discussion of descent with modification, and particularly the continuity of evolutionary processes. He was expressing his idea about uninterrupted historical connections. In particular, this was part of his concern that there is no fundamental distinction between varieties and species, because evolutionary divergence is continuous — it is all a matter of degree, without sharp boundaries. His Tree of Life image expressed the continuity of evolutionary connections, not phylogenetic patterns. This is clear from his poetic invocation of the biblical Tree of Life, which is about the inter-connectedness of all living things along tree branches, not about patterns of biodiversity.

Implicit in this world view is the idea that the Tree of Life is still a tree in spite of hybridization. That is, Darwin failed to see that his "tree simile" (chapter 4) had to ignore hybridization (chapter 8) in order to work. His figure does not show any evidence of hybridization, only divergence. It was not intended to be what we would now call a phylogeny, but merely an idealized view of divergence and continuity of descent. When introducing the Tree of Life, he was using religious imagery to stimulate the imagination of his readers, and in so doing presented a contradictory argument — there is continuity along the branches as well as continuity of inter-connections.

The alternative conception is that Darwin's Tree of Life was never a tree — it was a network. From this world view, Hilgendorf's dilemma was actually irrelevant. He commented:

An observation which, as far as I know, contradicts these previously discussed views, [would be], that formerly separate species approach each other and finally merge with each other. This would not fit the beautiful image that Darwin presented about the connection of species in a branch-rich tree; the branches of a tree do not fuse again.

Monday, January 19, 2015

The Tree of Life and the Tree of Knowledge are images that have appeared in many cultures throughout the world. They are often combined as a cosmic or world tree, with the tree of knowledge supporting the heavens and earth and the tree of life connecting all living beings. However, the word "tree" is obviously rather nebulous in these images, and it can take many forms.

In the Christian Bible these trees appear in the garden of Eden in a more restricted form as the Tree of Eternal Life and the Tree of Knowledge of Good and Evil. Even here, though, it is not clear whether they are one and the same tree. For example, only one tree is mentioned in the book of Revelation, when promising a new Eden.

The Tree of Knowledge was co-opted in Medieval times as a symbol of learning, and a metaphor for arranging all human knowledge, the Arbor Scientiae (see Relationship trees drawn like real trees). This idea was adopted by biology in the 1700s, where trees were used as metaphors for the relationships among biological species. In modern parlance, these depicted affinity or phenetic relationships, and so they represented knowledge (not life). In the mid 1800s Charles Darwin (in the Origin of Species) took this pre-existing tree idea and instead made it represent evolutionary relationships among species. In the process he re-named it the Tree of Life, thus once again uniting the Tree of Life and the Tree of Knowledge. We have been stuck with the ToL name ever since.

At about the same time as the rise of the Arbor Scientiae, a combined Tree of Life and Tree of Knowledge also appeared as the central mystical symbol of the Kabbalah of esoteric Judaism, consisting of the 10 Sephirot (enumerations). It is shown above in its full modern form. This is a reinterpretation of the Hebrew Bible, conceptually representing a list of the attributes of God (how God emanates).

In the Kabbalist view, both of the trees in the biblical garden of Eden were alternative perspectives of the Sephirot. The 10 Sephirot are arranged into three columns, with 22 Paths of Connection. As a tree, it has roots above and branches below. To quote Wikipedia:

Its diagrammatic representation, arranged in 3 columns/pillars, derives from Christian and esoteric sources and is not known to the earlier Jewish tradition. The tree, visually or conceptually, represents as a series of divine emanations God's creation itself ex nihilo, the nature of revealed divinity, the human soul, and the spiritual path of ascent by man. In this way, Kabbalists developed the symbol into a full model of reality, using the tree to depict a map of Creation.

My main point here is that by combining two conceptual trees this icon is clearly a network, unlike most other conceptual trees such as the dichotomous Tree of Knowledge.

The Kabbalah started without an image, being described solely in words. The diagram of the Tree used by modern Jewish Kabbalists is usually based on the diagram published in the print edition of Rabbi Moses Cordovero's Pardes Rimonim from 1591 [composed 1548], and sometimes called the "Safed Tree". It is shown in the next figure.

One of the earliest illustrations comes from the 1516 Portae Lucis of Paolo Riccio, a Latin translation of Joseph ben Abraham Gikatilla's most influential kabbalistic work, Sha'are Orah (Gates of Light) from the 1300s. It is shown in the next figure.

There are actually two modern versions of the Kabbalah. The one shown here in the first illustration has the crossing diagonals lower down than does the one shown in the second illustration. The one with two diagonals at the bottom is an earlier version that is still favoured by Hermetic Kabbalists. Both made their first public appearance in the Pardes Rimonim.

Wednesday, January 14, 2015

BLAST is a computer program that searches a database for similarity matches to a given query sequence, either DNA or amino acid. It is most commonly used to search the GenBank database for matches to any new sequence that we might happen to have, in the hope that we will find one or more homologous sequences.

To most of us BLAST is a black box, in the sense that we have little idea about the details of how it does what it does. So, maybe we should at least look at what it does, just in case we ever need to know.

About 10 years ago I was working with some EST data. For those of you not old enough to know, ESTs consist of short DNA reads from arbitrary primers. In the hope of identifying the coding gene represented by each EST, BLASTX is used to search the GenBank protein database using each translated nucleotide query (in all six possible reading frames). BLASTX produces an E-value for each matching sequence, representing the strength of the match to the query. An E-value is not a probability (as it can vary from 0 to infinity), but at p=0.050 the expected E-value happens to be E=0.051. There is no consensus for what E-value should serve as indicating a "significant" match.
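The correspondence between p=0.050 and E=0.051 can be checked with a few lines of Python. This is a minimal sketch that assumes the usual Poisson model for chance matches, under which the probability of at least one chance hit is P = 1 - exp(-E). The function names are made up for this illustration and are not part of BLAST.

```python
import math

def evalue_to_pvalue(e):
    """Probability of at least one chance hit when e chance hits are expected
    (assumes that the number of chance hits is Poisson distributed)."""
    return -math.expm1(-e)  # same as 1 - exp(-e), but accurate for small e

def pvalue_to_evalue(p):
    """Inverse of the above: the expected number of chance hits for a given p."""
    return -math.log1p(-p)  # same as -ln(1 - p)

print(round(pvalue_to_evalue(0.05), 3))  # 0.051
print(evalue_to_pvalue(10.0) < 1.0)      # True: P stays below 1 however large E gets
```

For small values the two numbers are nearly identical, which is why they are often used interchangeably, but they diverge quickly once E approaches 1.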

I decided to find out what happens if a DNA query sequence varies in either length or GC content. I used both random sequences (which were thus not in GenBank) as well as real sequences (which were in GenBank). The short answer is that the BLASTX results vary a lot. I never published these results because I figured the first thing a referee would do is ask me to explain BLASTX's behaviour, and I did not have an explanation (and still don't).

I present the results here for what they are still worth. Obviously, the results are not restricted to EST data, but apply any time that we use BLASTX.

Experimentation

The content of GenBank is quite different today to what it was back in late 2003, and so maybe the results will vary if the work were to be repeated. For reference, the first graph shows the GC content of the GenBank protein-coding sequences at the time of my work. Also, it is possible that BLASTX is different as well — I used v. 2.2.6 with default parameters (BLOSUM62, edge correction, length correction, SEG filtering, universal genetic code, gap penalty 11+k). Maybe some intrepid soul will be inspired to find out what happens nowadays.

Random sequences

I generated sets of 1,000 replicate "ESTs" using the Perl script Randseq by M. Raymer (5/27/2003). These sets varied in DNA length (100–1,000 nt) and in GC content (0–94%), but were otherwise random sequences of nucleotides. These sequences are not expected to be homologous to anything already in GenBank, and should thus form BLASTX matches only by random chance.
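For anyone wanting to repeat the exercise, replicate sets of this kind are easy to generate. The following is a minimal Python sketch, not the Randseq script itself. It assumes that each base is drawn independently, being G or C with probability equal to the target GC content, so that individual sequences scatter around the target rather than matching it exactly. The function names, the seed, and the example length and GC content are all arbitrary choices for illustration.

```python
import random

def random_dna(length, gc_content, rng):
    """Return a random DNA string in which each base is independently
    G or C with probability gc_content, and A or T otherwise."""
    bases = []
    for _ in range(length):
        if rng.random() < gc_content:
            bases.append(rng.choice("GC"))
        else:
            bases.append(rng.choice("AT"))
    return "".join(bases)

def gc_fraction(seq):
    """Fraction of the sequence that is G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

rng = random.Random(2003)  # fixed seed, so that the set can be regenerated
replicates = [random_dna(450, 0.40, rng) for _ in range(1000)]
mean_gc = sum(gc_fraction(s) for s in replicates) / len(replicates)
print(len(replicates), len(replicates[0]), round(mean_gc, 2))
```

Each set could then be written to a FASTA file and used as the BLASTX queries.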

The results for varying the sequence length are shown in the next graph, with each point representing the mean E-value observed. The lines represent four somewhat different GC contents; and the anticipated E-value for random data (0.051) is also shown. Clearly, very few points are near the expected value. The lines all show the same shape, with a minimum E-value near 450 nt, and rising slowly with longer lengths and rising rapidly with shorter lengths.

A more detailed assessment of the results for varying the GC content is shown in the third graph. The lines represent two somewhat different sequence lengths; and the anticipated E-value for random data (0.051) is also shown. It is clear that the E-value is capable of varying by up to seven orders of magnitude in response to variation in the GC content of the sequence.

Real sequences

I used the sequences contained in the Poxvirus Orthologous Clusters database (POCs), which used to be available at: http://athena.bioc.uvic.ca/pbr/POCs/pocs.html. This has since been replaced by the Viral Orthologous Clusters database (VOCs). These virus protein sequences are expected to already be in GenBank, and they should thus form good BLASTX matches.

The POCs database could be queried by both sequence length and GC content, and it was the only such database that I could find at the time. For each combination of length (in 50-nt bands) and GC-value (in 10% bands) I gathered a minimum of 20 sequences. There were few sequences for the shortest lengths, so I chopped up the longest sequences (longer than needed) to increase the sample size. There were also few sequences at the greatest GC values, so I used sequence AE004437.1 from GenBank (a Halobacterium sp.) to increase the sample size.
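The banding scheme can be sketched in Python as follows. This is only an illustration of the bookkeeping, not the procedure actually used with the POCs database. The function names and the toy records are invented, and the treatment of band edges (each band includes its lower edge) is an assumption.

```python
from collections import defaultdict

def band(value, width):
    """Lower edge of the band that contains value, eg. 137 -> 100 for width 50."""
    return int(value // width) * width

def bin_sequences(records, length_width=50, gc_width=10):
    """Group (name, sequence) records by (length band in nt, GC band in percent)."""
    bins = defaultdict(list)
    for name, seq in records:
        gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
        key = (band(len(seq), length_width), band(gc_percent, gc_width))
        bins[key].append(name)
    return bins

def underfilled(bins, minimum=20):
    """Bands that still need more sequences to reach the minimum sample size."""
    return sorted(key for key, names in bins.items() if len(names) < minimum)

# Toy records, purely for illustration
records = [
    ("a", "AT" * 60),              # 120 nt, 0% GC
    ("b", "GC" * 30 + "AT" * 30),  # 120 nt, 50% GC
    ("c", "G" * 260),              # 260 nt, 100% GC
]
bins = bin_sequences(records)
print(underfilled(bins))  # [(100, 0), (100, 50), (250, 100)]
```

Bands reported as under-filled are the ones that would need topping up, for example by chopping longer sequences into shorter pieces.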

The results are shown in the final graph, with each point representing the mean E-value observed. The E-values are all small, since they represent actual database matches. Clearly, variation in sequence length can lead to orders of magnitude variation in E-value, while variation in GC content has an effect only at longer sequence lengths.

Conclusions

For a program that is supposed to produce comparable results, no matter what the sequence, these BLASTX results are disquieting. After all, BLAST is one of the most cited programs ever (see Massive citations of bioinformatics in biology papers), and yet I suspect that most people do not realize that it behaves like this.

The random sequences assess the effect of false positives. That they vary so much in E-value is amazing. Clearly, BLASTX E-values are not comparable between sequences. It is interesting that GC content seems to have a bigger effect than sequence length — for any given GC content the effect of length is relatively small for sequences longer than c. 600 nt. However, variation in GC content can produce orders of magnitude of effect at any given sequence length.

The real sequences assess the effect of true positives. That they vary in E-value is also not good — the E-values all represent true database matches (and presumably exact ones). Nevertheless, the effect of variation in sequence length and GC content is repeated for these real sequences. However, variation in GC content only has a large effect for the longer sequences, and instead it is the sequence length that produces the orders of magnitude variation in E-value.

Monday, January 12, 2015

To a modern phylogeneticist the answer to this question is obviously "no". Phylogenetic trees occur in the literature with their root at the top, the left or the bottom, and more rarely on the right. The graph has the same interpretation no matter where the root is placed, as all of the edges are implicitly directed away from the root. The tree can even be circular, with the root in the centre and the tree radiating outwards.

However, this was not always so for genealogies, and indeed this freedom seems to be a product of the past 200 years or so. The history of tree orientation has been discussed in detail by Christiane Klapisch-Zuber (1991. The genesis of the family tree. I Tatti Studies in the Italian Renaissance 4: 105-129).

Originally, genealogies were drawn with the root at the top, as shown in previous blog posts: The first royal pedigree, and The first known pedigree of a non-noble family. These pedigree trees (ie. genealogies of individuals) have a particular ancestor at the root of the "tree", so that the tree expands forwards in time down the page, to increasing numbers of descendants at the leaves (ie. a "descent tree"). This made linguistic sense, because people "descended" from the ancestor down the page. In European languages pages are read top to bottom, and so the natural reading order was the same as the time sequence.

However, this arrangement makes no sense if one refers to the graph as a "tree". Trees have their root at the bottom, not the top. Trying to draw the pedigree as a tree while retaining the original orientation could lead to unusual results, as shown in the first figure, from the end of the 1300s CE (from Universitätsbibliothek, Innsbruck, ms. 590, folio 116r). This is actually an Arbor Consanguinitatis rather than an empirical pedigree — it shows the various relatives of a nominated individual (the man pictured in the center) and their degree of relationship to that person. These diagrams have been used to compute which relatives can marry without committing incest, or which can inherit if a person dies intestate. Jean-Baptiste Piggin, at his web site Macro-Typography, has noted that the earliest known examples are from the 400s CE.

In order to match a real tree, the genealogy has to be read from bottom to top. This implies an ascent through time, instead, with a spreading out of the family upwards through time.

The first known empirical pedigree in which the ancestor is at the base is the Genealogia Welforum, the pedigree of a dynasty of German nobles and rulers (Dukes of Bavaria, and Holy Roman Emperors, successors of the Carolingians). The earliest known example, drawn as part of the Historia Welforum [Welf Chronicle], is shown in the second figure (from Hessische Landesbibliothek, Fulda, ms. D.11 folio 13v). The original text version of the pedigree is dated 1167-1184 CE, with the miniatures added sometime from 1185-1191 CE.

Clearly, this diagram is only sketchily like a tree, with many of the people placed along the main trunk, and medallions hanging off for other relatives. This seems to arise from the pedigree's origin as prose, and the subsequent literal illustration of that prose.

The ancestor is labeled "Welf Primus", and he apparently lived in the time of Charlemagne (the best known of the Carolingian dynasty). The empty space at the top of the chart was apparently intended for a picture of Emperor Frederick I Barbarossa, of the House of Hohenstaufen. The woman at the top right is Henry the Black's daughter Judith, who was the mother of Barbarossa. Intriguingly, the final bend of the Welf trunk to the left, combined with Barbarossa at the top, seems to imply that it is the descendants of Barbarossa who continue the Welf lineage, rather than those bearing the Welf name.

Historically, it seems to have been the proliferation, after about 1200 CE, of illustrations of the biblical Tree of Jesse that popularized the idea of "pedigrees as trees". The next figure shows such a tree from c. 1320 CE (from a Speculum Humanae Salvationis manuscript, Kremsen ms. 243/55). Jesse lies at the base of the tree, and the tree actually arises from him. His descendants then ascend to Jesus, shown at the crucifixion, with Heaven illustrated at the top. The tree thus uses Christ's pedigree to symbolize the ascent of humans to heaven (via his crucifixion), rather than simply the descent of humans through time. That is, the tree correctly represents ascent (as well as descent).

This leaves us contemplating just when we added the final twist to the iconography, by putting a single descendant at the base of the tree, and having the ancestors branching out above as leaves (ie. an "ascent tree"). This means that time flows from the top to bottom of the figure, even though the tree is oriented from bottom to top. This is quite illogical as an analogy, given that the base of a real tree is the origin of its growth (see Goofy genealogies). This particular iconography is not used for phylogenies but is very commonly used for pedigrees.

I have no idea when this first occurred. However, David Archibald (2014. Aristotle's Ladder, Darwin's Tree: The Evolution of Visual Metaphors for Biological Order. Columbia University Press) draws attention to a very tree-like pedigree of Ludwig (Louis III), fifth Duke of Württemberg, from the late 1500s, shown here as the final figure (from Württembergisches Landesmuseum, Stuttgart). Ludwig is at the base of the tree, and ironically he had no descendants (although he married twice). His parents are above him in the tree (Christoph, Duke of Württemberg, to the left, and Anna Maria von Brandenburg-Ansbach, to the right), followed by four further ancestral generations. Note the leaves and hanging fruits, which highlight the tree metaphor.

Wednesday, January 7, 2015

Sometimes there has been discussion about the structural complexity of phylogenetic networks. At one extreme, species phylogenies are seen as trees with occasional reticulations, and at the other end there is a whole cobweb of reticulations with no visible tree. In this context, comments are sometimes made about the plausibility of those outputs from network programs that show extensive gene flow. If a biologist does not believe that the history of "their" organisms involves extensive reticulation, then the algorithmic outputs might be dismissed as unrealistic.

Here I present one well-known example of extensive hybridization, in which the computer programs seem to agree on the same complex solution — the history of common bread wheat.

The hybridization network shown above is a montage of two different phylogenies from the original paper. It shows four splits, one homoploid hybridization, and two polyploid hybridizations. The time is shown in the circles in units of millions of years (note that the scale is not linear).

The first split (6.5 million years ago) is between the genera Triticum (wheat) and Aegilops (goatgrasses), which are morphologically highly distinct, with Aegilops having rounded glumes rather than keeled glumes. There are currently c. 20 recognized species in both Aegilops and Triticum, so only a small part of the diversity is shown in the network.

Domesticated bread wheat (T. aestivum) is a hexaploid species, with the three diploid genomes being known as A, B and D. Their lineages are labeled and colored in the network diagram. The genome D lineage is the result of a homoploid hybridization (which has been taxonomically treated as part of Aegilops). Bread wheat is then the recent result of two successive allopolyploid hybridizations, with a tetraploid lineage as the intermediate.

Of the other species shown in the network, all of the goatgrasses are wild diploid species, as is T. uartu. T. monococcum is also diploid, with domesticated Einkorn wheat being derived from the wild ancestor. T. turgidum is a tetraploid species, with domesticated Emmer wheat being derived from the wild ancestor — it has recently diversified into many modern wheat species.

This is one of the most complex phylogenetic networks known, although that complexity is at least partly the result of leaving out most of the other diploid species in the Triticum and Aegilops clades. Program outputs that are more complex than this are unlikely to be realistic.

Sunday, January 4, 2015

Networks are visually more complicated than trees, because there are extra edges representing reticulate relationships. Technically this means that some of the nodes have in-degree >1, and that there are one-to-many connections among these nodes. This can create visual clutter. I recently presented one simple way that might alleviate this (Circular phylograms for phylogenetic networks).
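The in-degree criterion is simple to check computationally. Here is a minimal Python sketch, assuming that the network is stored as a list of directed (parent, child) edges. The node names describe a made-up toy network with a single hybrid, not any real dataset.

```python
from collections import Counter

def reticulation_nodes(edges):
    """Return the nodes with in-degree > 1, ie. those with more than one parent."""
    indegree = Counter(child for _parent, child in edges)
    return sorted(node for node, degree in indegree.items() if degree > 1)

# Toy network: H has two parents (x and y), whereas A and B have one each
edges = [
    ("root", "x"), ("root", "y"),
    ("x", "A"), ("x", "H"),
    ("y", "H"), ("y", "B"),
]
print(reticulation_nodes(edges))  # ['H']
```

A tree is then simply the special case in which this list is empty.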

Another possibility is to add to the network what are called meta-nodes. These meta-nodes represent groups of nodes, so that the edges between the meta-nodes and the other nodes can represent different types of relationship. This reduces the one-to-many connections in the graph.

As pointed out by Elijah Meeks at the Digital Humanities Specialist blog, pedigrees represent a neat example of this concept. In this example, there are several types of traditional relationship that can be represented: husband, wife and child. Since these relationships are explicitly shown (ie. the direction of the relationship is explicitly shown), the figure can be drawn unrooted.

The example shown here (reproduced from Meeks' post) has the meta-nodes in grey, each representing a family. These nodes are unlabeled, while the person-nodes are labeled with the person's name and noble title. Females have pink nodes, and males blue ones. The edges connecting them to the grey nodes are colour-coded as: blue = husband, pink = wife, orange = child.

So, for example, the right-hand family node indicates that Charles I and Henrietta Maria were husband and wife, and that they had three children: Mary Henrietta, James II and Charles II.
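The effect on the number of edges can be seen by encoding this family in two ways. The Python sketch below is an illustration only. The "direct" encoding (one spouse edge plus an edge from each parent to each child) is an assumption about how the pedigree might otherwise be drawn, and the identifier family-1 is invented, since the meta-nodes in the figure are unlabeled.

```python
def direct_edges(husband, wife, children):
    """Person-to-person edges: one spouse edge, plus parent-to-child edges."""
    edges = [(husband, wife, "spouse")]
    for child in children:
        edges.append((husband, child, "parent"))
        edges.append((wife, child, "parent"))
    return edges

def metanode_edges(family, husband, wife, children):
    """Every person is linked once to the family meta-node, by a typed edge."""
    edges = [(husband, family, "husband"), (wife, family, "wife")]
    edges += [(child, family, "child") for child in children]
    return edges

children = ["Mary Henrietta", "James II", "Charles II"]
direct = direct_edges("Charles I", "Henrietta Maria", children)
meta = metanode_edges("family-1", "Charles I", "Henrietta Maria", children)
print(len(direct), len(meta))  # 7 5
```

With n children the direct encoding needs 1 + 2n edges and the meta-node encoding needs 2 + n, so the saving grows with family size, at the cost of one extra node per family.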

In this case, the reduction in one-to-many connections does make the relationships more clear, so that interpretation is easy. However, it potentially makes the network more complicated (as Meeks notes) because of "just how tangled up certain families can be" — adding the extra meta-nodes exacerbates the tangling. Meeks provides another example in his blog post.