Tuesday, April 26, 2016

Phylogenetic methods have been applied to all sorts of research fields, including biology, linguistics, stemmatology and archaeology. There are many posts in this blog discussing examples of these applications, both good and bad.

However, some time ago a paper appeared that tried to apply these methods to data, instead:

The authors do a creditable job of describing phylogenetics for the uninitiated, but I am not convinced that their empirical application to "digital objects" works particularly well.

They describe their application as follows:

The digital objects under examination are different versions of the International Comprehensive Ocean and Atmosphere dataset (ICOADS).

ICOADS data consist of marine surface measurements and observations (e.g. sea-surface temperature, sea-level pressure, wave swell, wind direction, etc.) that have been digitized from historical ship logs, or taken from floating buoys. As a result of the broad time periods that the dataset covers (approximately 450 years, 1662–2014) the quality and reliability of the data varies considerably.

Much like a piece of software, ICOADS is an evolving dataset with intermittent releases. Version 1.0 – called simply COADS – was publically [sic] released in 1987, and contained almost 100 million historical observations starting in 1854 and continuing to 1979.

Thus, understanding the ways in which ICOADS evolved into new versions, and gave rise to "offspring" datasets over a thirty-year period is the focus of the case study presented below.

The problem here is that there is no implication that any of these characters are phylogenetically informative (ie. inherited), and thus that shared features might represent synapomorphies. In applications to linguistics, stemmatology and archaeology, on the other hand, it is at least likely that shared similarities might represent synapomorphies.

Given these data, the analyses cluster the datasets based on similarity — indeed, the authors explicitly refer to their tree-based analyses as "clustering algorithms". However, this form of analysis does not necessarily reveal history, in the sense that none of the analyses are explicitly historical. Historical patterns will be included in the outcome, but they will not necessarily be separable from patterns resulting from any other source. The resulting groups of datasets may or may not have historical meaning. The authors do, however, have a series of hypotheses (the groups) that can now be subject to scrutiny for possible historical interpretations.
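
To make the distinction between similarity and history concrete, here is a minimal sketch in Python. It assumes four made-up dataset versions scored for five presence/absence characters; the names, the character values and the choice of average-linkage clustering are all my own illustrative assumptions, not the data or the method of the paper.

```python
from itertools import combinations

# Hypothetical presence/absence characters for four made-up dataset versions.
# The names and values are illustrative only, not taken from the paper.
characters = {
    "v1": (1, 0, 0, 0, 1),
    "v2": (1, 1, 0, 0, 1),
    "v3": (0, 1, 1, 1, 0),
    "v4": (0, 1, 1, 0, 0),
}

def hamming(a, b):
    """Number of characters at which two datasets differ."""
    return sum(x != y for x, y in zip(a, b))

def mean_distance(data, c1, c2):
    """Average pairwise distance between the members of two clusters."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(hamming(data[a], data[b]) for a, b in pairs) / len(pairs)

def average_linkage(data):
    """Agglomerative clustering; returns the merges in the order they occur."""
    clusters = [frozenset([name]) for name in sorted(data)]
    merges = []
    while len(clusters) > 1:
        # Join the two clusters that are most similar on average.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: mean_distance(data, *pair))
        merges.append((sorted(c1), sorted(c2), mean_distance(data, c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges

merges = average_linkage(characters)
for left, right, d in merges:
    print(left, "+", right, "at mean distance", d)
```

The groups come purely from counting shared character states. Nothing in the procedure knows which version was derived from which, so the same groups would appear whether the similarities were inherited, copied between lineages, or arrived at independently.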

For our purposes it is also worth noting that the authors do recognize one limitation of their analytic approach when applied to datasets:

A purely tree-based phylogenetic approach is also incapable of showing the exchange of traits between different lineages of digital objects, or cases in which several organisms merge into one; thus a reticulating network may be needed in lieu of a bifurcating tree.

Tuesday, April 19, 2016

The Online Etymology Dictionary indicates that the English-language expression "Family tree" in the sense of "graph of ancestral relations" is first attested from 1752, in the novel A Genuine Account of the Life and Transactions of Howell ap David Price (which is available in Google Books).

Such pedigree diagrams have a much longer history, of course, but they were not called family trees, nor were they drawn with any particular tree-like imagery (except for the religious Tree of Jesse, pictures of which started appearing in the 10th century). See, for example:

Ernest H. Wilkins (1925. The genealogy of the genealogical trees of the Genealogia deorum. Modern Philology 23: 61-65) has suggested that the first person to draw a secular genealogy as a tree might be the Italian author and poet Giovanni Boccaccio (1313-1375), in his Genealogia Deorum Gentilium (On the Genealogy of the Gods of the Gentiles).

This Renaissance book was an "encyclopedic compilation of the tangled family relationships of the classical pantheons of Ancient Greece and Rome" (according to Wikipedia). It was written in Latin, apparently starting in c. 1350, and then continuously corrected and revised until the author's death. In c. 1370 an apograph [ie. perfect copy] was made of an autograph manuscript [ie. in the author's own hand], and from that first apograph other copies were made.

The 1370 autograph is not known to still exist; but a second autograph manuscript, showing later revisions, is in the Laurentian Library in Florence (MS. LII, 9). There are some three dozen extant apographs from the 1300s and 1400s, all based on the lost first autograph. The first printed edition was produced in Venice in 1472, followed by an edition of 1473 printed in Leuven. At least seven other editions appeared during the 1400s and 1500s. A French translation was published in Paris in 1498, and an Italian translation appeared in Venice in 1547. (See Ernest H. Wilkins. 1919. The genealogy of the editions of the Genealogia Deorum. Modern Philology 17: 425-438.)

The illustrations shown here are from various versions of the book.

Wilkins (1925) notes:

The extant autograph manuscript of the Genealogia Deorum of Boccaccio is illustrated by thirteen genealogical trees, designed certainly and drawn in all probability by Boccaccio himself. At the top of each tree is a large circle, in which is written the name of a divinity. From this circle descends a stem which now expands into other lesser circles, now sends forth leaves, and now branches, which in their turn expand into circles and send forth leaves and lesser branches. In the center of each circle or leaf a name is written. The circles are used for those divinities whose progeny is represented in the same tree; the leaves, for divinities whose progeny is not represented. In the circles the words qui genuit [ie. who fathered] follow each masculine name, and the words quae peperit [ie. who bore] each feminine name. Similar trees certainly appeared in the earlier lost autograph, from which all the apograph manuscripts are derived; and similar trees appear in several apographs, and in the fourth and all later editions of the Genealogia.

So far as I can ascertain, Boccaccio's trees are the earliest secular genealogical trees properly so called: that is to say, the first non-biblical genealogical charts in which stems, branches, and leaves appear.

This claim of priority has apparently gone unchallenged by later workers; eg. Christiane Klapisch-Zuber (1991. The genesis of the family tree. I Tatti Studies in the Italian Renaissance 4: 105-129) notes:

It may well be that Boccaccio was the first to combine the old graphic system of medallions in the descending order typical of medieval genealogies with the implications of a vegetal theme.

The vegetal image is quite obvious, although the leaves do vary widely in form within any one manuscript, and also from copy to copy. In the autograph they are palmately five-lobed. In some trees the different generations are indicated by variation in the colour of the branches.

To me, each of these diagrams looks more like a vine than a tree, especially with the root at the top.

Moreover, some of the printed editions do not contain the genealogies, and in others their form is modified. For example, some have a portrait of the progenitor divinity, and others bear scrolls or circles instead of leaves. Some of the trees have extra (empty) leaves or scrolls. It is thus quite clear that the tree metaphor for the pedigrees was not seen as important at the time.

Nevertheless, it is important to note that in the first two editions of the Italian translation by Giuseppe Betussi (1547 & 1554; but not in later editions) the first genealogy is drawn as an actual tree rooted in the ground, with the name of the progenitor appearing at the base of the trunk. Klapisch-Zuber notes:

In comparison with Boccaccio's divinely radiant foliage, this image must strike us as mean and desiccated. And yet, it is the triumph of the genealogical tree as we know it, planted right side up; and any one in the modern world can use it to evoke his ancestors and to express his faith in the survival of his lineage.

Wednesday, April 13, 2016

When playing the cognate hunting game or the etymology identification game in
historical linguistics, there are many different rules that one needs to keep
in mind. Words that look similar are not necessarily related — they could be
simple look-alikes (Trask 2000:202). If words are too similar, they could be borrowings. If we quote colleague
X from the camp of linguists believing in theory t₁ we should make sure
that we also quote colleague Y from the camp of linguists believing in the
theory t₂, especially if we do not know the peer reviewers, etc.

A particularly important rule, one that is often surprising for biologists, says that we can only compare languages that we know are related. We
could, of course, compare all languages in the world (and people do compare all
languages in the world), but the point is that we are not allowed to compare languages
historically unless we know whether they share a common origin. This rule is
reflected in a long-standing debate regarding the question of how we can
prove that two languages are related. Here, we have basically two opposing
camps, one claiming that only grammar can prove language relationship, and one
claiming that only the lexicon is suitable for that task (Dybo and
Starostin 2008, Campbell and Poser 2008).

That we have to prove that two or more languages are related before we can
start to compare them is in strong contrast to biology, although the idea of multiple origins as an alternative to a single origin has also been discussed in evolutionary
biology (David has shown this in an earlier blogpost dealing with networks with multiple roots). In linguistics, however, we are largely agnostic regarding the common origin of
all languages, and the degree of agnosticism may even go so far that it acquires a
missionary zeal. Attempts to explain how language evolved, that is, how
language originated as a means for communication, always run the danger of
being ridiculed by the linguistic community. Under very bad circumstances, they
can even cast a very dark shadow on the linguistic reputation of those who
proposed them.

Affirming our disinterest in the origin of language has a long tradition.
In its Statuts from 1866 (published in 1871), the
Société de Linguistique de Paris declared that it would not support any
research on the origin of language. Even August Schleicher, the father of the
language tree, affirmed this attitude in a letter to Ernst
Haeckel (Schleicher 1863: 22), where he wrote:

It is impossible to presuppose a material descent of all languages from a single proto-language. (My translation, original text: "Eine so zu sagen materielle Abstammung aller Sprachen von einer einzigen Ursprache können wir also unmöglich voraussetzen.")

Although it is not
explicitly spelled out nowadays, these statutes are still active in most
linguistic institutes.

Being agnostic about the origin of language means that we cannot exclude the
possibility that two languages, like, say, Chinese and English, are ultimately
not related at all. And if they are ultimately not related, it would be futile
to compare them in the hope of finding linguistic material that goes back to
their common ancestor. Biologists, who usually take the Tree of Life for
granted (albeit a bush in the end), might ask themselves about the reasoning behind this agnosticism in linguistics. The reasons are rather simple to state: If we
make the very conservative assumption, based on archeological records, that
human language originated about 100,000 years ago (Dediu and Levinson
2013), and contrast it with the first written records of
languages (about 5,000 years ago), and the presumed time depths of our current
comparative method (Meillet 1925, Weiss
2014), which optimistically allows us to reach back 10,000
years in time, we simply do not have the means to make any qualified
linguistic hypothesis regarding the origin of all those 7,000 and more
languages spoken today (count based on Hammarström et al. 2015).

The reasons why linguists prefer to maintain an agnostic attitude are
completely comprehensible to me. Whether it is good to be agnostic is
another question. And whether it is good to be as militant as some linguists are regarding the question of language origin is yet another one. For the
context of evolutionary biology, for example, a little bit of agnosticism
regarding the Tree of Life might bring up interesting dynamics. The same could
be said about a little bit of "faith" in linguistics, be it that one believes
that language originated independently in multiple places at the same or
different times, or be it that one supports a monophyletic origin of a
"Language of Eden". Neither of the theories has an immediate impact on the way
we pursue our historical comparison of languages. Even under a monogenesis
assumption we would still need to prove a close affinity between languages
before we could start comparing them with our traditional methods.

In the long
run, however, it might help us to get some of the tension out of our
long-standing debates. If we took monogenesis for granted, for example, people
would be less afraid of comparing random pairs of languages, and in the long
run we could gain new insights into distant relationships. If we rejected
monogenesis, on the other hand, we could try to identify how many times language
originated independently.

It is (and here you see my own agnostic attitude) not really important whether we stick to monogenesis or polygenesis in the end. What is important is that we are clear about the consequences that either of these two theories might have on our research in the future. Agnosticism is a useful attitude as long as it does not prevent us from asking questions. Following up on David's earlier blogpost,
it seems clear to me that linguists in particular might profit a lot
from rooted network approaches that allow for multiple roots, since these would allow us to keep our agnosticism without suppressing our curiosity.

Monday, April 4, 2016

The drawing of large genealogies is not easy, and phylogeneticists (among others) have tried a number of solutions, including circular diagrams as well as interactively zoomable displays. One interesting solution that does not appear to have yet been used in phylogenetics is the concept of GeneaQuilts.

The web page has a video introducing the concept, which does a better job than I can do here. The basic idea is to abandon the tree / network representation, and to use a diagonally-filled matrix instead, where the rows are individuals and the columns show parent-offspring relationships.

Here is an example genealogy, based on the reported relationships among the Greek Gods.

If the relationships are tree-like then the diagram will be concentrated on the diagonal of the matrix. However, network relationships (inbreeding) will cause off-diagonal elements, two of which are shown in the example: one involves Hades and his niece Persephone.
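
The matrix idea can be sketched in a few lines of Python. This is only my own simplified encoding for illustration, not the input format or layout algorithm of the GeneaQuilts program: each column is one parent-offspring unit, each row is an individual, and a cell is marked P (parent in that unit) or C (child in that unit). The three units follow the usual versions of the Greek myths, with the childless pairing of Hades and Persephone included as a column of its own.

```python
# Each family unit (one column) lists its parents and its children.
# The encoding is an assumption made for illustration only.
families = [
    {"parents": ["Cronus", "Rhea"], "children": ["Zeus", "Demeter", "Hades"]},
    {"parents": ["Zeus", "Demeter"], "children": ["Persephone"]},
    {"parents": ["Hades", "Persephone"], "children": []},
]

def individuals_in_order(families):
    """List individuals in order of first appearance, parents before children."""
    seen = []
    for fam in families:
        for name in fam["parents"] + fam["children"]:
            if name not in seen:
                seen.append(name)
    return seen

def quilt_matrix(families):
    """Build a row per individual and a column per family unit."""
    people = individuals_in_order(families)
    matrix = {person: ["."] * len(families) for person in people}
    for col, fam in enumerate(families):
        for parent in fam["parents"]:
            matrix[parent][col] = "P"
        for child in fam["children"]:
            matrix[child][col] = "C"
    return people, matrix

people, matrix = quilt_matrix(families)
for name in people:
    print(f"{name:<11}" + " ".join(matrix[name]))
```

In this toy version most rows have their marks in adjacent columns, so the filled cells step down the matrix. The row for Hades is the exception: he is a child in the first column and a parent in the third, skipping the generation in between, because he is paired with his niece.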

Several much larger examples are displayed on the GeneaQuilts website. There is a program that can be downloaded, which takes as its input standard family-history files.

There seems to be no intrinsic reason why this display form could not also be used in phylogenetics.