Tuesday, July 26, 2016

Of course they can. Biologists who know nothing about linguistics can learn a
lot about linguistics from linguists, including the most nerdy, the most
boring, and the most interesting things.

However, it is obvious that the
question in the title of this post implies a different object of learning, and
a more precise title would have been "Can biologists learn about evolution from linguists?" As a linguist, I would of course also provide an affirmative answer, but I
doubt that most biologists would agree. At the moment, we have a situation in
which the majority of interdisciplinary papers state that linguists can learn
from biologists. The opposite claim, that biologists can learn from linguists, is rarely made.

The tenor of most recent studies, especially those published during the past one to two decades, is often that we have
finally, and surprisingly, realized that language evolution is largely the same as biological
evolution (for a recent account in this direction, see Pagel
2016). As a result, it is claimed that we can easily use biological methods to
study language evolution. We need to use them, since linguistics is in a poor state with no methods of its own, and linguists have never quantified what
they know about the history of their languages. Then, finally, with these new
methods developed in biology, we see light at the end of the tunnel, and we can
draw nice trees of our languages and see how they evolved into their current
shape.

I am completely in favour of increasing objectivity in historical linguistics,
making it a more data-driven and a more transparent discipline. I also advocate
interdisciplinary transfer of methods and models, and there are quite a few
things that we in linguistics can actually learn from biologists. What I do not
like is this tone, which suggests that biology is the discipline that saved
linguistics, waking it up from its 200-year-long sleep in the ivory tower. At the
same time, I also do not like the horror scenarios in traditional linguistics,
which state that quantitative approaches would deprive our discipline of all its
wit (see the figure below for a not-too-serious attempt to visualize these two
perspectives). In this context, it is quite interesting to look back in history
and to recapitulate what actually happened.

The biological storm of bits and bytes: Will it destroy the ivory tower of historical linguistics
or ultimately help it to shine with a new gloss?

The discipline of historical linguistics is about 200 years old, starting with the legendary scholarly work of people like Rasmus Rask (Rask
1818), Jakob Grimm (Grimm 1822), and Franz
Bopp (Bopp 1816). Using family trees to model language history
goes back to the 17th century, pre-dating the first networks in biology by one
century (see David's overview in Morrison 2016). The first
explicit alignments showing homologous sounds across words occur at least as
early as the beginning of the 20th century (Dixon and Kroeber
1919), cladistic frameworks date back to the second half of
the 19th century (Brugmann 1886), and even algorithms for
tree reconstruction based on distance data date back to the 1960s
(Dyen's comment in Hymes 1960).

The discipline of historical linguistics can look back on a remarkable history
of excellent scholarship. Thanks to this scholarship, we have gained invaluable
insights, not only into the history of the world's languages, but also into the
mechanisms that trigger linguistic diversity. It is undeniable that methods
from evolutionary biology have given us some fresh insights during the past
20 years, but their actual influence is often exaggerated. On the one
hand, our experience (since the quantitative turn in historical linguistics)
shows that in most cases we cannot use biological methods to analyze our data
directly. Instead, we need to carefully adapt them to our needs in order to
get the best out of them (as I have tried to show in more detail in List 2014).

On the other hand,
there is no example from the past 20 years, that I know of, in which
modern biological methods have really revolutionized our insights into
language history. They have undeniably shifted our attention towards data and
quantification. They have exposed weak spots in our argumentation, and they
have forced us to restate questions that we had forgotten to ask. But no new
language family has been detected, no deeper genealogies between existing
languages have been proposed, and no deeper insights into human prehistory have
been achieved by the use of biological methods alone. Historical linguistics has profited from evolutionary biology, not as a small oasis in the
desert that was given water and seeds by the lords of bits and bytes, but as a
discipline in which scholars learned to make active and critical use of
interdisciplinary approaches.

Linguistics to biology

This brings us back to the question of the title. Can biology learn
from linguistics? It has undoubtedly done so in the past. Tree-drawing
in biology, for example, was popularized by Ernst Haeckel, who was himself
influenced by the linguist August Schleicher (Sutrop 2012: 300). In the early days of genetics, a multitude of metaphors were borrowed from
linguistics to describe biological phenomena, with words like "alphabet", "word" (Gamow 1954), or "translation" (Crick 1959).

While not all biologists have been in favor of this tendency (see, for example, Shanon 1978),
and the borrowing of terms does not necessarily imply methodological
transfer, we also find examples of the explicit transfer of methods and
theories from the linguistic to the biological domain. As an example,
consider the theory of formal grammar (Chomsky 1959), which still plays a very important role in addressing certain problems in bioinformatics (Searls 1997),
like RNA folding and protein structure analysis. Biological textbooks
on sequence comparison still tend to include a chapter on formal
grammars and their application in biology (Durbin et al. 1998).
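
To give a flavour of what a formal grammar looks like when it is applied to a sequence, here is a minimal Python sketch. It is a toy and not the method of Searls (1997) or Durbin et al. (1998): the grammar itself, the function name derive and the min_loop parameter are assumptions made purely for illustration. The grammar has two rules, S -> x S y for complementary bases x and y, and S -> L for a loop of unpaired bases, so it can describe a single RNA hairpin and nothing more.

```python
# Complementary base pairs allowed by the toy grammar (an assumption made
# for this sketch; wobble pairs such as G-U are deliberately left out).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def derive(seq, min_loop=3):
    """Derive seq from the start symbol S of the toy hairpin grammar.

    Rules:
        S -> x S y   where (x, y) is a complementary pair
        S -> L       where L is a loop of at least min_loop unpaired bases

    Returns the structure in dot-bracket notation, or None if seq
    cannot be derived from S.
    """
    # Rule S -> x S y: pair the outermost bases, then derive the inside.
    if len(seq) >= min_loop + 2 and (seq[0], seq[-1]) in PAIRS:
        inner = derive(seq[1:-1], min_loop)
        if inner is not None:
            return "(" + inner + ")"
    # Rule S -> L: everything that is left forms the unpaired loop.
    if len(seq) >= min_loop:
        return "." * len(seq)
    return None


print(derive("GGGAAAACCC"))  # three nested pairs around a four-base loop
print(derive("GA"))          # too short for even a loop
```

The point of the sketch is only that nested base pairs are the same kind of dependency as nested brackets, which is what context-free rules are good at describing. Real applications use stochastic grammars with branching rules, which this sketch leaves out.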

Biology could also profit from linguistic insights in the future, and this becomes
a bit clearer when we recall what Schleicher mentioned 150 years ago (and what has obviously been forgotten since then):

Observing how new forms descend from old ones can be done more
straightforwardly and on a larger scale in linguistics than in biology. For
once, the linguists have an advantage over the natural scientists.
(Schleicher 1863: 18, my translation)

The advantage of linguistics that Schleicher points out is the
availability of very concrete, very detailed, very valuable data.
This data allows us to see evolutionary forces at a level of detail of
which biologists can only dream. Written sources allow us to trace the history
of whole language families like Romance (and to some extent also Chinese
dialects) from their ancestral speech varieties down to today. Language change is
fast enough to allow us to investigate it in action. Recent topics in
biology, like the importance of invoking a system perspective in evolution,
have long been debated and discussed in linguistics (Tynjanow and
Jakobson 1928), since they are so much easier to detect there.

In the past, when I worked intensively on the implementation of the Minimal
Lateral Network method (Dagan and Martin 2007, Dagan et al. 2008) for
linguistic data (List et al. 2014, List 2015), I stumbled upon numerous
examples showing the limits of tree topology as a predictor for lateral
transfer events. Given that the same necessarily also holds for lateral gene
transfer, I asked myself whether these false positives and false negatives in
the analyses would simply not matter due to the large amount of data in
biology, or whether they were ignored due to the lack of good data for
algorithmic evaluation. Later, when I read David's post on Tardigrades and
phylogenetic networks, where he pointed to two analyses of the same data that
explained them once with lateral gene transfer (Boothby et al. 2015) and
once with errors in the data (Koutsovoulos 2015),
I became aware of the strong advantage of my linguistic data, since I could
test it against written records, tracing the history of words through
centuries, thus being able to spot errors immediately when looking up a data
point.

The detail of our data in linguistics is both a blessing and a curse. It enables us to
write detailed word histories without ever having heard of tree reconciliation
methods. On the other hand, it tempts us to get lost in details, forgetting
about the bigger picture and the bigger questions that we could ask if this
data were properly digitized and formalized. In this regard, historical
linguistics still needs to learn from biology, as we have failed to turn
historical linguistics into a modern, data-driven discipline. With more and
more detailed data becoming available, however, the day will come when
Schleicher is proven right, and when biologists can learn from linguists about
evolution.


Tuesday, July 12, 2016

The Tree of Knowledge is a well-known concept, and the tree can indeed be used to arrange information. One possible use is to describe the relationships of derivative products (i.e. the chemical derivatives of other substances). Indeed, these can be viewed as having a "phylogeny", since the processing follows a time sequence.

The U.S. Geological Survey (in the U.S. Department of the Interior) has provided one such example in Geological Survey Circular 1143 Coal — a Complex Natural Resource. The centerfold of that publication shows:

Coal byproducts in tree form showing basic chemicals as branches and derivative substances as twigs and leaves. [Modified from an undated public domain illustration provided by the Virginia Surface Mining and Reclamation Association.]

However, a tree is a simplification of a network, and the network can thus show more information. In this case, the same information has previously been illustrated using a reticulating network, not a tree.

This network has three reticulations, each showing a coal product that results from combining two different processing routes. It is thus a hybridization network.
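
The difference between the two representations can be stated in terms of parents: in a tree every node except the root has exactly one parent, whereas a reticulation is a node with two or more. Here is a minimal Python sketch of that check. The node names are made up for illustration and are not the coal products shown in either figure, and the function name reticulation_nodes is likewise an assumption.

```python
from collections import defaultdict


def reticulation_nodes(edges):
    """Return the nodes that have more than one parent.

    edges is a list of (parent, child) pairs describing a rooted,
    directed acyclic graph. A tree yields an empty list.
    """
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return sorted(node for node, ps in parents.items() if len(ps) > 1)


# Hypothetical example: two basic chemicals, A and B, with derivatives.
tree_edges = [
    ("root", "A"), ("root", "B"),
    ("A", "a1"), ("A", "a2"),
    ("B", "b1"),
]
# The same graph, but product a2 now also needs input from route B.
network_edges = tree_edges + [("B", "a2")]

print(reticulation_nodes(tree_edges))
print(reticulation_nodes(network_edges))
```

Forcing such a product into a tree means choosing one of its parents and dropping the other, which is the information a tree representation loses.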

Thanks to the Trees of Knowledge page (by Paul Michel) of the "Encyclopedias as Indicators of Change in the Social Importance of Knowledge, Education and Information" web site for pointing out this unexpected use of trees of knowledge.

Tuesday, July 5, 2016

It seems to me that the study of reticulate evolutionary histories currently boils down to two options:
(1) reconstructing a species "tree" from multiple gene trees using a coalescent model that includes hybridization (either homoploid or polyploid);
(2) reconciling multiple gene trees with a known [sic] species tree using a model that includes gene duplication, loss and transfer (as well as speciation) - a DTL model.

comprehensive as it includes the following evolutionary events: speciation, speciation-loss (speciation followed by a loss of one gene copy), gene duplication, gene loss, gene transfer and transfer-loss (gene transfer with loss of the original gene) between two sampled species, and gene transfer and transfer-loss from/to an unsampled species (i.e. a species that is not represented in the dataset) to/from a sampled one.

If the model is "comprehensive", then hybridization must be included. The only parts of the model that include reticulate histories are gene transfer and transfer-loss, so this is where hybridization must be. Possibly, polyploid hybridization is included in "gene transfer" (an increase in the number of gene copies), and homoploid hybridization is included in "transfer-loss" (maintaining the same number of genes).

This seems to be a simple example of the idea that different types of reticulation events cannot be distinguished from each other. Genomic material moves from one place to another in contemporaneous organisms, either sexually (introgression, hybridization) or asexually (lateral gene transfer). There is nothing intrinsic about gene trees to tell us which mechanism is involved in any given reticulation, other than the relative positions of the donor and recipient in the "species tree" and the possibility of time inconsistency.

This leads to the question of why horizontal gene movement is called "transfer" in one model (2) and "hybridization" in the other (1).