Tuesday, January 31, 2017

There is a long-standing debate in linguistics regarding the best proof of deep
relationships between languages. Scholars often break it down to the question of
words vs. rules, or lexicon vs. grammar. However, this is essentially
misleading, since it suggests that only one type of evidence could ever be
used, whereas most of the time it is the accumulation of multiple pieces of
evidence that helps to convince scholars. Even if this debate is misleading, it
is interesting, since it reflects a general problem of historical linguistics:
the problem of similarities between languages, and how to interpret them.

Unlike (or like?) biology, linguistics has a serious problem with
similarities. Languages can be strikingly similar in various ways. They can
share similar words, but also similar structures, similar ways of expressing things.

In Chinese, for example, new words can be easily created by compounding
existing ones, and the word for 'train' is expressed by combining huǒ 火
'fire' and chē 車 'wagon'. The same can be done in languages like German and
English, where the words Feuerwagen and fire wagon will be slightly
differently interpreted by the speakers, but the constructions are nevertheless
valid candidates for words in both languages. In Russian, on the other hand,
it is not possible to just put two nouns together to form a new word, but one
needs to say something like огненная машина (ognyonnaya mašína), which literally could be translated
as 'fiery wagon'.

Neither German nor English is historically closely related to
Chinese, but German, English, and Russian go back to the same relatively recent ancestral
language. We can see that whether or not a language allows compounding of two words to
form a new one is not really indicative of its history, and neither is the
question of whether a language has an article, or whether it has a case system.

The problem with similarities between languages is that the apparent similarities may have different
sources, and not all of them are due to historical development. Similarities
can be:

coincidental (simply due to chance),

natural (being grounded in human cognition),

genealogical (due to common inheritance), and

contact-induced (due to lateral transfer).

As an example of the first type of similarity, consider the Modern Greek word θεός
[θɛɔs] ‘god’ and the Spanish dios [diɔs] ‘god’. Both words look similar and sound
similar, but this is a sheer coincidence. This becomes clear when comparing
the oldest ancestor forms of the words that are reflected in written sources,
namely Old Latin deivos, and Mycenaean Greek thehós (Meier-Brügger 2002:
57f).

As an example of the second type of similarity, consider the
Chinese word māmā 媽媽 'mother' vs. the German Mama 'mother'. Both words are
strikingly similar, not because they are related, but because they reflect the process of language acquisition by children, which usually starts with vowels
like [a] and the nasal consonant [m] (Jakobson 1960).

An
example of genealogical similarity is the German Zahn and the English tooth, both
going back to a Proto-Germanic form *tanθ-. Contact-induced similarity (the fourth type) is
reflected in the English mountain and the French montagne, since the former was
borrowed from the latter.

We can display these similarities in the following decision tree, along with
examples from the lexicon of different languages (see List 2014:
56):

Four basic types of similarity in linguistics

In this figure, I have highlighted the last two types of similarity (in a box) in order to
indicate that they are historical similarities. They reflect individual
language development, and allow us to investigate the evolutionary history of
languages. Natural and coincidental similarities, on the other hand, are not
indicative of history.

When trying to infer the evolutionary history of languages, it is thus crucial
to first rule out the non-historical similarities, and then the contact-induced
similarities. The non-historical similarities will only add noise to the
historical signal, and the contact-induced similarities need to be separated
from the genealogical similarities, in order to find out which languages share
a common origin and which languages have merely influenced each other some time
during their history.

Unfortunately, it is not trivial to disentangle these similarities.
Coincidence, for example, seems to be easy to handle, but it is notoriously
difficult to calculate the likelihood of chance similarities. Scholars have
tried to model the probability of chance similarities mathematically, but their
models are far too simple to provide us with good estimations, as they usually
only consider the first consonant of a word in no more than 200 words of each
language (Ringe 1992, Baxter and Manaster Ramer
2000, Kessler 2001).

The problem here is
that everything that goes beyond word-initial consonants would have to take the
probability of word structures into account. However, since languages differ
greatly regarding their so-called phonotactic structure (that is, the sound
combinations they allow to occur inside a syllable or a word), an account of
chance similarities would need to include a probabilistic model of possible and
language-specific word structures. So far, I am not aware of anybody
who has tried to tackle this problem.
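
To make the word-initial approach more concrete, here is a minimal sketch of a permutation test on initial segments. It is not a reimplementation of any of the studies cited above. The word lists are invented toy strings, and the function names `initial_matches` and `chance_match_p_value` are hypothetical. The idea is simply to ask how often a random re-pairing of meanings and forms yields as many initial matches as the real pairing does.

```python
import random


def initial_matches(words_a, words_b):
    """Count meaning slots whose two words begin with the same segment."""
    return sum(a[0] == b[0] for a, b in zip(words_a, words_b))


def chance_match_p_value(words_a, words_b, trials=10_000, seed=0):
    """Estimate how often random re-pairings match at least as well as observed."""
    rng = random.Random(seed)
    observed = initial_matches(words_a, words_b)
    shuffled = list(words_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break the link between meaning and form
        if initial_matches(words_a, shuffled) >= observed:
            hits += 1
    return observed, hits / trials


# Invented toy forms for six meaning slots (assumption: not real language data).
lang_a = ["pata", "miru", "tolo", "nesi", "saku", "kemo"]
lang_b = ["peto", "mara", "tuli", "noso", "siki", "kuma"]

observed, p_value = chance_match_p_value(lang_a, lang_b)
print(observed, p_value)
```

The sketch looks only at the first segment of each word, which is exactly the simplification criticized above. Anything beyond that would need a model of each language's word structures.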

Even more problematic is the second type of similarity. At first sight, it
seems that one could capture natural similarities by searching for
similarities that recur in very diverse locations of the world. If we compare,
for example, which languages have tones, and we find that tones occur almost
all over the world, we could argue that the existence of tone languages is not
a good indicator of relatedness, since tonal systems can easily develop
independently.

The problem with independent development, however, is again
tricky, as we need to distinguish different aspects of independence.
Independent development could be due to: human cognition (the fact that many
languages all over the world denote the bark of a tree with a compound
tree-skin is obviously grounded in our perception); or due to language acquisition
(like the case of words for 'mother'); but potentially also due to
environmental factors, such as the size of the population of speakers
(Lupyan et al. 2010), or the location where the languages
are spoken (see Everett et al. 2015, but also compare the
critical assessment in Hammarström 2016).

Convergence (in linguistics, the term is used to
denote similar development due to contact) is a very frequent phenomenon in
language evolution, and can happen in all domains of language. Often we simply
do not know enough to make a qualified assessment as to whether certain
features that are similar among languages are inherited/borrowed or have
developed independently.

Interestingly, this was first emphasized by Karl
Brugmann (1849-1919), who is often credited as the "father of cladistic
thinking" in linguistics. Linguists usually quote his paper from
1884, in order to emphasize the crucial role that Brugmann
attributed to shared innovations (synapomorphies in the cladistic
terminology) for the purpose of subgrouping. When reading this paper
thoroughly, however, it is obvious that Brugmann himself was much less obsessed
with the obscure and circular notion of shared innovations (which also
holds for cladistics in biology; see De Laet 2005) than
with the fact that it is often impossible to actually find them, due to our
incapacity to disentangle independent development, inheritance and
borrowing.

So far, most linguistic research has concentrated on the problem of
distinguishing borrowed from inherited traits, and it is here that the fight
over lexicon or grammar as primary evidence for relatedness primarily developed.
Since certain aspects of grammar, like case inflection, are rarely transferred
from one language to another, while words are easily borrowed, some linguists
claim that only grammatical similarities are sufficient evidence of language
relationship. This argument is not necessarily productive, since many languages
simply lack grammatical structures like inflection, and will therefore not be
amenable to any investigation, if we only accept inflectional morphology
(grammar) as rigorous proof (for a full discussion, see Dybo and Starostin
2008). Luckily, we do not need to go that far.
Aikhenvald (2007: 5) proposes
the following borrowability scale:

Aikhenvald's (2007) scale of borrowability

As we can see from this scale, core lexicon (basic vocabulary) ranks second,
right behind inflectional morphology. Pragmatically, we can thus say: if we
have nothing but the words, it is better to compare words than anything else.
Even more important is that, even if we compare what people label "grammar",
we compare concrete form-meaning pairs (e.g., concrete plural-endings), and
we never compare abstract features (e.g., whether languages have an article).
We do so in order to avoid the "homoplasy problem" that causes so many
headaches in our research. No biologist would group insects, birds, and bats based on their wings; and no linguist would group Chinese and English due to their
lack of complex morphology and their preference for compound words.

Why do I mention all this in this blog post? For three main reasons. First, the
problem of similarity is still creating a lot of confusion in the
interdisciplinary dialogues involving linguistics and biology. David is right:
similarity between linguistic traits is more like similarity in morphological
traits in biology (phenotype), but too often, scholars draw the analogy with genes (genotype)
(Morrison 2014).

Second, the problem of disentangling
different kinds of similarities is not unique to linguistics, but is also present
in biology (Gordon and Notar 2015), and comparing the
problems that both disciplines face is interesting and may even be inspiring.

Third, the problem of similarities has direct implications for our null
hypothesis when considering certain types of data. David asked in a recent blog post: "What is the null hypothesis for a
phylogeny?"
When dealing with observed similarity patterns across different languages, and
recalling that we do not have the luxury to assume monogenesis in language
evolution,
we might want to know what the null hypothesis for these data should be. I have
to admit, however, that I really don't know the answer.

Wednesday, January 25, 2017

Ruben E. Valas and Philip E Bourne (2010. Save the tree of life or get lost in the woods. Biology Direct 5: 44) have an interesting discussion of the relationship between the Tree of Life and the Web of Life. They argue that:

Function follows more of a tree-like structure than genetic material, even in the presence of horizontal transfer ... We propose a duality where we must consider variation of genetic
material in terms of networks and selection of cellular function in
terms of trees. Otherwise one gets lost in the woods of neutral
evolution.

As an aside, they also note:

We must keep in mind the humor of calling the central metaphor for evolution "the tree of life". The phrase first appears in Genesis 2:9 ... There is irony in using the name of a tree central to the creation story to argue against that very myth.

There is clearly a duality in Darwin's theory of descent with modification: the history of variation is well described by a network and the history of selection is well described by a tree.

Tuesday, January 17, 2017

As noted in the previous blog post (Why do we need Bayesian phylogenetic information content?), phylogeneticists rarely consider whether their data actually contain much phylogenetic information. Nevertheless, the existence of information content in a dataset implies the existence of a null hypothesis of "no information", relative to the objective of the data analysis.

In this regard, Alexander Suh (2016), in a paper on the phylogenetics of birds, makes two important general points:

Every phylogenetic tree hypothesis should be accompanied by a phylogenetic network for visualization of conflicts.

Hard polytomies exist in nature and should be treated as the null hypothesis in the absence of reproducible tree topologies.

It is difficult to argue with the first point, of course. However, the second point is also an interesting one, and deserves some consideration. Suh notes that: "In contrast to ‘soft polytomies’ that result from insufficient data, ‘hard polytomies’ reflect the biological limit of phylogenetic resolution because of near-simultaneous speciation". That is, the distinction is whether polytomies result from simultaneous branching events (hard) or from insufficient sequence information (soft).

The matter of a suitable null hypothesis in phylogenetics has been considered before, for example by Hoelzer and Melnick (1994) and Walsh et al. (1999), who come to essentially the same conclusion as Suh (2016). Clearly, a network cannot be the null hypothesis for a phylogeny,
nor can a resolved tree (even partially resolved); the only logical
possibility is a polytomy.

However, it seems to me that the current null hypothesis is effectively a soft polytomy, although no hypothesis is ever explicitly stated by most workers. Nevertheless, any evidence to resolve polytomies seems to be accepted, with evidence taken in descending order of strength in order to resolve any conflicting evidence. This inevitably produces a tree that is at least partly resolved, which is the alternative hypothesis.

On the other hand, resolving a hard polytomy requires unambiguous evidence for each branch in the phylogeny. If there is substantial conflict then it can only be resolved as a reticulation, or it must remain a polytomy. The existence of a reticulation, of course, results in a network, not a tree, so that the alternative hypothesis is a network, which may in practice be very tree-like.

As a final point, Suh claims that: "Neoaves comprise, to my knowledge, the first empirical example for a hard polytomy in animals." This is incorrect. There is also a hard polytomy at the root of the Placental Mammals, as discussed in this blog post: Why are there conflicting placental roots?

Tuesday, January 10, 2017

There are many ways to construct a phylogenetic tree, and after we have done so we are usually expected to indicate something about "branch support", such as bootstrap values or bayesian posterior probabilities. Rarely, however, do people indicate whether there is much tree-like phylogenetic information in their dataset in the first place — it is simply assumed that there must be (fingers crossed, touch wood).

Recently, this latter issue has been addressed for bayesian analysis by:

They develop a methodology for "measuring information about tree topology using marginal posterior distributions of tree topologies", and apply it to two small empirical datasets. That is, we can now work out something about "[substitution] saturation and detecting conflict among
data partitions that can negatively affect analyses of concatenated data."

However, we have long been able to do this with data-display phylogenetic networks. More to the point, we can do it in a second or two, without ever constructing a tree. More pedantically, if the network construction produces a tree, then we know there is tree-like phylogenetic information in the dataset; if we get a network then there is little such information. Equally importantly, the network might tell us something about the patterns of non-tree-likeness, which a single-number measurement cannot.
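
For readers who want to see what "tree-like" means for the distances that such a network is built from, here is a small sketch of the four-point condition, expressed as a quartet delta score. It is an illustration under assumptions and not a replacement for looking at the network. The two distance matrices are invented toy values, and the function names `quartet_delta` and `mean_delta` are hypothetical.

```python
from itertools import combinations


def quartet_delta(d, x, y, u, v):
    """Delta score of one quartet: 0 is perfectly tree-like, 1 is maximally box-like."""
    # The three ways of pairing four taxa give three distance sums.
    sums = sorted([d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]])
    spread = sums[2] - sums[0]
    # On a tree the two largest sums are equal, so the numerator is zero.
    return 0.0 if spread == 0 else (sums[2] - sums[1]) / spread


def mean_delta(d):
    """Average the quartet delta score over all quartets of taxa in the matrix."""
    scores = [quartet_delta(d, *q) for q in combinations(range(len(d)), 4)]
    return sum(scores) / len(scores)


# Toy distances that fit the tree ((A,B),(C,D)) exactly (invented values).
tree_like = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]

# Toy distances with two equally supported, conflicting groupings (invented values).
box_like = [
    [0, 2, 2, 3],
    [2, 0, 3, 2],
    [2, 3, 0, 2],
    [3, 2, 2, 0],
]

print(mean_delta(tree_like))  # prints 0.0
print(mean_delta(box_like))   # prints 1.0
```

A single number like this says how much conflict there is, but not where it sits among the taxa, which is the part the network shows.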

Let's take the first empirical dataset, as described by the authors:

The five sequences of rps11 composing the data set BLOODROOT [three taxa from the angiosperm family Papaveraceae and two monocots] ... were chosen because they represent a case in which horizontal transfer of half of the gene results in different true tree topologies for the 5′ (219 nucleotide sites) and 3′ (237 nucleotide sites) subsets, which allows investigation of information content estimation in the presence of true conflicting phylogenetic signal. We analyzed each half of the data separately and measured phylogenetic dissonance, which is expected to be high in this case.

Here is the NeighborNet based on uncorrected distances. The idea that there is something non-tree-like about Sanguinaria seems hard to avoid. Indeed, the network pattern makes recombination an obvious first choice, with part of the sequence matching the Papaveraceae (on the left) and part matching the monocots (on the right). This recombination may be due to HGT.
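
As a reminder of what an uncorrected distance is, and of why a recombinant sequence ends up between two groups, here is a minimal sketch. The three sequences are invented toy strings, not the BLOODROOT data, and the function name `p_distance` is hypothetical.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Uncorrected distance: share of compared sites where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Skip sites where either sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


# Invented toy alignment: the mosaic copies group 1 in its first half
# and group 2 in its second half.
group1 = "AAAAAAAA"
group2 = "CCCCCCCC"
mosaic = "AAAACCCC"

print(p_distance(group1, mosaic))          # prints 0.5
print(p_distance(group2, mosaic))          # prints 0.5
print(p_distance(group1[:4], mosaic[:4]))  # prints 0.0
print(p_distance(group2[4:], mosaic[4:]))  # prints 0.0
```

Over the whole alignment the mosaic is equally distant from both groups, which is the kind of signal a network draws as a box. Each half taken separately is perfectly tree-like.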

Now for the second dataset:

The data set ALGAE comprises chloroplast psaB sequences from 33 taxa of green algae (phylum Chlorophyta, class Chlorophyceae, order Sphaeropleales) ... The alignments of just the psaB gene ... were chosen because of their deep divergence, which invites hasty judgements of saturation, especially of third codon position sites. We analyzed second and third codon position sites separately ... to assess which subset has more phylogenetic information.

Here are the two NeighborNets based on uncorrected distances. Once again, it is immediately obvious that the third-codon positions have almost no information at all, even for a network, let alone a tree — the terminal branches do not connect in any coherent way. The second-codon positions do have some information, but it is so contradictory that one could not construct a reliable tree. Saturation of nucleotide substitutions is a likely candidate for this situation; and some correction for this saturation would be needed even to construct a reasonable network from these data.

Tuesday, January 3, 2017

Google Trends looks at recent trends in web searches, and it has been used to study patterns in web activity for many concepts. This is similar to The Ngram Viewer in Google Books (see the post Ngrams and phylogenetics). Google Trends aggregates the number of web searches that have been performed for any given search term (or terms), and it can display the results as a time graph, for any given geographical region. The Trends searches are somewhat restrictive, but they may show us something about the period 2004-2016 (inclusive).

So, I thought that it might be interesting to look at a few expressions of relevance to readers of this blog. The Trends graphs show changes in the relative proportion of searches
for the given term (vertically) through time (horizontally). The
vertical axis is scaled so that 100 marks the time of peak
popularity, measured as a fraction of the total number of searches (i.e. the scale
shows the proportion of searches, with the maximum always shown as 100,
no matter how many searches there were).
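
The scaling can be sketched in a few lines. This is only an illustration of the description above, not Google's actual procedure. The counts are hypothetical, and `trends_scale` is an invented name.

```python
def trends_scale(term_searches, total_searches):
    """Rescale per-period search shares so that the peak period equals 100."""
    shares = [t / n for t, n in zip(term_searches, total_searches)]
    peak = max(shares)
    return [round(100 * s / peak) for s in shares]


# Hypothetical counts for three periods: searches for the term go up,
# but total searches go up faster.
term = [1000, 1200, 1500]
total = [100_000, 200_000, 500_000]

print(trends_scale(term, total))  # prints [100, 60, 30]
```

Note that the scaled value falls in this toy example even though the raw number of searches for the term rises, because only the share of all searches is shown.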

As you can see, the term "phylogenetics" has maintained its popularity over "historical linguistics". However, it has decreased in popularity through time much more than has "historical linguistics". Nevertheless, both decreases are very small compared to that for the term "bioinformatics", as discussed in the blog post on Bioinformaticians look at bioinformatics.

It is not necessarily clear to me why many technical terms have decreased in Google searches through time, although there are several possibilities. First, it could be Google itself. The Trends numbers represent search volume for a keyword relative to the total search volume on Google. So, actual search numbers for the technical terms could be increasing while as a fraction of total search volume of the internet they are decreasing, if total Google search volume is increasing.

Alternatively, Business Insider has noted that "search is facing a huge challenge ... consumers are increasingly shifting [from desktop] to mobile. On mobile, consumers say they just don't search as much as they used to because they have apps that cater to their specific needs. They might still perform searches within those apps, but they're not doing as many searches on traditional search engines". Furthermore, "people are discovering content through social media. The top eight social networks drove more than 30% of traffic to sites in 2014".

The extra raggedness in search popularity in the first couple of years of the graph probably reflects inadequacies in the Google Trends dataset in the early years (as discussed by Wikipedia). The same is true for the next graph, as well.

The "phylogenetic tree" searches have been more popular than "evolutionary tree", just as was true for the Google Books usage discussed in the post Ngrams and phylogenetics. However, the "phylogenetic tree" searches show a distinctly bimodal pattern every year. This presumably reflects teaching semesters — few people search for technical terms out of term time!

Unfortunately, it is not possible to look at the term "phylogenetic network", because Google Trends tells me that there is "Not enough search volume to show results". How rude!