Tuesday, June 27, 2017

In historical linguistics, "linguistic reconstruction" is an
important task.
It can be divided into several subtasks, such as "lexical reconstruction",
"phonological reconstruction", and "syntactic reconstruction", and it comes
conceptually close to what biologists would call "ancestral state
reconstruction".

In phonological reconstruction, linguists seek to reconstruct the sound
system of the ancestral language or proto-language, the Ursprache that is no longer attested in written sources.
The term lexical reconstruction is less frequently used, but it
obviously points to the reconstruction of whole lexemes in the
proto-language, and requires sub-tasks, like semantic reconstruction
where one seeks to identify the original meaning of the ancestral word
form from which a given set of cognate words in the descendant languages
developed, or morphological reconstruction, where one tries to reconstruct the morphology, such as case systems, or frequently recurring suffixes.

In a narrow sense, linguistic reconstruction refers only to
phonological reconstruction, which is something like the holy grail of
computational approaches, since, so far, no method has been proposed that would convincingly show that one can do without expert insights.
Bouchard-Côté et al. (2013)
use language phylogenies to climb a language tree from the leaves to
the root, using sophisticated machine-learning techniques to infer the
ancestral states of words in Oceanic languages. Hruschka et al. (2015)
start from sites in multiple alignments of cognate sets of Turkic
languages to infer both a language tree and the ancestral states,
along with the sound changes that regularly occurred at the internal
nodes of the tree. Both approaches show that phylogenetic methods
could, in principle, be used to automatically infer which sounds were used
in the proto-language; and both approaches report rather promising
results.

Neither approach, however, is fully convincing, for both
practical and methodological reasons. First, they are applied to
language families that are considered to be rather "easy" to
reconstruct. The tough cases are larger language families with more
complex phonology, like Sino-Tibetan or any of its subbranches,
including even shallow families like Sinitic (Chinese), or
Indo-European, where the greatest achievements of the classical methods
for language comparison have been made.

Second, they rely on the mistaken
assumption that the sounds used in a set of attested languages
are necessarily the pool of sounds that would also be the best
candidates for the Ursprache. For example, Saussure (1879) proposed that Proto-Indo-European had at least two sounds that did not survive in any of the descendant languages,
the so-called laryngeals, which are nowadays commonly represented as h₁, h₂, and h₃,
and which leave complex traces in the vocalism and the consonant
systems of some Indo-European languages. Ever since then, it has been a standard assumption
that it is always possible that none of the ancestral sounds in a given
proto-language is still attested in any of its descendants.

A third interesting point, which I consider a methodological problem with these methods, is that both of them are based on language trees, which are either given to the algorithm or
inferred during the process.
Given that most if not all approaches to ancestral state reconstruction
in biology are based on some kind of phylogeny, even if it is a rooted
evolutionary network, it may sound strange that I criticize this point.
But in fact, when linguists use the classical methods to infer ancestral
sounds and ancestral sound systems, phylogenies do not necessarily play
an important role.

The reason for this lies in the highly directional nature of sound
change, especially in the consonant systems of languages, which
often makes it extremely easy to predict the ancestral sound without invoking
any phylogeny more complex than a star tree. That is, in linguistics we often have a good idea about directed character-state changes. For example, if a linguist observes a [k] in one set of languages and a [ts] in another set of languages in the same alignment site of multiple cognate sets, then they will immediately reconstruct a *k for the proto-language, since they know that [k] can easily become [ts]
but not vice versa. The same holds for many sound correspondence
patterns that can be frequently observed among all languages of the
world, including cases like [p] and [f], [k] and [x],
and many more. Why should we bother
about any phylogeny in the background, if we already know that it is
much more likely that these changes occurred independently? Directed character-state assessments make a phylogeny unnecessary.
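
The reasoning above can be sketched in a few lines of Python. This is only a toy illustration, not an established method: the table `DIRECTED_CHANGES` and the function `reconstruct_site` are hypothetical names, and the table contains only the sound pairs mentioned in the text, with the direction [p] > [f] and [k] > [x] assumed in the same way as [k] > [ts].

```python
# Hypothetical, hand-written table of directional preferences:
# each key is an older sound, each value the set of sounds it commonly turns into.
# Only the pairs named in the text are listed; the directions are assumptions.
DIRECTED_CHANGES = {
    "k": {"ts", "x"},
    "p": {"f"},
}


def reconstruct_site(reflexes):
    """Return the attested sounds from which every other observed sound
    in one alignment site can be derived in a single directed step.

    No tree is consulted: the reflexes are treated as an unordered set,
    which corresponds to a star phylogeny.
    """
    observed = set(reflexes)
    return [
        source
        for source in sorted(observed)
        if all(
            other == source or other in DIRECTED_CHANGES.get(source, set())
            for other in observed
        )
    ]


print(reconstruct_site(["k", "k", "ts", "ts", "k"]))  # ['k']
print(reconstruct_site(["f", "p", "f"]))              # ['p']
print(reconstruct_site(["ts", "x"]))                  # []
```

The last call returns an empty list because neither attested sound can be the source of the other under this table. The plausible ancestor is simply not among the attested sounds, so a lookup restricted to attested sounds cannot find it.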

Sound change in this sense is not well treated in any paradigm
that assumes some kind of parsimony, as the same change simply occurs too often
independently. The question is less acute with vowels, where scholars
have observed cycles of change in ancient languages that are attested
in written sources. Even more problematic is the change of tones, where
scholars have even less intuition regarding preference directions or
preference transitions; and also because ancient data does not describe the
tones in the phonetic detail we would need in order to compare it with
modern data.
In contrast to consonant reconstruction, where we can do almost
exclusively without phylogenies, phylogenies may indeed provide some
help to shed light on open questions in vowel and tone change.

But one
should not underestimate this task, given the systemic pressure that may
crucially impact on vowel and tone systems. Since there are
considerably fewer empty spots in the vowel and tone space of human
languages, it can easily happen that the most natural paths of vowel or
tone development (if they exist in the end) are counteracted by systemic
pressures. Vowels can be more easily confused in communication, and
this holds even more for tones. Even if changes are "natural", they
could create conflict in communication, if they produce very similar
vowels or tones that are hard for speakers to distinguish. As a
result, these changes could provoke mergers in sounds, with speakers no
longer distinguishing them at all; or alternatively, changes that are
less "natural" (physiologically or acoustically) could be preferred by a
speech community in order to maintain the effectiveness of the linguistic
system.

In principle, these phenomena are well known to trained linguists,
although it is hard to find any explicit statements in the literature. Paradoxically, linguistic reconstruction (in the sense of phonological
reconstruction) is hard for machines precisely because it is easy for trained
linguists. Every historical linguist has a catalogue of existing sounds
in their head as well as a network of preference transitions, but we
lack a machine-readable version of those catalogues. This is mainly because
transcription systems differ widely across subfields and families, and
because no efforts to standardize these transcriptions have been
successful so far.

Without such catalogues, however, any efforts to
apply vanilla-style methods for ancestral state reconstruction from
biology to linguistic reconstruction in historical linguistics will be
futile. What we need for linguistic reconstruction is not the trees, but the
network of potential pathways of sound change.
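
To make the idea of a machine-readable network of pathways a little more concrete, here is a minimal sketch, assuming a tiny hand-made directed graph. The edges in `SOUND_CHANGE_NETWORK` are illustrative placeholders rather than a real catalogue, and `reachable` and `proto_candidates` are hypothetical helper names. The point is only that, once such a network exists, candidate proto-sounds can be searched for among all sounds in the network, including ones that are not attested in any descendant.

```python
from collections import deque

# Toy network, for illustration only: an edge points from an older sound
# to a sound it can plausibly develop into. A real catalogue would be
# much larger and would have to be compiled by specialists.
SOUND_CHANGE_NETWORK = {
    "k": ["ts", "x"],
    "ts": ["s"],
    "x": ["h"],
    "p": ["f"],
    "f": ["h"],
}


def reachable(source):
    """Breadth-first search: all sounds derivable from `source`, itself included."""
    seen = {source}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for target in SOUND_CHANGE_NETWORK.get(current, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def proto_candidates(reflexes):
    """Sounds in the network from which every observed reflex can be derived."""
    observed = set(reflexes)
    all_sounds = set(SOUND_CHANGE_NETWORK) | {
        target for targets in SOUND_CHANGE_NETWORK.values() for target in targets
    }
    return sorted(s for s in all_sounds if observed <= reachable(s))


print(proto_candidates(["s", "h"]))  # ['k']
print(proto_candidates(["f", "h"]))  # ['f', 'p']
```

In the first call, the only candidate is [k], although neither descendant attests it. In the second call, two candidates remain, and choosing between them would need additional evidence, such as the systemic considerations discussed above.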

Tuesday, June 20, 2017

Lake Malawi, in south-eastern Africa, is famous for its large diversity of cichlid fishes. Indeed, it sometimes seems to have more biologists studying these fish than there are actual fish in the lake, even though there are allegedly hundreds of cichlid fish species in that lake. In this sense, it is somewhat similar to Lake Baikal, in southern Siberia, home to the sole species of freshwater seals.

The cichlid biologists are interested in describing the extensive fish diversity, pondering its origin, and thus its contribution to the study of speciation. After all, we are talking about what is usually claimed to be "the most extensive recent vertebrate adaptive radiation". So, we are talking here as much about population genetics as we are about ichthyology.

Inevitably, the genome biologists have been spotted in the vicinity of the lake; and we now have a preliminary report from them:

We characterize [the] genomic diversity by sequencing 134 individuals covering 73 species across all major lineages. Average sequence divergence between species pairs is only 0.1-0.25%. These divergence values overlap diversity within species, with 82% of heterozygosity shared between species. Phylogenetic analyses suggest that diversification initially proceeded by serial branching from a generalist Astatotilapia-like ancestor. However, no single species tree adequately represents all species relationships, with evidence for substantial gene flow at multiple times.

The last sentence seems to be somewhat disingenuous. How could a single tree be expected to describe this scale of biodiversity? Any rapid radiation of diversity is unlikely to be completely tree-like. The increase in diversity can be modeled as a tree, sure, but it is very unlikely that there will be instant separation of the taxa, and so the tree model will be ignoring a large part of the evolutionary action. There will, for example, be ongoing introgression between the diverging taxa, as well as hybridization due to incomplete breeding barriers. These avenues for gene flow can best be modeled as a network, not a tree.

The issue here is that the authors write the paper solely from the perspective of an expected phylogenetic tree, and then feel compelled to explain why they do not produce such a tree. Indeed, the authors present their paper as a study of "violations of the species tree concept".

For data analysis, they proceed as follows:

To obtain a first estimate of between-species relationships we divided the genome into 2543 non-overlapping windows, each comprising 8000 SNPs (average size: 274kb), and constructed a Maximum Likelihood (ML) phylogeny separately for each window, obtaining trees with 2542 different topologies.

So, only two sequence blocks produced the same tree, presumably by random chance. An example "tree" for 12 OTUs is shown in the diagram. It superimposes a possible mitochondrial tree on a summary of the "genome tree".
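
The arithmetic behind that remark is simple counting: if 2543 windows yield 2542 distinct topologies, then exactly one topology occurs twice. A minimal sketch of such a tally follows, assuming that each window tree has already been reduced to a canonical string, so that equal topologies compare as equal. The labels below are placeholders, not the authors' data, and `summarize_topologies` is a hypothetical helper.

```python
from collections import Counter


def summarize_topologies(topologies):
    """Count distinct topologies and report those found in more than one window."""
    counts = Counter(topologies)
    shared = {topology: n for topology, n in counts.items() if n > 1}
    return len(counts), shared


# Placeholder labels standing in for canonical tree strings (an assumption):
# 2542 different topologies, plus one window repeating the first of them.
n_topologies = 2542
window_trees = [f"topology_{i}" for i in range(n_topologies)] + ["topology_0"]

distinct, shared = summarize_topologies(window_trees)
print(len(window_trees), distinct, shared)  # 2543 2542 {'topology_0': 2}
```

With real data, the hard part is the step assumed away here: putting each tree into a canonical form before comparing.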

The authors continue:

The fact that we are using over 25 million variable sites suggests these differences are not due to sampling noise, but reflect conflicting biological signals in the data. For example, gene flow after the initial separation of species can distort the overall phylogeny and lead to intermediate placement of admixed taxa in the tree topology.

Note that gene flow is seen to "distort" the phylogeny rather than being an integral part of it. In this case, "phylogeny" apparently refers solely to the diversification part of evolutionary history, rather than to the whole history.

The ultimate questions from this paper are: "what is a species concept?", and "what is a species tree?". The authors write a lot about species and trees, and yet their data provide very clear evidence that both "species" and "tree" are very restrictive concepts for studying the cichlids of Lake Malawi.

The authors describe their work, on the same fish group but in a lake further north-west, as follows:

Because of the rapid lineage formation in these groups, and occasional gene flow between the participating species, it is often difficult to reconstruct the phylogenetic history of species that underwent an adaptive radiation. In this study, we present a novel approach for species-tree estimation in rapidly diversifying lineages, where introgression is known to occur, and apply it to a multimarker data set containing up to 16 specimens per species for a set of 45 species of East African cichlid fishes (522 individuals in total), with a main focus on the cichlid species flock of Lake Tanganyika. We first identified, using age distributions of most recent common ancestors in individual gene trees, those lineages in our data set that show strong signatures of past introgression ... We then applied the multispecies coalescent model to estimate the species tree of Lake Tanganyika cichlids, but excluded the lineages involved in these introgression events, as the multispecies coalescent model does not incorporate introgression. This resulted in a robust species tree.

Tuesday, June 13, 2017

Over the years, a number of methods have been explored for constructing evolutionary networks, starting with parsimony criteria for optimization, and moving on to likelihood-based inference. However, the development of Bayesian methods has been somewhat delayed by the computational complexities involved.

The method described requires the prior specification of the species tree (phylogeny), and the position and number of the reticulation events. The algorithm was implemented in the R language.

More recently, methods have been developed that infer phylogenies by using (i) incomplete lineage sorting (ILS) to model gene-tree incongruence arising from vertical inheritance, and (ii) introgression / hybridization to model gene-tree incongruence attributable to horizontal gene flow. ILS has been addressed using the multispecies coalescent.

The algorithm for the first method has been implemented in the PhyloNet package, while the second has been implemented in the Beast2 package.

Finally, another manuscript describes a method utilizing data based on single nucleotide polymorphisms (SNPs) and/or amplified fragment length polymorphisms (AFLPs), which thus sidesteps the assumption of no recombination.

Due to the computational complexity of likelihood inference, all of these methods are currently severely restricted in the number of OTUs that can be analyzed, irrespective of whether these involve multiple samples from the same species or not. In this sense, parsimony-based inference or approximate likelihood methods are still useful for constructing evolutionary networks of any size. However, progress is clearly being made to alleviate the computational restrictions.

Tuesday, June 6, 2017

It has traditionally been assumed that speciation occurs when gene flow between populations ceases. However, nothing in biology ever remains simple — the more we study any biological phenomenon the more complex it becomes. So, speciation with gene flow is becoming a more commonly discussed topic. This is especially so with the advent of genome sequencing, which allows us to study the extent of gene flow in the past, rather than solely in the present.

This paper considers the evolutionary relationships among seven species of bears, with multiple genome samples from four of those species. The coalescent species tree (based on 18,621 genome fragments > 25 kb), which accounts for incomplete lineage sorting (ILS), is well supported, as shown here.

However, numerous individual genome-fragment trees support alternative topologies. For example, 38% of the trees support a topology where the Asiatic black bear is the sister to the American black - Brown - Polar bear clade. This suggests that there is more than simply ILS that creates the conflicting genome trees.
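
To see why gene-tree frequencies can hint at something beyond ILS, it helps to recall the simplest textbook case: three species under the multispecies coalescent. With an internal branch of length T (in coalescent units), the gene tree matches the species tree with probability 1 - (2/3)exp(-T), and each of the two discordant topologies has probability (1/3)exp(-T). The sketch below computes these values. It is not the analysis used in the paper, and the function name and tree labels are hypothetical.

```python
import math


def triplet_gene_tree_probs(internal_branch):
    """Gene-tree topology probabilities for the species tree ((A,B),C)
    under the multispecies coalescent with ILS only (no gene flow).

    internal_branch: length of the internal branch in coalescent units.
    """
    if internal_branch < 0:
        raise ValueError("branch length must be non-negative")
    # Probability that the A and B lineages fail to coalesce on the internal branch.
    p_no_coalescence = math.exp(-internal_branch)
    # If they fail, all three topologies are equally likely in the ancestor.
    discordant = p_no_coalescence / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


for branch in (0.0, 0.5, 2.0):
    print(branch, triplet_gene_tree_probs(branch))
```

Under ILS alone, the two discordant topologies are therefore expected to be equally frequent, and neither can be more frequent than the concordant one. A marked asymmetry between them is one kind of pattern that is usually read as a sign of gene flow.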

The authors applied several different data analyses to investigate the possibility of gene flow among the species. They found considerable evidence for gene flow, as shown in the network (the arrow colors represent different analyses).

Indeed, each of the six in-group species could conceivably be connected by gene flow to each of the other five species. The network shows evidence that the Brown, Asiatic and Sloth bears might have all five connections, while the Polar and Sun bears have four, and the American bear has three.

As the authors note, some of this potential gene flow cannot have occurred directly between species, because they live in different habitats. Instead, it may be remnants of ancestral gene flow, or gene flow through a vector species. In particular, the strongest signal of gene flow connects the Asiatic black bear with the ancestor of the American black - Brown - Polar bear clade.

Ancestral gene flow is of considerable importance when studying evolution. Charles Darwin was perhaps the first to note (in his notebooks) that we should always treat ancestors as species, not as taxonomic groups, no matter how big the groups of descendants now are. Whole kingdoms and phyla were once a single species, if the contemporary groups are monophyletic.