Tuesday, March 28, 2017

Alignments have been discussed quite a few times
in this blog. They are so extremely common in molecular biology that I
doubt that there are any debates about their usefulness, apart from
certain attempts to improve the modelling, especially in cases of
non-colinear patterns (Kehr et al. 2014), or to speed up computation (Mathura and Adlakha 2016).
In linguistics, on the other hand, alignments are rarely used, although
initial attempts to arrange homologous words in a matrix go back to the
early 20th century, as you can see from this example taken from Dixon and Kroeber (1919: 61):

Early alignment from Dixon and Kroeber (1919)

This example is rather difficult to read for those not familiar with
the annotation. The authors group homologous words across different indigenous languages from California. The group labels of the languages
under investigation are given in abbreviated form at the very left of
the matrix, and the actual varieties are listed in the next column. What
follows is the actual alignment, along with comments in the last column. Regarding the alignments, the authors note on page 55:

A number of sets
of cognates have been taken from their numbered place in this list
and put at the end to allow of their being printed in columnar form,
with a view to bringing out parallelisms that otherwise might fail to
impress without detailed analysis and discussion. (Dixon and Kroeber 1919: 55)

In my opinion, this expresses nicely why alignments should
be used more often in linguistics — due to the
problem that our "alphabets" (the sound systems of languages) are
undergoing constant change (see this
earlier post for details regarding this claim), we need to infer both
the scoring function between different sounds across different
languages, and the alignment at the same time. If we look at the
similarities the authors spotted, it should become obvious what I mean.

I am not yet sure how to interpret the data exactly, but if I am not
mistaken, the authors claim that each of the columns contains
homologous material. So, they find a similarity between kaha in the first row (the language is Northern Wintun, according to the key to abbreviations in the book), and tu
in the last row (Monterey Costanoan). The last column shows suffixes,
which I think the authors exclude from their analysis, but I could not
find additional information confirming this in their book.

The comment
column illustrates another problem of representation, namely that the
authors do not know how to handle cases of metathesis (or
transpositions) consistently. The transposition of the parts of words is
a process that is quite frequent in language evolution. It is very
frequent in compounds consisting of modifier and modified, such as milk coffee in English, where milk modifies the coffee, while French, for example, puts the modifier after the main noun, expressing this as café au lait.

Nowadays, we can handle these cases consistently in linguistics, both
in our data annotation and in the alignments, and we can even search for
the structures automatically (see List et al. 2016). One hundred years ago, when Dixon and Kroeber worked out their comparison
of the languages in California, they were pioneers who tried to increase
the transparency of our discipline, and it is clear that their
solutions are not completely satisfying from today's perspective.

It is extremely surprising to me that, despite these early attempts to make our
homology judgments in linguistics more transparent, phonetic alignments are
still rarely used by historical linguists. Indeed, the majority of
them even think that aligning words is a waste of time, or only useful for the purpose of
teaching.

I was reminded of this when I looked at a recent proposal by Bengtson (2017, see also this blog
for details) for deep genetic connections between Basque and North
Caucasian languages. Note that the Basque language is traditionally
considered an isolate, i.e. a language whose nearest relatives
we cannot find among the languages of the world. Many linguists have
attempted to solve this puzzle by proposing various hypotheses (see Forni 2013
for an example of attempting to link Basque with Indo-European).
Bengtson proposes various types of evidence, which I cannot really
judge, as I do not know the languages under comparison, but finally, he
also shows a list with potential homologs between Basque and North Caucasian
varieties, which you find below.

Potential homologs between Basque and North Caucasian languages (Bengtson 2017)

If you are not a trained historical linguist, and thus do not know what
to do with this table, be assured that many historical linguists will
feel similarly. As a rough explanation: the concepts are supposed to be
very, very stable, being drawn from Sergey Yakhontov's list of 35 ultra-stable concepts,
and I think that all words in one row are supposed to be etymologically
related — that is, they should be potential homologs across all of the
languages. If word forms are preceded by the asterisk symbol (*),
this means that they are reconstructed, i.e. not reflected in written
sources. But that is all I can tell you for the moment. Where I
should start the comparison between the words remains a mystery for me,
as I do not know which parts are supposed to be similar. Alignments
would help us to see immediately where the author thinks that
the historical similarities can be found; that is, we would see which parts of the
words are supposed to be homologous.
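
For readers who have never seen an alignment being computed, here is a minimal sketch in Python of the classic Needleman-Wunsch procedure for two sequences of sound segments. It is only an illustration: the two words are made up (they are not taken from Bengtson's table), and the scores for matches, mismatches and gaps are arbitrary assumptions, whereas a real phonetic alignment would need a scoring function that reflects the sound correspondences between the languages.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences of segments (Needleman-Wunsch).

    The match/mismatch/gap values are toy assumptions, not a
    linguistically informed scoring function.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two segments
                score[i - 1][j] + gap,      # gap in the second word
                score[i][j - 1] + gap,      # gap in the first word
            )

    # Trace back from the bottom-right cell to recover one optimal alignment
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return row_a[::-1], row_b[::-1], score[n][m]


# Two invented word forms, split into segments
top, bottom, total = align(list("kata"), list("kta"))
print(" ".join(top))
print(" ".join(bottom))
```

The two printed rows have the same length, and each column pairs the segments that the procedure treats as corresponding, with "-" marking a gap. This is exactly the information that a plain list of word forms does not give the reader.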

At this point in the post, I originally planned to provide you with an alignment of
Bengtson's table, in order to illustrate the benefits of alignment in
linguistics. Unfortunately, I had to admit to myself that I cannot do this, as I
simply do not know where to align the words (apart from some rare trivial
cases in the table).

I really hope that this will change in the future. Too
often, our hypotheses in linguistics suffer from insufficient transparency with
regard to the "proofs" and the evidence. I agree that it is very difficult to
come up with good alignments in linguistics, especially if one regards cases of
metathesis, unrelated parts, and general uncertainty. However, instead of giving in
to the problem, we should follow the pioneering work of Dixon and Kroeber, and
try to improve the way we present our data to both our colleagues and a broader
public.

Theories such as the link between Basque and the North Caucasian languages are
usually highly disputed in historical linguistics, and I do not know of any
long range proposal that has gained broad acceptance during the last 50 years.
Yet, maybe this is not because the proposals are not valid,
but simply because those who are proposing these theories have failed to present their
findings in a transparent and testable way.

Tuesday, March 21, 2017

I have written before about the Phylogenetics of computer viruses. This is an example of the use of phylogenetics as a metaphor for the history of non-biological objects. By analogy, computer viruses and other malware can be seen to be phylogenetically related, because new viruses are usually generated using existing malicious computer code — that is, one virus "begets" another virus due to changes in its intrinsic attributes. In this sense the analogy is helpful, although there is no actual copying of anything resembling a genome — this is phenotype evolution not genotype evolution.

Furthermore, the model of historical change in computer viruses is often the same as that for biological viruses — recombination rather than substitution. That is, like real viruses, new computer viruses are often created by recombining chunks of functional information from pre-existing viruses, rather than by an accumulation of small changes. Coherent subsets of the current computer code are combined to form the new programs.

From this perspective, it is unexpected that the principal phylogenetic model in the study of computer viruses has been a tree rather than a network — a recombinational history requires a network representation, not a tree, and thus malware evolution is not tree-like. As noted by Liu et al. (2016): "Although tree-based models are the mainstream direction, they are not suited to represent the reticulation events which have happened in malware generation."
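
The difference between the two representations can be made concrete with a small Python sketch. It assumes that the history is given as a list of directed (parent, child) edges, and it simply reports the nodes that have more than one parent; the lineage names are hypothetical and are not taken from any of the cited papers.

```python
from collections import defaultdict


def reticulation_nodes(edges):
    """Return, in sorted order, the nodes that have more than one parent."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return sorted(node for node, ps in parents.items() if len(ps) > 1)


def has_single_parents_only(edges):
    """True if no node has more than one parent, as required in a tree."""
    return not reticulation_nodes(edges)


# Hypothetical malware lineages: D reuses code from both B and C
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(reticulation_nodes(edges))
print(has_single_parents_only(edges))
```

Any history in which a program combines code from two predecessors produces at least one such node, and a tree has no way to draw it without dropping one of the two parents.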

In my previous (2014) post, I noted only two known papers that used a network rather than a tree to represent malware evolution:

Goldberg et al. (1996) analyzed their data using what they call a phyloDAG, which is a directed network that can have multiple roots (it appears to be a type of minimum-spanning network; described in more detail in Phylogenetics of computer viruses);

Khoo & Lió (2011) used splits graphs rather than unrooted trees to display their data, although they did not specify the algorithm for producing their networks.

Unfortunately, malware researchers have continued to pursue the idea that a phylogeny is simply a form of classification, and have therefore stuck to the idea of producing a tree-like phylogeny using some form of hierarchical agglomerative clustering algorithm (e.g. Bernardi et al. 2016).

More positively, however, some papers have appeared that have instead pursued the idea of using a network model rather than a tree:

Liu et al. (2016) provided median-joining networks, which are unrooted splits graphs, to display relationships within each of three different virus groups;

Jang et al. (2013) inferred a directed acyclic graph using a minimum spanning tree algorithm, with a post-processing step to allow nodes to have multiple parents;

Anderson et al. (2014) presented a novel algorithm based on a graphical lasso, which builds the phylogeny as an undirected graph, to which directionality is then added using a post-hoc heuristic;

Oyen et al. (2016) "present a novel Bayesian network discovery algorithm for learning a DAG [directed acyclic graph] via statistical inference of conditional dependencies from observed data with an informative prior on the partial ordering of variables. Our approach leverages the information on edge direction that a human can provide and the edge presence inference which data can provide."

It is important to note that only the works producing directed graphs can represent a phylogeny; the other works produce unrooted graphs that may or may not reflect phylogenetic history. The Bayesian work of Oyen et al. (2016) is particularly interesting:

Directionality is inferred by the learning process, but in many cases it is difficult to infer, therefore prior information is included about the edge directions, either from human experts or a simple heuristic. This paper introduces a novel approach to combining human knowledge about the ordering of variables into a statistical learning algorithm for Bayesian structure discovery. The learning algorithm with our prior combines the complementary benefits of using statistical data to infer dependencies while leveraging human knowledge about the direction of dependencies.

Tuesday, March 14, 2017

There has been considerable interest in recent years in developing methods that will detect hybridization in the presence of incomplete lineage sorting (ILS), which will allow the construction of a realistic hybridization network. Clearly, both ILS and hybridization create conflicting gene trees, which will lead to a very complex data-display network. However, if the ILS signals in the data can be used to construct a small collection of gene-tree groups, in which the gene trees within each group are congruent with a single species tree (under the ILS model), then the incongruence between groups can be used to construct a hybridization network. This network will then be a hypothesis for a realistic evolutionary network.

Recently, a paper has appeared that uses simulations to evaluate several of these methods:

I am not a great fan of
simulations, because they exist under very restricted and usually unrealistic mathematical conditions. They are, however, useful for exploring the mathematical properties of various methods, even if they are hard to connect to the biological properties.

My interpretations of the results from the particular scenarios explored by Kamneva and Rosenberg are:

Most of the methods improve as the internal network edges increase in length.

Most of the methods improve as the number of gene trees increases.

Under good conditions the maximum-likelihood methods do better than the parsimony and consensus methods.

The maximum-likelihood methods are more affected by gene-tree error than are the other methods.

There are conditions under which none of the methods work well.

I doubt that any of this is controversial, in the sense that
model-based methods usually work well when their models apply, but not
necessarily otherwise. Reality is more complex than the models, and so the methods
are likely to fail for real data.

For me, the most interesting part of the paper is the examination of balanced versus skewed
parental contributions to the hybrid taxon. A balanced genetic contribution in the simulations is analogous to homoploid or polyploid
hybridization, whereas a skewed contribution is analogous
to introgression or horizontal gene transfer (HGT). The simulations seem to show that
the methods examined do not deal very well with skewed contributions.
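
To see why a skewed contribution is harder to detect, consider the following rough Python sketch. It is not the simulation design used in the paper; it only assumes that each locus of the hybrid is inherited from parent A with some probability, here called gamma (a name chosen for this illustration), and from parent B otherwise, and it counts how many loci end up supporting each parent.

```python
import random


def simulate_contributions(n_loci, gamma, seed=0):
    """Count loci inherited from parent A and from parent B.

    gamma is the assumed probability that a locus comes from parent A.
    The seed makes the toy simulation repeatable.
    """
    rng = random.Random(seed)
    from_a = sum(1 for _ in range(n_loci) if rng.random() < gamma)
    return from_a, n_loci - from_a


# Balanced contribution, analogous to hybridization
print(simulate_contributions(1000, 0.5))

# Skewed contribution, analogous to introgression or HGT
print(simulate_contributions(1000, 0.1))
```

With a skewed contribution only a small minority of the loci carry the signal of the minor parent, so a method has far fewer gene trees from which to recognize the reticulation, and those few are easier to confuse with ILS or gene-tree error.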

So, these methods may literally be hybridization-network methods only, with separate network methods needed
for detecting introgression or HGT — for example, the admixture methods used for genomes (see the recent post on Producing admixture graphs).

This would mean that we cannot first produce networks with reticulations, and then afterwards explore what is causing the reticulations. Instead, we will need to decide on the possible biological mechanisms of reticulation before the analysis, and then mathematically explore possible networks that reflect those mechanisms.

This is not an issue for constructing trees, of course, since the only recognized mechanisms are speciation and extinction, both of which are explored post hoc rather than a priori. This is an important difference between networks and trees.

Most of the early representations of pedigrees had the people's names enclosed in a circle, called a "roundel", and it was these roundels that were connected to show the family relationships. One of the steps on the way to a tree was thus dropping this idea, so that the names could be connected directly.

Interestingly, the earliest pedigrees that do not have roundels also date from this early period. As noted by Nathaniel Lane Taylor, the importance of this development is that: "the scribe relies on the power of the names themselves to anchor a diagram on the page, with lines simply taking the place of any syntax needed to describe the filiation." That is, no abstract iconography is needed.

Taylor provides links to illustrations of the next known example:
c. 1128, John of Worcester, Chronicle of World and English History (Corpus Christi College MS 157).
This book contains eight genealogies of Anglo-Saxon and Norman kings (pp. 47-54), one of which is shown above.

Taylor also refers to "one of the Arabic stemmata" illustrated in:
Arthur Watson (1934) The Early Iconography of the Tree of Jesse. Oxford University Press.
I have not seen this book, but the illustrations are apparently confined to those from the 12th century, making the diagram contemporaneous with the two listed above. The Tree of Jesse normally appears in Medieval Christian art as a richly illustrated genealogy of Jesus in illuminated manuscripts, but apparently this one was an exception.