Tuesday, June 21, 2016

Alignments and phylogenetic reconstruction in linguistics and biology

In an article from 2009 (Morrison 2009), David discusses
the question of why phylogeneticists would "ignore computerized sequence
alignment". This article caught my attention for two reasons:
First, the article provides some interesting statistics regarding the degree
to which biologists manually adjust the alignments that were automatically
produced by software. Second, the article points to the seemingly strange
situation in biology in which tree-building is considered to be a task that can
be entirely carried out by machines, while the majority of scholars would not
trust their final sequence alignments to a computer (Morrison 2009: 150).

One major difference between biology and linguistics is the selection of
comparanda. Biological methods usually derive phylogenetic trees from
multiply aligned sequences. Linguistic methods derive trees from sets of
homologous words (cognate sets) distributed across languages. The evolution
of these sets is modeled as a process of word gain and word loss (similar to
gene-family gain-loss studies in biology). While biologists fiddle with their
alignments, linguists fiddle with their cognate sets. Cognate identification is
exclusively done manually at the moment, and scholars use all kinds of
information about word relations that they can get, be it etymological
dictionaries, which have been published for more than 200 years, or the
intuition of the expert who is annotating the data for cognacy.
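To make the gain-loss view concrete, here is a minimal sketch of how manually annotated cognate sets might be turned into the presence/absence matrix that tree-building software typically takes as input. The language names, the cognate set labels, and the function name `presence_absence` are hypothetical and serve only as an illustration; they are not expert judgments and not the format of any particular tool.

```python
def presence_absence(judgments):
    """Convert (language, cognate_set) pairs into a binary matrix.

    Each column is one cognate set; a 1 means the language has a word
    belonging to that set, a 0 means it does not.
    """
    languages = sorted({lang for lang, _ in judgments})
    cognate_sets = sorted({cog for _, cog in judgments})
    present = set(judgments)
    matrix = {
        lang: [1 if (lang, cog) in present else 0 for cog in cognate_sets]
        for lang in languages
    }
    return cognate_sets, matrix


# Hypothetical annotations: three languages, two concepts.
judgments = [
    ("LangA", "hand-1"), ("LangB", "hand-1"), ("LangC", "hand-2"),
    ("LangA", "sun-1"), ("LangB", "sun-1"), ("LangC", "sun-1"),
]

cognate_sets, matrix = presence_absence(judgments)
for lang, row in matrix.items():
    print(lang, row)
```

Every decision the expert makes about cognacy ends up as a single 1 or 0 in such a matrix, so the tree can only be as good as those decisions.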

Identification of cognate sets in linguistics is essentially a task of
sequence comparison (List 2014), and algorithmic as well as
manual procedures involve the multiple and the pairwise alignment of words
(even if it is done only implicitly by human experts). Compared to biology,
sequence comparison in historical linguistics is complicated by two factors:

First, alphabets (phoneme systems) in linguistics are themselves mutable (Geisler and List 2013),
so that when aligning two words we need to find both a mapping between the
two alphabets, translating one alphabet into the other, and a scoring
function by which we can score the alignment.

Second, regular sound change (the process by which the phoneme system is
changed) and sporadic sound change (the process by which a sound is
sporadically assimilated, lost, or added) are not the only processes
that contribute to the change of words in the lexicon: morphological change
(by which whole blocks of meaningful parts of a word are re-arranged,
exchanged, lost, or added) yields patterns that are essentially unalignable.
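The first factor can be sketched in code. The following is a minimal global alignment in the style of Needleman-Wunsch (a standard technique, not necessarily the method used in the works cited here), in which the scoring function consults a table of sound correspondences between the two languages before falling back on plain identity. The tokenization of the words, the correspondence table, and all score values are assumptions made for illustration only.

```python
def make_scorer(correspondences, match=2, mismatch=-1):
    """Build a scoring function from a table of sound correspondences.

    The table plays the role of the mapping between the two alphabets:
    pairs listed in it receive their own score even if the symbols differ.
    """
    def score(a, b):
        if (a, b) in correspondences:
            return correspondences[(a, b)]
        return match if a == b else mismatch
    return score


def align(seq_a, seq_b, score, gap=-1):
    """Globally align two sound sequences; returns both rows and the score."""
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])):
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return row_a[::-1], row_b[::-1], dp[n][m]


# Illustrative assumption: treat Italian [o] and Swedish [uː] as corresponding.
scorer = make_scorer({("o", "uː"): 1})
italian = ["s", "o", "l", "e"]   # sole, tokenized by hand
swedish = ["s", "uː", "l"]       # sol, tokenized by hand

row_it, row_sv, total = align(italian, swedish, scorer)
print(" ".join(row_it))
print(" ".join(row_sv))
print(total)
```

The algorithm itself is the easy part. The correspondence table is what has to be discovered anew for every language pair, and nothing in this sketch can represent the second factor, morphological change.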

The problem of finding the correct mapping between two alphabets in linguistics
is further exacerbated by language contact: If languages exchange words on a
large scale, then this may have a huge impact on the system of the languages, and it
may even introduce new sounds to a language that were not there before (thanks
to English, German now has the sound [dʒ], as in journalist or job). If
borrowing is frequent enough, it may become close to impossible to judge from
comparing the words alone whether two words in different languages have been inherited directly (vertically) from an ancestral language or transferred laterally.

As a result, it is understandable that linguists often
refuse to carry out full alignments of the words in their data. An
alignment itself does not necessarily tell us much, since many of the processes that an expert infers when comparing language data
cannot be represented in an alignment.

As an example, let us consider the word for "sun" in six Indo-European
languages. Since "sun" is a very basic concept, probably fundamental for all
human cultures, experts assume that this word was present as *séh₂u̯el- in
Indo-European (an asterisk indicates that the word is not reflected in written
sources), and that it was retained as Russian солнце [sɔnʦə], Polish słońce
[swɔnʲʦɛ], French soleil [sɔlɛj], Italian sole [sole], German Sonne
[sɔnə], and Swedish sol [suːl] (Wodtko et al. 2008). An
obvious alignment, reflecting the surface similarity between all of these words, would be the following one (taken from List 2014: 135):

Alignment based on sequence similarity.

This alignment, however, is by no means correct. Russian [sɔnʦə] and Polish
[swɔnʲʦɛ], for example, share a common suffix, which is reflected as [nʦə] in
Russian and as [nʲʦɛ] in Polish, and which was innovated in the common
ancestor of Russian and Polish, but is not present in any of the four other
languages. So the [n] in German [sɔnə] is essentially not homologous with the
[n] in Russian or the [nʲ] in Polish. The same applies to the [ɛj] in French
[sɔlɛj] which reflects a diminutive suffix in Latin sol-iculus "small sun",
the regular ancestor form of French soleil. Furthermore, the [w] in the
Polish word regularly corresponds to the [l] in French, Italian, and Swedish,
but it reflects a swap (metathesis) in the order of the vowel and the consonant
in Polish: [sɔl] became [slɔ], which became [swɔ].

Taking all of this (and more) into
account, we need to modify our alignment to account more closely for the
processes that experts have inferred from intensive language comparison, as
shown in the next figure (taken from List 2014: 135). In this alignment, the swap in Polish is indicated by the white font of the sounds involved, and gray-shaded columns are meant to reflect the oldest layer of homology.

Historically informed alignment.

However, even this alignment is essentially misleading. The Indo-European word for
"sun" supposedly had a complex paradigm in which the word's stem
alternated between the nominative (and accusative) case and the other (oblique)
cases. So, nominative and accusative used the stem *sóh₂u̯el-, while the
other cases used the stem *sh₂én-. The Russian, Polish, French,
Italian, and Swedish forms go back to the former, while the German form goes back
to the latter. This is possible because it is further assumed (or can be assumed) that the
alternation was still preserved in the ancestor of Swedish and German.

This means, however, that our alignment above shrinks to an alignment in which only
the first sound, the [s], is still reflected in all languages! The following
graphic (taken from List 2016) illustrates the
processes that led to the current situation for four of our six languages:

Morphological processes of lexical change.

What does this example tell us? On the one hand, it gives some explanation for why
linguists do not really want to align words (although the first alignments go back
to the early 20th century, cf. Dixon and Kroeber 1919). It
also explains why classical linguists have a very sceptical attitude towards
the computerization of word comparisons, based on the (partially justified)
assumption that computers cannot handle the complex patterns that are so
characteristic of language change.

On the other hand, comparing the situation
with biology as reported in Morrison (2009), we can find an interesting
parallel between the two disciplines: neither linguists nor biologists
really trust machines to compare their sequences (albeit at different
levels of analysis), but they do not seem to have many problems in trusting
machines to reconstruct their trees.

However, this last point in particular, the fact
that we trust machines to grow our trees while we distrust them to prepare the
seeds, should ring an alarm bell. First, we seem to lack clear guidelines
(at least in linguistics) regarding the way the manual adjustment (of alignments in biology and cognate
sets in linguistics) should be carried out, which has a clear impact on
repeatability. Second, if we have processes in both fields that yield
essentially unalignable patterns, such as duplications and other molecular
processes in biology (Morrison 2009: 156), and morphological processes in
linguistics, how can we assume that a phylogenetic tree analysis can
sufficiently cope with them, even if we manually adjust everything?