Monday, June 29, 2015

I have noted before that common usage of expressions like "family tree" often extends far beyond actual pedigrees. This particular expression is often used to describe any sort of historical relationship, not just genealogical ones. It is also sometimes used simply to describe any sort of personal inter-connection. All of these usages occurred in a short-lived magazine from 25 years ago called Wigwag.

Wigwag magazine formally debuted in October 1989 (after a test issue in 1988), and published its last issue in February 1991, for a total of 15 issues. It was a sort of cozy version of the New Yorker magazine. Like the New Yorker, it had a number of regular features, such as the Road Trip, the Map, and Letters From Home. The one that is of interest to us was called The Family Tree.

This feature mapped cultural relationships, having been described as "a field guide to the genealogy of influence in American life". It included human relationships, but it also included things like cars (the tree of which is reproduced in the book by Nobuhiro Minaka & Kunihiko Sugiyama. 2012. Phylogeny Mandala: Chain, Tree, and Network) and comic-book superheroes.

I have been unable to locate any decent copies, but four of the "trees" are included below.

As you can see, sometimes The Family Tree was actually a genealogical tree, but just as often it was simply a network of pairwise cultural connections. The latter, of course, usually formed a complex network that did not really map historical relationships.

This last Family Tree is from the original trial issue, and shows the inter-relationships of the writers and producers of American TV sitcoms.

The tree is based on mitochondrial genome data for the highlighted fossil, compared to the mitochondrial sequences of modern-day dogs and wolves, as well as ancient canids. The use of a phylogenetic tree seems to be based on the idea that mitochondria consist of tightly linked genes that are uniparentally inherited. However, neither of these characteristics is universal, and so a network might be more appropriate.

The dog genealogy is recognized as being characterized by introgression with wolves, as the authors themselves note. Also, the origin of dogs is not directly from wolf ancestors, but both modern wolves and modern dogs are derived from a common ancestor. For example, this next diagram is from:

The width of each population branch is proportional to inferred population size. Note that wolves and dogs originated at roughly the same time, as the result of bottlenecks in the ancestral population size. Wolves diversified slightly earlier than dogs. Also, Skoglund et al. dispute the dating of the splits, suggesting that the dog-wolf divergence was "at least 27,000 years ago".

As a final note, there is a tendency to credit Charles Darwin with originating just about everything in the study of genealogy, although he was a synthesizer as much as an innovator. For example, David Grimm suggests (Dawn of the dog. Science 348: 274-279):

Charles Darwin fired the first shot in the dog wars. Writing in 1868 in The Variation of Animals and Plants under Domestication, he wondered whether dogs had evolved from a single species or from an unusual mating, perhaps between a wolf and a jackal.

However, the first hypothesized genealogy was actually published more than a century earlier, by Georges-Louis Leclerc, comte de Buffon (see the blog post on The first phylogenetic network), who suggested a common origin with wolves.

Monday, June 22, 2015

We all know how painful it is to deal with computer login passwords. Computer administrators keep telling us to have "secure" passwords, and to not reuse them, but of course we ignore this advice. Who can remember all of these passwords anyway? So, we keep them simple, and we reuse them.

The SplashData group, which markets what they call a "secure password and record management solution", provide an annual list of the 25 most common passwords found on the Internet. These are compiled from leaked passwords posted online by hackers. I have looked at the lists for 2011, 2012, 2013 and 2014.

As usual, I have used a phylogenetic network as a form of exploratory data analysis. I first used the Steinhaus similarity to calculate the pairwise similarity of the 43 passwords that appear. This similarity ignores what are called "negative matches", which is important because most of the passwords do not appear in the lists for all four years. This was followed by a Neighbor-net analysis to display the between-word similarities as a phylogenetic network. So, passwords that are closely connected in the network are similar to each other based on their popularity across the four years, and those that are further apart are progressively more different from each other. Those passwords that are in the top 25 for all four years are marked in red.
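For readers who want to see what the similarity calculation involves, here is a minimal sketch in Python. It assumes that each password is represented by one popularity score per year (0 when the password is absent from that year's list); the scoring scheme, the names pw_a to pw_c and the numbers themselves are made up for illustration, and are not the data used for the network.

```python
def steinhaus_similarity(x, y):
    """Steinhaus similarity: 2 * sum(min(x_i, y_i)) / (sum(x) + sum(y)).

    Years in which both scores are 0 ("negative matches") add nothing
    to either the numerator or the denominator, so they are ignored.
    """
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    shared = sum(min(a, b) for a, b in zip(x, y))
    total = sum(x) + sum(y)
    return 2.0 * shared / total if total else 0.0


# Hypothetical scores for four years (0 = not in that year's list).
scores = {
    "pw_a": [25, 25, 24, 24],
    "pw_b": [24, 24, 25, 25],
    "pw_c": [0, 0, 10, 12],
}

# Pairwise similarity matrix, stored as a dictionary keyed by name pairs.
names = sorted(scores)
matrix = {
    (p, q): steinhaus_similarity(scores[p], scores[q])
    for p in names
    for q in names
}

for (p, q), value in sorted(matrix.items()):
    if p < q:
        print(p, q, round(value, 3))
```

A matrix like this (converted to distances as 1 minus the similarity) is the sort of input that a Neighbor-net program expects.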

You will note the similarity among many of these passwords. They are mostly simple combinations of numbers, words, or a row of keys on the standard English keyboard. Obviously, these are not secure passwords.

The number one and number two passwords for all four years were "password" and "123456", with "12345678" right behind. Oddly, there has been a distinct increase in "1234", "12345" and "123456789" over the years; they are grouped at the bottom right of the network. The passwords grouped at the bottom left have decreased in popularity through time.

Clearly, many people do not take login security very seriously. However, the problem also comes from the fact that system administrators fob the job of security off on the users. There have been many discussions of the lunacy of asking users to use unique "secure" passwords for each and every system (eg. Robert McMillan, of Wired magazine: Do you really need a password you can barely remember?). Indeed, Mat Honan, also writing at Wired magazine, has pointed out that even secure passwords are out of place in the Internet world (Kill the password: why a string of characters can’t protect us anymore). It will be interesting to see what happens next.

 Stevens, Peter F. (1994) The Development of Biological Systematics: Antoine-Laurent de Jussieu, Nature, and the Natural System. New York: Columbia University Press.

A conceptual history of classification and phylogenetics, mainly as related to plants. Focuses on the early development of ideas within biosystematics, with accompanying illustrations. Networks are effectively treated as variants of trees.

A richly illustrated history of trees, with a few networks. Focuses on the illustrations, with some accompanying text. The best source to see what people have drawn in the way of trees, but weaker on networks.

In Japanese. Covers the development of the tree metaphor (with a few networks), as related to pedigrees, phylogenies, and knowledge representation in general. The breadth of the topic is indicated in the "mandala" of the title, which is "a generic term for any diagram, chart or geometric pattern that represents the cosmos metaphysically or symbolically".

A conceptual history of trees, with a few networks. Focuses on the development of the ideas, with accompanying illustrations. Starts with pedigrees, and proceeds from there to phylogenies. The best coverage of phylogeny concepts, but explicitly treats networks as "trees with reticulations".

Monday, June 15, 2015

I have noted several times in this blog that it is not just biological organisms that can be considered to have a phylogenetic history. Many human artifacts also do, provided that their history results from diversification from a common ancestor. For example, there are blog posts about the following topics:

All of these can be considered to have a phylogenetic history of shared common ancestors. For instance, manuscript copies do share ancestors — the source manuscripts that have been copied.

However, while all human artifacts have a history, not everything has a phylogenetic history. There can be transformational history, for example, where concepts simply change through time without diversifying. This can be represented by a timeline rather than a phylogeny, as discussed in these blog posts:

There are also situations where artifacts simply cluster together, based on their similarity. This can be represented as a tree-like diagram or a network, but such a tree/network is not a phylogeny, because the clustering does not necessarily have anything to do with common ancestry. Examples discussed in this blog include:

The problem with this latter situation is that we can always mathematically measure the similarity between concepts or objects, and therefore we can always cluster them based on this similarity, even if the clusters have little meaning. I have previously discussed this issue in this blog, noting that if the similarity measure used does not model evolutionary patterns then it cannot be expected to produce a phylogeny (Non-model distances in phylogenetics).
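The point that clustering always "works" can be made with a small sketch in Python. It applies a naive single-linkage clustering (chosen here only because it is short; it is not the method used in any of the studies mentioned) to six hypothetical objects whose single measurement is a random number. A tree comes out regardless, although there is no history behind it.

```python
import random


def cluster(points):
    """Naive single-linkage clustering of {label: number}.

    Returns the result as a nested tuple of labels.
    """
    labels = sorted(points)
    coords = [points[label] for label in labels]
    # Each cluster is a pair: (subtree so far, indices of its members).
    clusters = [(label, [i]) for i, label in enumerate(labels)]

    def dist(a, b):
        # Single linkage: smallest distance between any two members.
        return min(abs(coords[i] - coords[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]][1], clusters[ij[1]][1]),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


random.seed(1)
noise = {name: random.random() for name in "ABCDEF"}  # pure noise
tree = cluster(noise)
print(tree)  # a fully resolved "tree", built from random numbers
```

The output is a perfectly well-formed nested grouping of A to F, which is exactly why the existence of such a diagram says nothing about common ancestry.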

Another case in point is the work of William Shakespeare. Can the plays, for example, be considered to have a phylogeny? Each play certainly has a phylogeny on its own, because the Shakespearean author is well known for having taken the ideas for the plays from previous sources. So, each play has a phylogeny (a reticulate history) based on the historical connections among its sources. However, the plays as a group do not have a phylogeny (not unless they have been plagiarized from each other, anyway). Does Othello really share a common ancestor with King Lear? It certainly has similarities, if only on the basis that it is one of the Tragedies (along with Macbeth, etc). But they are not phylogenetic similarities, and there is no common ancestral Shakespearean play.

As shown by the picture above, this point is not always appreciated. The alleged phylogeny is taken from a press release from the Lawrence Berkeley National Laboratory. The textual similarities among the plays are based on what are called "feature frequency profiles", which have nothing to do with evolutionary history. So, while the data analysis may or may not be helpful for identifying the author(s) of the so-called Shakespearean plays, it is not much help for constructing a phylogeny.

Wednesday, June 10, 2015

It is not easy to define exactly what a language is. We find one
reason for this in the daily use of the word “language” in
non-linguistic contexts. What we call a language does not
depend on purely linguistic criteria. The criteria we normally use
are social and cultural.

If we were to define languages with the help of
linguistic criteria, we would use the degree to which speakers
understand each other; and in most cases, we could draw some line
around areas of what linguists would call “mutual intelligibility”
(similar to the criterion of “interbreedability” in biology). But
mutual intelligibility does not usually serve as the criterion by
which we define languages in everyday situations. For example, we tend
to say that the people from Shanghai, Beijing, and Meixian (all
cities in China) all speak “Chinese”. On the other hand, we think
that people from Scandinavia speak “Norwegian”, “Swedish”,
and “Danish”, although these three are no more different from each other than are the former three.

The table above (taken from List 2014: 11f, with adaptations) gives phonetic transcriptions of translations of the
sentence “The North Wind and the Sun were disputing which was the
stronger” in three Chinese “dialects” (Beijing Chinese, which is also called Mandarin or Standard Chinese, spoken in Beijing and all over the country as a second language; Shanghainese, spoken in Shanghai; and Hakka Chinese,
spoken in Meixian), and three Scandinavian “languages”
(Norwegian, Swedish, and Danish). In the table, I have put all words
that have the same meaning in one column (ie. I have aligned them semantically). Furthermore, I have highlighted the words which share
a common etymological origin (call them “homologs” or “cognates”)
with a gray background. In red, I have added a more or less literal translation of the respective column.

As the phonetic transcriptions of the sentences show, the Chinese
varieties differ to a similar degree as the Scandinavian ones, if not
to an even greater one. And we find this variation both in the way
the meaning of the sentence is expressed by the choice of words, and
in the degree of etymological similarity between the words. Note, further, that none of the three Chinese
dialects is mutually intelligible with any other of the dialects, while we know from famous TV
series like Broen/Bron
that Danish and Swedish people can often understand each other quite well
(with some effort); and Norwegians and Swedes are mutually intelligible most of the time. Nevertheless, we
address the latter three speech traditions as the three languages
“Norwegian”, “Swedish”, and “Danish”, while we say
that the speech varieties of the people in Shanghai,
Beijing, and
Meixian are merely specific variants of one
and the same “Chinese” language.

Languages as Diasystems

One could say that we are facing just a cultural problem here, not a linguistic one.
So we could say that there are two different ways
of distinguishing languages from dialects. One would be the
linguistic one, which uses mutual intelligibility as a unique
criterion to tell languages from dialects. The other one would be the
cultural definition of languages as, say, “dialects with an army”
(a definition usually attributed to Max Weinreich, the father of Uriel Weinreich).

But this is,
unfortunately, only part of the real story, since the cultural
definition of the boundaries of a language has a direct impact on the
way languages evolve. In societies such as China, for example, a very
large proportion of all speakers is bilingual. Apart from their home
dialect, speakers are also able to speak Standard Chinese (also
called Mandarin Chinese), and they use it to talk to people from
different regions, or to read and to write. So, from a pure
linguistic viewpoint, it is not necessarily useful to break up the
Chinese dialects into distinct languages, since these dialects are
located within a larger speech society that is united by a common
language for written and interdialectal communication.

In order to describe this complex structure of our modern languages,
linguists have proposed the model of the “diasystem”, which is
very common in the discipline of sociolinguistics. This model goes
back to the aforementioned dialectologist Uriel Weinreich (1926–1967) who originally thought of some linguistic construct which would
make it possible to describe different dialects in a uniform way
(Weinreich 1954). According to the modern form of the model, a
language is a complex aggregate of different linguistic systems,
“which coexist and mutually influence each other” (Coseriu 1973:
40, my translation from the German).

An important aspect for
determining a linguistic diasystem is the presence of a “Dachsprache”
(“roof language”). This is a linguistic variety that serves as a
standard for interdialectal communication (Goossens 1973: 11). The
different linguistic varieties (dialects, but also sociolects) that are connected by such a standard constitute the “variety space”
of a language (Oesterreicher 2001). I have tried to illustrate this
in the following figure (taken from List 2014: 13).

As you can see from the figure, there are different “dimensions”
according to which the varieties of a language can differ. The figure
shows three of them. First, there are “diatopic varieties” which
point to the division of a language into different dialects (varying
regarding the place where they are spoken).

Second, there are
“diastratic varieties”, pointing to different social layers in
which the varieties are used. Compare, for example, the language of a
football player with that of a politician, which are similar in their tendency to say nothing in many words (especially after hard defeats
or before unpopular decisions have to be announced to the public), but which differ a lot regarding their choice of words. Third, there are “diaphasic
varieties”, which are varieties depending on the situation in which
people speak. Compare, for example, the way our politician speaks
when giving a speech to the public with the way the same politician speaks when discussing
big politics behind closed doors.

But these three dimensions of
language variation are not all that a diasystem of a language has to
offer! We can further identify different speech habits when looking
at the medium that is used to produce language; and there are
significant differences in many respects when writing or reading
something, or when speaking and listening. This dimension is commonly
called “diamesic” (varying depending on the “medium”).

Last, but not least, we should also note that we do not necessarily speak
and understand the language from only one point in time. Think of modern German kids
in school who are forced to read Goethe's Faust, bitterly
lamenting the old-fashioned style of the language, but think also about
different generations of speakers living in the same speech society.
This last dimension of language variety is usually called the
“diachronic dimension”. The following image tries to summarize
the different dimensions in which the diasystem of a language can
vary.

Diasystematic aspects of language change

Given all of these fancy terms starting with “dia” and ending in “ic”,
one may think that they are a mere intellectual game developed by a
bunch of linguist geeks who are interested in sociology. Why can't we
just forget about all these different kinds of “variation” and
keep on modeling our languages as bags of words? Applying
computational methods from biology will be much easier, and as long
as we use networks once in a while, we are not completely giving
ourselves over to the dark side of the Force, which knows only trees.
Unfortunately, this is not possible, since the diasystematic
structure actually has an impact on the way in which languages
change!

As an example from practice, let me tell you how I tried to buy cigarettes
when I was in China for the first time. At the time, I had just started
to learn Mandarin Chinese, and was really suffering from the
difficulty of the language. But I had searched my dictionary several
times, and looked up all the important words I needed to tell the man
at the kiosk which cigarettes I wanted to have. My choice was
“Marlboro”, since it was the only brand I recognized.

Although I had
only a complete beginner's knowledge of Chinese, I knew, as a
linguist, that the language is peculiar in one specific respect — it
has a very, very restricted structure of possible syllables. So one
can't say “Saint Petersburg” in Chinese, since syllables in
Chinese are not allowed to end in a “t” (as in “Saint”), an
“s” (as in the syllable “ters”), or a “g” (as in the
syllable “burg”). Instead, Chinese speakers will say Shèngbǐdébǎo. I also knew that there is no sound for “r”,
and that this sound is often rendered by using an “l” instead.

So, based on this background knowledge, I “translated” the
pronunciation of the word “Marlboro” into what I thought by then
was perfectly understandable Mandarin, and told the man at the shop
that I wanted to have a packet of mābóluō cigarettes.
Unfortunately, he didn't understand at all what I wanted, and only
when I pointed with my finger to the packets of Marlboro cigarettes did he finally understand, and say, “Ah, wànbǎolù!”.

So, I
learned that “Marlboro” in Mandarin Chinese is called wànbǎolù, not mābóluō, written 万宝路,
literally meaning 10 000-treasure-road, which can be translated as “road of 10 000 treasures”. (Good brand name,
actually, especially for cigarettes.) It was
only some months later that I understood why my prediction for the
Mandarin Chinese pronunciation of “Marlboro” failed so
dramatically, when I heard people from Hong Kong pronouncing the word
wànbǎolù 万宝路
in Cantonese, the Chinese dialect they speak in
Hong Kong. There, wànbǎolù 万宝路
becomes something like [maːn²²-pow³⁵-low³²]
(the numbers are tone marks), which sounds very, very similar to the
mābóluō I had falsely predicted for Mandarin Chinese.

In the image above, I have tried to depict the process by which “Marlboro” becomes the
“road of 10 000 treasures”. What we are
dealing with here is a complex pattern of change: both phrases,
Mandarin Chinese wànbǎolù and Cantonese [maːn²²-pow³⁵-low³²], are homologous. This applies to their three parts
(10 000 + treasure + road), since the phrase itself was presumably not present in earlier stages of
Chinese. In the ancestor language of Cantonese and Mandarin Chinese,
a variety we usually call “Middle Chinese” (spoken around 600
AD), the phrase “road of 10 000 treasures” would have sounded
approximately like [mjon³-paw²-lu³]. In Mandarin Chinese, the
pronunciation changed greatly, while it changed only slightly in
Cantonese.

When Marlboro
entered China, it was probably only sold in Hong Kong in the
beginning. So, in order to trigger the interest of Hong Kong consumers,
the marketing strategists did a good job in choosing a translation
that both sounded very similar to the original product name and had
a nice and promising meaning. They would use Chinese
characters to write down the product name. When Marlboro, or the
“road of 10 000 treasures” then entered the rest of China, people
would read the phrase, but pronounce it in their own way — reading the
Chinese characters in Mandarin Chinese just yields wànbǎolù,
and not mābóluō.

The transfer of the word from one dialect to another was thus made
via the diamesic dimension,
via the writing system,
not via the spoken language. And this is the way that many, many
words (also very basic terms) are exchanged between the Chinese
dialect varieties — via their “roof language”, which is the common
writing system. And since
this change doesn't involve the direct borrowing of a spoken word, it
is barely perceivable, since it leaves no direct traces in the pronunciation of the words. While normal borrowings in other languages usually sound outlandish, borrowings in Chinese dialects which make their way from one variety to another via the writing system just sound like any other possible word in the recipient dialect.
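The mechanism can be sketched in a few lines of Python. The per-character readings are the ones given in the text above; the idea of storing them in one lookup table per variety is only an illustrative assumption, not a description of any real lexicon format.

```python
# Readings taken from the text above; the two-table layout is
# an illustrative assumption.
READINGS = {
    "mandarin": {"万": "wàn", "宝": "bǎo", "路": "lù"},
    "cantonese": {"万": "maːn²²", "宝": "pow³⁵", "路": "low³²"},
}


def read_aloud(written, variety):
    """Pronounce a written form character by character in one variety."""
    table = READINGS[variety]
    return "-".join(table[character] for character in written)


brand = "万宝路"  # the written form is what travels between the dialects

print(read_aloud(brand, "cantonese"))  # maːn²²-pow³⁵-low³²
print(read_aloud(brand, "mandarin"))   # wàn-bǎo-lù
```

Nothing in the Mandarin output betrays that the name was coined for Cantonese ears: the written form is shared, and each variety simply applies its own readings to it.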

Summary

In
the same way in which languages may change via the interaction
between their written and spoken varieties, the interaction between
varieties from the other dimensions may also trigger change. Words
originating in one social layer may be transferred to other layers;
dialect words of one dialect
may become popular and henceforth be used in all dialects; and even
those varieties of our languages which are only accessible via
stories or books may be revived, at least in part, and find a
new steady place in our regular speech, up to the moment where we
again cease to use them. The
diasystematic structure of languages plays a crucial role in their
development. Due to the diasystematic character of languages, language change involves complex network-like structures within one and the same (dia)system. If we really aim to depict language evolution in all its complexity, then it is definitely not a good thing to ignore the diasystematic aspect of languages.

Monday, June 8, 2015

A few years ago I wrote a few guest posts on the Guest Blogge at the Scientopia site (as described in the post Moonlighting at Scientopia). That blog has been inactive since 2013. Unfortunately, gremlins have since crept into the system, and all of my posts are no longer credited to me. The by-line for all of the posts is now "Christina Pikas" rather than "David Morrison". She is the author of a different blog at Scientopia, Christina's LIS Rant.

I hope that this sort of thing does not happen with my online research publications, as well!

Wednesday, June 3, 2015

There are at least two misleading expressions that one very commonly encounters in the professional phylogenetics literature: "basal branch of the tree", and "derived species".

The first expression is used to refer to an unbranched lineage arising near the common ancestor, when compared to a more-branched lineage. For example, in the first diagram below we might say that taxon A is on a "basal branch", whereas taxon B is not. The taxa associated with taxon B are then referred to as the "crown" of the tree. But, how can one lineage be more basal than another? After all, both lineages connect to the "base" of the tree at the same point. To claim that one is basal and the other not is like saying that one brother is more basal than another in a family tree just because he has fewer children!

The second expression refers to a species that has more "derived" characters than another. For example, in the diagram we might say that taxon B is more derived than taxon A. Characters change from ancestral to derived through time (eg. scaly skin covering is ancestral while fur is derived, because the latter arose later in time). However, this does not make any species more derived. It is the characters that are derived, not the species: each species has a combination of ancestral characters and derived ones (including humans).

These issues seem to arise from the tree iconography. Some people seem to conceptualize this as a pine tree rather than a bush (as drawn by Charles Darwin in the Origin). A pine tree, indeed, does have basal branches and a crown. Here is an example from a sign in my local botanical garden, which tries to explain plant phylogenetic relationships to the general public. This tree does, indeed, have basal branches and a distinct crown.

This issue seems to have started with Ernst Haeckel in the late 1800s. Haeckel's first phylogenies (see Who published the first phylogenetic tree?) were drawn as multi-branched bushes, rather similar to the diagram that Darwin himself had published. However, Haeckel then veered away from this approach when explicitly discussing the evolution of humans. Here, he drew a tree with a distinct central trunk and much smaller side-branches (presumably modeled on an oak tree, rather than a bush). This image emphasizes one particular lineage at the expense of the others, because there is one taxon obviously sitting at the crown of the tree while the others are relegated to side-branches.

This approach to drawing a phylogeny can be used to put any chosen organism at the crown of the tree, not just human beings, as illustrated by the following diagram from James Scott (which looks like it is actually modeled on a pine tree). This is a fundamental characteristic of a phylogeny — it can be drawn so that any part of the diagram is at the crown. However, to be accurate it should always be drawn so that no one lineage is emphasized over any other one — there should be no taxa sitting at the crown.

J.A. Scott (1986) The Butterflies of North America: a Natural History and Field Guide. Stanford University Press, Stanford.
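The claim that any taxon can be drawn at the crown without changing the phylogeny can be checked with a small sketch in Python. The tree is written as nested tuples with four hypothetical taxa (A to D), and "rotating" simply swaps the drawing order of the branches at every node.

```python
def clades(tree):
    """Return the set of clades (frozensets of leaf names) in a tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):  # a leaf
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    walk(tree)
    return found


def rotate(tree):
    """Swap the drawing order of the branches at every node."""
    if isinstance(tree, str):
        return tree
    return tuple(rotate(child) for child in reversed(tree))


ladder = ("A", ("B", ("C", "D")))  # one drawing: D ends up at the "crown"
flipped = rotate(ladder)           # ((("D", "C"), "B"), "A"): now A does

print(flipped)
print(clades(ladder) == clades(flipped))  # True
```

Both drawings contain exactly the same groups, so which taxon appears to sit at the top is a decision made by the illustrator, not information contained in the phylogeny.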

Distorted images occur in several ways in modern evolutionary biology. This topic has received considerable attention in the literature, and there are a number of very readable expositions of various parts of it. Here is a brief list.

Monday, June 1, 2015

On this blog we occasionally draw your attention to the overlap between the scientific world and the artistic world. The language tree shown below is from the Stand Still Stay Silent site, which describes itself as "a post apocalyptic webcomic with elements from Nordic mythology". The tree data apparently come from the Ethnologue language database.

The detail about the Nordic languages derives from the fact that the author, Minna Sundberg, is Finnish-Swedish, and the Scandinavian languages have next to nothing in common with the Finno-Ugric languages.