April 30, 2010

The shape and tempo of language evolution (Greenhill et al. 2010)

This is an extremely interesting paper which addresses the claim that typological features of languages (e.g., whether they use Subject-Verb-Object) are more conservative than the lexicon. If that is the case, then typological features could be used to infer evolutionary relationships between languages that are older than ten thousand years or so (an upper limit on what can be inferred using vocabulary).

In general, the authors reject the idea of typological conservation, although they note that typological features differ in this respect, and some of them may appear to be conservative within some language family but evolve rapidly in another. Their tree reconstruction is able to infer well-known language families (e.g., Indo-European), or suspected ones (e.g., Nostratic), but the corresponding clusters are not robust (e.g., Hindi is broken away from the IE cluster, and unrelated non-Eurasian languages fall into the Nostratic one).

There are approximately 7000 languages spoken in the world today. This diversity reflects the legacy of thousands of years of cultural evolution. How far back we can trace this history depends largely on the rate at which the different components of language evolve. Rates of lexical evolution are widely thought to impose an upper limit of 6000-10 000 years on reliably identifying language relationships. In contrast, it has been argued that certain structural elements of language are much more stable. Just as biologists use highly conserved genes to uncover the deepest branches in the tree of life, highly stable linguistic features hold the promise of identifying deep relationships between the world's languages. Here, we present the first global network of languages based on this typological information. We evaluate the relative evolutionary rates of both typological and lexical features in the Austronesian and Indo-European language families. The first indications are that typological features evolve at similar rates to basic vocabulary but their evolution is substantially less tree-like. Our results suggest that, while rates of vocabulary change are correlated between the two language families, the rates of evolution of typological features and structural subtypes show no consistent relationship across families.

38 comments:

Re the colored, circular chart classifying the world's languages according to their typological features:

Group 2's features should be thought of as the neolithic innovation pattern, emerging with agriculture. The un-numbered group to the left represents the original, pre-agriculture pattern. Group 1 is, if not an actual hybrid of the two, at least a pattern that is mid-way between the others, sharing aspects of each.

Basque appears closest to Hunzib (http://en.wikipedia.org/wiki/Hunzib_language), which is from the Northern Caucasus - right where one would expect, based also on Blood Type O-Negative, which is also high in this area.

This research also seems to suggest that the Nostratic family has a good chance of being real. And it was formed where we have always thought hg. R was born: in the area of the Caucasus and the East and West Urals. Given the links with Sino-Tibetan that I think I have demonstrated in my studies, which I have discussed elsewhere in this forum, it is probably not absurd to think of a very ancient time when hg. NO and P were in South Siberia. My thinking, as you know, is that some R (R1b1* etc.) came to Western Europe very early, between the LGM and the Younger Dryas.

While interesting, the study has some real methodological flaws and relies on a very incomplete data set (the WALS data).

The analysis is not very explicit at all on the issue of weighting. Not all language features in WALS are created equal, and this study doesn't appear to reflect this fact. Likewise, giving equal weight to similar languages with recent shared histories and to ones without is problematic. If there is ever a case for being a Bayesian, it is in a context like this one, where we have a lot of context to guide our statistical inferences.

The big issue that the study expressly says it is trying to deal with is the degree to which non-lexical features are stable within IE languages and Austronesian languages. The key question is the degree to which particular language features: (1) derive from parent languages, (2) derive from substrate languages, (3) derive from areal relationships (i.e. borrowing based on geographical proximity), and (4) are random. But the study design is ill-suited to capturing this issue. A particularly glaring omission is the absence of proximity and lexical language commonality as variables to be considered in the study.

If adjacent languages known to be part of different language families based on history and lexicon tend to share a feature, this is a good argument that the feature is areal or due to a substrate. Conversely, if adjacent languages known to be part of different language families preserve distinctions in other language features, this is a good argument that the feature is due to a parent language and strongly conserved over time.

Similarly, when features are shared by languages spoken by groups with the same population genetics, but not by adjacent languages spoken by people with different population genetics, this argues for either substrate or parent language influences; while features shared across these lines are likely to be areal.

The most interesting result of this kind of analysis would be one showing the most and least areal features in the WALS data, supplemented by proximity and Fst population genetic similarity data.
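A rough sketch of the kind of comparison proposed here, in Python. Everything in it is an assumption made for illustration: the language names, families, adjacency pairs and feature values are toy placeholders, not WALS data, and `areal_score` is a hypothetical helper. For each feature it measures how often geographically adjacent languages from different families share a value; a high score would hint at areal spread, a low one at inheritance from a parent language.

```python
from collections import defaultdict

# Toy placeholder data (NOT real WALS values): language -> (family, features)
LANGS = {
    "A1": ("FamA", {"word_order": "SOV", "ergative": "yes"}),
    "A2": ("FamA", {"word_order": "SVO", "ergative": "yes"}),
    "B1": ("FamB", {"word_order": "SOV", "ergative": "no"}),
    "B2": ("FamB", {"word_order": "SVO", "ergative": "no"}),
}

# Hypothetical pairs of geographically adjacent languages.
ADJACENT = [("A1", "B1"), ("A2", "B2"), ("A1", "A2")]


def areal_score(langs, adjacent):
    """Per feature, the share of adjacent cross-family pairs with equal values."""
    shared = defaultdict(int)
    total = defaultdict(int)
    for x, y in adjacent:
        fam_x, feats_x = langs[x]
        fam_y, feats_y = langs[y]
        if fam_x == fam_y:
            continue  # same-family pairs say nothing about borrowing
        for feat in feats_x.keys() & feats_y.keys():
            total[feat] += 1
            shared[feat] += feats_x[feat] == feats_y[feat]
    return {feat: shared[feat] / total[feat] for feat in total}


scores = areal_score(LANGS, ADJACENT)
for feat, score in sorted(scores.items()):
    print(f"{feat}: {score:.2f}")
```

In this toy setup word order is shared by every adjacent cross-family pair while ergativity is shared by none, which is the pattern one would read as "areal" versus "inherited". A real analysis would also need distance weighting and a correction for how common each feature value is worldwide.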

Eyeballing the WALS data in maps (which reveals proximity issues) and with a knowledge of population genetics, for example, makes the extreme distinctiveness of the various Caucasian languages in both certain grammar features (like ergativity) and phonetics leap off the page. The case from proximity for ergativity not being a trait that is spread areally is very strong.

Another data set that is missing, and important for the kind of analysis the study is purporting to do, is the actual evolution of features over time where there is a continuous data set in literate languages. What features actually varied from Old English to modern English, from Sanskrit to Hindi, from early Akkadian to late Akkadian? Scratch those from the list of conservative features.

We also have some very good data on language evolution in cases of known substrates, like the replacement of Sumerian by Akkadian, where there was a known period of bilingualism. This also makes a case for a lack of substrate-to-superstrate carryover of ergativity.

In short, dumb statistics don't cut it, and one needs the right data set and the right use of that data set to answer the questions asked in this paper.

Andrew, the authors cull languages that have many missing features, and they also study different features separately, so your points really don't stand.

Also, if you take the idea that certain languages form a family, then of course you will discover features that are stable within them. But, that's hardly interesting. What the authors are interested in is discovering new language relationships, especially older ones for which typological features have been claimed to be stable.

All in all, I think that the paper is excellent at what it does, which is cast doubt on the alleged stability of typological features or their superiority to the lexicon.

This paper confirms Greenberg's Euro-Asiatic hypothesis (but not Carleton Hodge's Lislakh hypothesis) and also Pedersen's Nostratic hypothesis, although the presence of languages such as Basque, Burushaski and Ingush in the first group is somewhat odd, as those languages are said to be Sino-Caucasian.

But at the same time it undermines the Afrasan hypothesis, as it shows that South Afrasan (Cushitic, Omotic, etc.) is very distant from North Afrasan (Egyptian, Berber, Hausa, Semitic).

Maybe a good effort. But going by the words of Dienekes, it goes from an extremely interesting paper to an extremely amateurish one.

Many Indologists have previously said that some of the typological features of the Indian Indo-European languages come from a Dravidian substratum. The paper gives very little consideration to the Dravidian languages, which also provided scripts for a lot of ASEAN countries (Thai, Burmese, Cambodian, etc.). That aspect makes this paper ridiculous.

Quechua? How could that possibly be related to languages like Turkish or Hindi?

Random noise / chance.

I like that the typology groups the "Balkan" languages across 3 to 4 IE subfamilies. Probably both substrate and diffusion are at work. I am puzzled, though, by how far away Greek is listed in both methodologies - anecdotally, I see at least as many (if not more) shared words with Germanic as with Slavic languages, especially if you look at similar or related meanings, and not exact matches.

Irish is also interesting. I have argued before that insular Celtic is extremely different from what we know of "Celtic" languages close to the western Alps - which to me appear much closer to Latin and Germanic. And I still don't believe that the language spoken ~2,500 years ago east of the Rhine and south of the Danube was anything resembling Celtic at all - likely much closer to proto-Germanic, given the ~1,000km shared "border" that in fact has no significant geographic obstacle whatsoever.

"In general, the authors reject the idea of typological conservation"...

Thank goodness! I was already establishing the 'nasal theory of language families': what matters is the shape of the nose, that's what makes Evo Morales and Oteiza speak "similar": their big squarish noses. :D

Now seriously, one may come to suspect that beyond phylogeny and mere areal features, part of the typological structure might reflect some sort of similar substrate, some particular way of thinking of language, any language (for instance the existence of certain phonemes or not, or how tonality affects speech, if at all). In fact this would not be too different from areal features or even phylogeny itself. But it seems difficult to disentangle the mesh.

Andrew, excellent post, and a good illustration of why statistical comparison of large numbers of languages is really not a very useful tool.

The processes through which languages change are really far more complicated than anything we find in genetics. Even if we are looking at time depths of just 3000 years, we have to take into consideration the effects of sprachbunds, processes of creolisation, adult second language acquisition and diglossia, not just with known language families, but with languages which have left no historical record.

I'd also like to add a couple of problems with the analysis.

1. Typological features have not been proven to be stable over the massive depths of time needed for this analysis to have value. In fact they are clearly subject to language contact effects and language change. The S-V-O example given in the post illustrates this perfectly. English has been SVO for 500 years; for 500 years prior to that it was becoming SVO. Prior to that it was a synthetic (heavily inflected) language, where various sentence orders were possible.

Imagine how many more of these effects have occurred over 10 or 20k years. It more or less destroys any hope of accurately finding links.

2. Typological similarities are not always significant. One could easily declare that Greek and Basque are both synthetic languages, and so similar. But the way they are synthetic is completely different. In individual cases, one can see the mistake easily, but with a statistical analysis it will be missed.

Actually, to the whole NE Caucasian family: they converge very clearly well below (or rather 'above' in the graph) where they converge with Basque. One may also argue that they converge with Burusho and NW Caucasian (Abkhaz) well before converging with Basque.

This could be some support for the Vasco-Caucasian hypothesis and especially for my favorite version: Basque-NE Caucasian-Hurro-Urartean-Sumerian with a Gravettian origin for all.

"our analysis of rates of evolution failed to identify any typological features that evolve at consistently slower rates than the basic lexicon. If the signal in the lexicon does stretch back as far as 10,000 years then our results suggest that typological data is constrained by a similar time horizon."

One to remember when looking at the chart at the top of the article...

On a related thought, the graph does suggest a further grouping of this putative Vasco-Caucasian-Burusho with 'Dravido-Georgian', 'Altai-Andean' and maybe also Barbacoan (Awa Pit) and Uralic at a very deep common root, distinct from Indo-European, which might be rooted in the colonization of West Eurasia and Central Asia (with extensions as far as Mongolia). Just a hunch that makes some potential sense to me.

Maju, remember that the great finding of Alfredo Trombetti (La lingua basca, 1925) was that Basque was linked with the Caucasian languages. As we now know that there is a macro-group (Basque-Caucasian-Sino-Tibetan-Na-Dené), to which I think Sumerian belongs, the link of Basque with Sumerian is possible. Of course other languages have undergone some mixing that Basque and Sumerian have not, since the two have not been in contact since very ancient times. The relatedness could go back tens of thousands of years.

Gioello: sure, the Basque-Caucasian tentative connection has been there for a while but the very structure of the three Caucasian language families itself is a matter of controversy and whatever connection they have with Basque is very blurry.

There is not one Caucasian family but three (NW Caucasian, NE Caucasian and Kartvelian) and they have never been conclusively connected to each other. NW Caucasian has been related to Hattic and NE Caucasian to Hurro-Urartean, while Kartvelian sometimes shows up in the Nostratic hypothesis.

However, the most convincing stuff I have read was about a NE Caucasian-Basque connection. Also I have toyed a bit with shrunk-down versions of mass lexical comparison using numbers 1-5, in fact looking for potential cousins for Sumerian. What I 'found' was that Sumerian seemed closest in this aspect to NE Caucasian-Hurro-Urartean and that Basque also showed up at a most distant position in that same grouping.
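A minimal sketch of what such a shrunk-down comparison of the numbers 1-5 could look like in Python. This is an illustration under stated assumptions, not the procedure actually used for the exercise described above: it compares ordinary spellings with plain Levenshtein edit distance, and the helper names `levenshtein` and `mean_distance` are hypothetical. Latin and Spanish are included only as a sanity check with a known relationship.

```python
# Numerals 1-5 in ordinary orthography. Spelling is only a crude proxy for sound.
NUMERALS = {
    "Basque": ["bat", "bi", "hiru", "lau", "bost"],
    "Latin": ["unus", "duo", "tres", "quattuor", "quinque"],
    "Spanish": ["uno", "dos", "tres", "cuatro", "cinco"],
}


def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mean_distance(words_a, words_b):
    """Average edit distance per numeral, normalised by the longer word."""
    pairs = list(zip(words_a, words_b))
    return sum(levenshtein(a, b) / max(len(a), len(b)) for a, b in pairs) / len(pairs)


for name in ("Spanish", "Basque"):
    d = mean_distance(NUMERALS["Latin"], NUMERALS[name])
    print(f"Latin vs {name}: {d:.2f}")
```

On this toy input Spanish comes out much closer to Latin than Basque does, as one would expect. With words this short, chance resemblances are frequent, so a score like this can only suggest where to look; it cannot confirm a relationship.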

Considering that archaeological evidence places the origins of Sumerians at the Zagros Neolithic and that its precursor, the Zagros Epipaleolithic (Zarzian culture) is very possibly derived from Eastern European Epigravettian, via the Caucasus, the elements converge at Gravettian, so it's only logical that the distance is so huge and so hard to spot and confirm.

Another possibility might be a Neolithic origin, but we should see much clearer affinities in that case.

Before the beginning of modern linguistics, during the 19th century, it was a common belief that every language derives from Hebrew. This is in line with the belief that every man derives from Adam, that the world has the age given by the Jewish calendar, and that man has been made in the image of God.

Of course in linguistics and genetics things are very different.

Anyway, Muslims do think that Arabic was the language of Allah, and for this reason they don't translate it.

Mr Maju, I did not understand how you extrapolate such conclusions. Have you read my comment? IE = Indo-European, so Basque bi (2) is similar to IE bi as in (bi)national, (bi)directional. The numbers you gave are modern Araic ones, not Semitic nor AA (Afro-Asiatic). I think you know that Hurrian (6) and IE + Kartvelian + Altaic + Uralic (7) [which is very similar to the Basque ones] are considered loans from PS to PIE, P-Uralic, P-Kartvelian, Hurrian (and perhaps Basque) by mainstream linguists. Here I rewrite my comment (IE = Indo-European, AA = Afro-Asiatic, and please note that I did not comment on the numbers 4, 9, 10).

"so Basque bi (2) is similar to IE bi as in (bi)national, (bi)directional".

The problem is that bi is not Indo-European. The IE root for two is dwos and the Latin word is similar: duo.

'Bi-' is a particle taken by Latin from some other pre-IE language (Ligurian? Iberian?) and extended via Latin through many languages of the West, including Basque itself. Making 'bi' be IE is forcing things a lot. It's a clear case of Vascoid substratum in West Europe.

"The numbers you gave are modern Araic ones, not Semitic nor AA (Afro-Asiatic)".

Clear now? Otherwise please point to the exact connections with living or even proto-AA reconstructions (hard to make looking at such high diversity).

"I think you know that Hurrian (6) and IE + Kartvelian + Altaic + Uralic (7) [which is very similar to the Basque ones] are considered loans from PS to PIE, P-Uralic, P-Kartvelian, Hurrian (and perhaps Basque) by mainstream linguists".

It is possible that Basque sei (6) and maybe even sazpi (7) are loans from some IE language, like Vulgar Latin. But my exercise only dealt with numbers from 1 to 5 (because in Sumerian six, etc. are expressed as "five-one", etc. and I was looking for Sumerian relations when I undertook it, not Basque ones - also because these low-range numbers seem more linguistically stable and conservative).

However I'm not really persuaded by the theories that consider that proto-Semitic influenced PIE. I'd rather think that both have similar influences by other "Neolithic" languages of the area, but hard to tell with the limited information we have. IMO PSem and PIE never really interacted directly.

Thank you for your comment. I recommend that you read Blazek's book on numerals: for example, PIE (2) is not connected with Proto-North-Afrasan (2) but rather with the Proto-North-Afrasan word for twin, and the case is similar for PIE 5 and PS fist.

All mainstream Indo-Europeanists (including even the most "IE-centrist" ones) recognize PIE 7 as a loan from PS; the PS word for 7 has a Semitic etymology, Semitic morphology and Afrasan parallels.

The more liberal mainstream IE-ists also tentatively consider that PIE 7 and 3 are PS loans or common PS-PIE roots.

I have already discussed above that "bi" just cannot be IE. Even if one might argue for a b@ <> d@ (where @ is any random vowel, so we can think of dwos and bi as hypothetically related - quite forced but anyhow), the relation should be most remote, highly archaic: Paleolithic. Even English 'two' is a zillion times closer.

Let's see the other numbers:

1-bat. Potentially there might be a connection with some AA words: at/att in some Amharic languages, with close relatives in Hebrew (ahat, axat) and South Arabian t'ad. A few non-Semitic AA languages also have similar forms of 1 (adda, ta, da), so it's possible (though rather hard to explain).

3-hiru (pronounced iru, the 'h' is a modernism of Occitan influence). Any relation with IE *treyes can only be in letter R, hence not closer than for 2 (see above). I could not find any AA word that would be closer than PIE/modern IE.

4-lau (*laur). Rien de rien. You acknowledge this.

5- bost (*bortz). Proto-Berber *fuss with possible connections in Omotic only. Very tentative, especially as modern Berber is rather different.

7- sazpi. Might be from IE *septm via Latin septem. But I think it was Krutwig who noticed that sazpi and sortzi (8) could be read as S+azpi/ortz, with 'azpi' meaning below and 'ortz' the sky (above). He had some speculations on this and its hypothetical relation to the Scottish Mason pillars Jakin (in Basque meaning 'to know') and Boaz (in archaic Basque 'boz' means heart ('bihotz' in modern Basque) and happiness ('poz' in modern Basque)). But, well, he believed Picts spoke Basque... IDK.

8- sortzi. See above. I don't think it can have any relation with PIE *okto: except in the extremely remote sense as with 2 and 3.

9- bederatzi. Such a long word should be an artificial creation. I have recently speculated on a Megalithic age possibility here. Most likely with Basque etymology, IMO.

10- hamar (read: amar). Might be related to proto-Berber *meraw (and many derivatives in modern Berber).

What would I make of all this?

A. There may well be a very remote Basque-IE connection (Gravettian?). I have already detected a few other basic words, like Basque 'izan': to be (however, mainstream linguists consider 'izan' a modern word, which is contradicted by the Veleia inscriptions - but these have been 'inquisitioned' by the linguist popes' camarilla). However, I don't see any reason to think of a modern (post-Paleolithic) connection: such a thing would be much more obvious.

B. There may well have been Vascoid influence in Berber, both at the pre-AA Oranian substratum and in the Megalithic Age. Otherwise, I'd consider the occasional connection with mostly Ethiopian languages a product of mere chance. Anything else would need much stronger demonstration.

C. I don't think that because an element or series of elements is (arguably) present in two distinct language families, that means any sort of phylogenetic relationship. Areal influences (sprachbund) and shared influence from a third language or group of languages can perfectly well, and often better, explain such borrowings.

Thank you very much. Here is another amateurish interpretation (in French) that shows that all Norafrasan and Iranohittite (or Indohittite) numbers are interconnected (i.e. Norafrasan and IH are genetically connected and arose in the Middle East after the demographic explosion that occurred after the discovery of agriculture, which is why we can see these numbers in such distant language families as Na-Dene, Altaic, Uralosiberic, Hurrourartean...)


Dienekes' Anthropology blog is dedicated to human population genetics, physical anthropology, archaeology, and history.

You are free to reuse any of the materials of this blog for non-commercial purposes, as long as you attribute them to Dienekes Pontikos and provide a link to either the individual blog entry or to Dienekes Anthropology Blog.
