December 10, 2003

Irresponsible Punditry

The paper "Language Tree Divergence Times Support the Anatolian Theory of Indo-European
Origin" discussed in a previous posting was the subject of an article by Boston Globe
staff writer Gareth Cook in the Thanksgiving Day issue (p. A16). The title, "A new word
on birth of Western languages", is a little odd, since the Indo-European languages include
not only most of the languages of Europe but most of the languages of such non-Western
countries as Iran (Persian), Armenia (Armenian), Afghanistan (Dari, Pashto),
Pakistan (Panjabi, Urdu), India (Sanskrit, Hindi, Gujarati, Bengali, Assamese, Marathi, Oriya),
Nepal (Nepali), Bangladesh (Bengali), and Sri Lanka (Sinhalese), as well as Kurdish,
spoken in Turkey, Iran, and Iraq, and Tocharian A and B, once spoken in Chinese Turkestan.
That said, the article itself is pretty good, apart from one irritating bit:

Gray was trained as a biologist, not a linguist, which some scientists said could explain
the generally cautious reception yesterday's paper was greeted with among linguists.
"Partly, I think they are irritated", said Luigi Luca Cavalli-Sforza, who is a leading
expert on historic population migrations and a professor emeritus at Stanford
Medical School. "It is a very good paper."

Cavalli-Sforza is indeed a distinguished geneticist, whom I first encountered via
his book Cultural Transmission and Evolution: A Quantitative Approach
which I read with pleasure many years ago and still own.
But as far as I can tell Cavalli-Sforza has no reason whatever to think that
the cautious reaction of linguists to the paper was based on anything other
than legitimate scientific issues. There are some such issues,
discussed in my previous posting.
I know of no evidence that anyone's reaction was based on irritation. He's just blowing smoke.

For the record, here are the comments that I sent Gareth Cook when he was writing the
piece in question. It seems to me that
they make a few technical points, are in many ways positive about the paper, and withhold
final judgment until I can find out more about what exactly the authors did.
It's fair to characterize them as cautious, but I don't see any irritation.
You can judge for yourself.

The paper by Gray and Atkinson is a serious paper.
It shows familiarity with the literature and attempts to address
the known problems with glottochronology and methods of dating
based on lexical turnover. And they used a reasonably reliable
source of data and information about cognation. They have also
taken a number of precautions to ensure that their results
are not the result of chance and to see that their assumptions
are not influencing the results. So it compares quite favorably with the
junk that we sometimes see in which people apply a technique
from another field to a problem that they don't really understand,
often with poor data sources.

The main question that this paper leaves me with is whether their
technique adequately addresses the fact that the rate of lexical
replacement is known not to be constant. They acknowledge the issue
and say that "the assumption of a strict clock can be relaxed by
using rate-smoothing algorithms to model variation across the
tree." The reference they give is to what appears to be the manual
for a piece of software. I'm not familiar with this, so on short notice
I simply can't tell whether it adequately addresses the problem.

The other problem, pointed out by Don Ringe, is that it isn't clear
what exactly they have done with their cognate sets. Dyen et al.
contains Swadesh 200-word lists for 95 languages.
They excluded 11 languages that Dyen et al. did not
code, which leaves 84. Then they added Hittite, Tocharian A, and Tocharian B.
So they should have 200 cognate sets across 87 languages. If they were
using methods of the sort I am most familiar with, each cell in the matrix
would have a value indicating either "for this lexical item this language
retains a reflex of the reconstructed Proto-Indo-European etymon" or
not. But that can't be what they have done since they talk about 2,449 cognate
sets. So they've apparently split each gloss into multiple cognate sets,
and they don't explain how.

I have an idea of what they might have done, but it's just a guess.
Perhaps they have used each subset of cognate words as a "cognate set". For example,
the PIE word for "bear" is believed to be the ancestor of Latin ursus,
Greek arktos, Sanskrit rkshas, Welsh arth (as in the name Arthur) etc.
However, this doesn't show up in Germanic and Balto-Slavic. Germanic languages
have words like English "bear", German baer, Old Norse bjorn - evidently
they referred to bears as "the brown ones". In Slavic you get words like
Russian medved, literally "honey eater". Presumably, this reflects taboo-ing
of the original word for bear. Anyhow, in a case like this they might
have treated cognates of ursus as one cognate set, cognates of bear as another
cognate set, and cognates of medved as a third cognate set. There's nothing
wrong with that, as far as it goes. But "has a cognate of ursus", "has a cognate
of bear", and "has a cognate of medved" are not independent - e.g., if a language
has a cognate of ursus as its word for "bear", it doesn't have a cognate of medved.
So if your technique assumes the independence of the characters, you can't do this.

It's quite possible that whatever they've done is not problematic - I can't tell
because they don't give sufficient detail.

A minor comment is that it is a little odd to use the Romance languages when
they are known to descend from Latin, which of course is well attested.
Using the daughters rather than the ancestor can only add noise.
Presumably they didn't use Latin because Dyen et al. don't give Latin data.
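
The coding guess in the comments above can be made concrete with a minimal Python sketch,
in which each cognate class for a gloss becomes its own presence/absence character. This is
only an illustration of that guess, not a description of what Gray and Atkinson actually did.
The class labels and the helper names binary_characters and language_count are hypothetical,
and the data is just the handful of words for "bear" mentioned in the comments.

```python
# Illustrative data only: the words for "bear" mentioned in the comments.
# The class labels ("ursus", "bear", "medved") are hypothetical names
# chosen for this sketch, one per group of cognate words.
BEAR_COGNATE_CLASS = {
    "Latin": "ursus",
    "Greek": "ursus",
    "Sanskrit": "ursus",
    "Welsh": "ursus",
    "English": "bear",
    "German": "bear",
    "Old Norse": "bear",
    "Russian": "medved",
}


def binary_characters(coding):
    """Split one gloss into one presence/absence character per cognate class.

    `coding` maps each language to the cognate class of its word for the gloss.
    The result maps each language to a row of 0/1 values, one per class.
    """
    classes = sorted(set(coding.values()))
    return {
        language: {c: int(c == cls) for c in classes}
        for language, cls in coding.items()
    }


def language_count(dyen_total=95, uncoded=11, added=3):
    """Languages in the analysis: Dyen et al., minus uncoded, plus additions."""
    return dyen_total - uncoded + added


matrix = binary_characters(BEAR_COGNATE_CLASS)
for language, row in matrix.items():
    print(f"{language:10s}", row)

# In this toy data every language has exactly one word for "bear", so every
# row contains exactly one 1: a 1 in one column forces 0 in the others.
# The three binary characters are therefore not independent.
assert all(sum(row.values()) == 1 for row in matrix.values())

# 95 languages - 11 uncoded + Hittite, Tocharian A, Tocharian B
print(language_count())
```

Because every language in this toy data has exactly one word for "bear", each row contains
a single 1, so knowing one column constrains the others. That is the dependence the comments
warn about. Real data would also have missing entries and languages with more than one word
for a gloss, which this sketch ignores.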

We expect scientists to provide objective commentary based on a knowledge of the subject,
not insinuations about the alleged motives of those who disagree with them. We can leave
that to the Postmodernists in the literature departments. So you might think that
Cavalli-Sforza's remark was merely an addendum to a discussion of the scientific
issues and suppose that
the newspaper is at fault for reporting only the fluff. That probably isn't what
happened though: this wouldn't be the first time that Cavalli-Sforza has substituted
unfounded, ad hominem remarks for intelligent commentary.

Cavalli-Sforza is a staunch defender of the late Joseph Greenberg, whose 1987
book Language in the Americas is generally considered by historical linguists
to be worthless, partly because its methodology is invalid, and partly because Greenberg's
handling of the data is so appallingly bad. Cavalli-Sforza hasn't made any attempt
to defend Greenberg's data, and his attempts to defend Greenberg's methodology contain
nothing of substance. Let's take an example.
In his book Genes, Peoples, and Languages he says (pp. 137-138):

...some anti-Greenberg linguists believe it is impossible to
posit a quantitative relationship between any two languages.
By disallowing reliable measurements, and by limiting the
relationship between two languages only to "related or
not related", the American linguists opposing Greenberg have
ruled out the possibility of hierarchical classification, an
essential prerequisite to taxonomy.

Now, this is perfect nonsense. I think it is fair to say that all of the linguists
who have criticized Greenberg's work believe in degrees of relationship, that is,
that some languages are more closely related to each other than to other languages.
I have never heard ANY linguist express the view described by Cavalli-Sforza.
Virtually every book and paper on historical linguistics assumes a hierarchical
classification. To claim that historical linguists are critical of Greenberg
because they don't believe in degrees of relationship is like claiming that
biologists are critical of Lysenko because
they don't believe in evolution.

It is also striking that such an amazing claim is supported by no evidence.
Cavalli-Sforza doesn't even name any of the linguists who allegedly hold this
view, much less supply quotations from their work or references to it.
That's because there isn't any supporting evidence.

Just to be sure, I asked Cavalli-Sforza if he could offer any support for his claim:

From wjposer Sat Feb 1 13:26:19 2003
To: cavalli@stanford.edu
Subject: degrees of relationship
Content-Length: 698
Status: RO
Dear Professor Cavalli-Sforza:
In your book Genes, Peoples, and Language at pp. 137-138 you say:
...some anti-Greenberg linguists believe it is impossible to
posit a quantitative relationship between any two languages.
By disallowing reliable measurements, and by limiting the
relationship between two languages only to "related or
not related", the American linguists opposing Greenberg have
ruled out the possibility of hierarchical classification, an
essential prerequisite to taxonomy.
I wonder if you could supply the names of the linguists who
take this position and references to publications in which
they have done so. Thank you.
Bill Poser

In his reply he provides no support for the claims in his book: no references, no names.
In fact, he admits that he doesn't have any firsthand knowledge of what he is
talking about and has taken his views from Joseph Greenberg, the very person
the critics are criticizing. Caveat lector.