Friday, September 17, 2010

The Indus argument continues

Last year there was much excitement and noise, including on this blog [1, 2, 3], when a group of scientists (led by Rajesh Rao at the University of Washington, and including my colleague Ronojoy Adhikari) published a brief paper in Science supplying statistical evidence that the Indus symbols constituted a writing system. In their words, they "present evidence for the linguistic hypothesis by showing that the script’s conditional entropy is closer to those of natural languages than various types of nonlinguistic systems."

This rather modest claim outraged Steve Farmer, Richard Sproat and (presumably) Michael Witzel (FSW), who had previously "proved" that the Harappan civilization was not literate (the paper was subtitled "The myth of a literate Harappan civilization"). In a series of online screeds, they attacked the work of Rao et al: for reviews, see this previous post, and links and comments therein.

Now Richard Sproat has published his latest attack on Rao et al. in the journal Computational Linguistics. Rao et al have a rejoinder, as do another set of researchers, and Sproat has a further response to both groups (but primarily to Rao et al); all these rejoinders will appear in the December issue of Computational Linguistics.

To summarise quickly, the way I see it: Sproat claims (as he previously did on the internet) that Rao et al.'s use of "conditional entropy" is useless in distinguishing scripts from non-scripts, because one can construct non-scripts with the same conditional entropy, and because their extreme ("type 1" and "type 2") non-linguistic systems are artificial examples. Rao et al. respond that that is a mischaracterisation of what they did, observe that Sproat entirely fails to mention the second figure from the same paper or the more recent "block entropy" results, and repeat (in case it wasn't obvious) that they don't claim to prove anything, only offer evidence. They give inductive and Bayesian arguments for why the mass of evidence, including their own, should increase our belief that the Indus symbols were a script.

In connection with the Bayesian arguments, Rao et al. do me the honour of citing my blog post on the matter, thus giving this humble blog its first scholarly citation. My argument was as follows: given prior degrees of belief, $P(S)$ for the script hypothesis and $P(NS)$ for the non-script hypothesis, and given "likelihoods" of the data under each hypothesis, $P(D|S)$ and $P(D|NS)$, Bayes' theorem tells us how to calculate our posterior degrees of belief in each hypothesis given the data:

$P(S|D) = \frac{P(D|S)P(S)}{P(D|S)P(S) + P(D|NS)P(NS)}$

We can crudely estimate $P(D|NS)$ by looking at the "spread" of the language band in figure 1A of their Science paper and asking how likely it is that a generic non-language sequence would fall in that band: assuming that it can fall anywhere between the two extreme limits that they plot, we can eyeball it as 0.1 (the band occupies 10% of the total spread). [Update 17/09/2010: See the plot below, which is identical to the one in Science, except for the addition of Fortran (blue squares, to be ignored here).] Let us say a new language is very likely (say 0.9) to fall in the same band. Then $P(D|NS) = 0.1$ and $P(D|S) = 0.9$. If we were initially neutrally placed between the hypotheses ($P(NS) = P(S) = 0.5$), then we get $P(S|D) = 0.9$: that is, after seeing these data we should be 90% convinced of the script hypothesis. Even if we started out rather strongly skeptical of the script hypothesis ($P(S) = 0.2$, $P(NS) = 0.8$), the Bayesian formula tells us that, after seeing the data, we would be almost 70% convinced ($P(S|D) = 0.69$).
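The arithmetic is easy to check for oneself; here is a minimal sketch in Python, using the likelihood values eyeballed above (these numbers are rough estimates from the figure, not from any fitted model):

```python
def posterior_script(p_s, p_d_given_s=0.9, p_d_given_ns=0.1):
    """Posterior P(S|D) from Bayes' theorem, given the prior P(S)
    and the eyeballed likelihoods P(D|S) and P(D|NS)."""
    p_ns = 1.0 - p_s  # prior for the non-script hypothesis
    return (p_d_given_s * p_s) / (p_d_given_s * p_s + p_d_given_ns * p_ns)

print(round(posterior_script(0.5), 2))  # neutral prior -> 0.9
print(round(posterior_script(0.2), 2))  # skeptical prior -> 0.69
```

Plugging in other priors shows how robust the conclusion is: even a quite skeptical starting point ends up favouring the script hypothesis.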

We can quibble with these numbers, but the general point is that this is how science works: we adjust our degrees of belief in hypotheses based on the data we have and the extent to which the hypotheses explain those data.

Sproat apparently disagrees with this "inductive" approach, and accuses Rao et al. of lack of clarity in their goals. On the first page, he clarifies that he was talking only of the Science paper and has not carefully analysed [correction 17/09/10] the more recent papers by Rao and colleagues; he says those works do not affect questions about the previous paper, writing,

'To give a stark example, if someone should eventually demonstrate rigorously that cottontop tamarins are capable of learning “regular” grammars, that would have no bearing on the questions currently surrounding Marc Hauser’s 2002 publication in Cognition.'

In this way Sproat succeeds in insinuating, without saying it, that the work of Rao et al. may have been fraudulent. (Link to Hauser case coverage)

A little later, on the claim that the arguments of FSW "had been accepted by many archaeologists and linguists", he offers this belated evidence that such people do exist:

Perhaps they do not exist? But they do: Andrew Lawler, a science reporter who in 2004 interviewed a large number of people on both sides of the debate notes that "many others are convinced that Farmer, Witzel, and Sproat have found a way to move away from sterile discussions of decipherment, and they find few flaws in their arguments" (Lawler 2004, page 2029), and quotes the Sanskrit scholar George Thompson and University of Pennsylvania Professor Emeritus of Indian studies Frank Southworth.

Having thus convincingly cited a science reporter to prove that the academic community widely accepts FSW's thesis, he proceeds to the actual claims about the symbols; after a few pages of nitpicks not very different from the above, he addresses a point which he had previously raised in this comment: why does figure 1A in the Science paper not include Fortran? He suspects that Fortran's curve would have overlapped significantly with the languages, "compromising the visual aspect of the plot". I actually find that explanation credible(*), and I was not comfortable with the manner of presentation of the data in the Science paper: but I view this as a problem with the "system" rather than the authors. Enormous prestige is attached to publication in journals like Science. To allow more authors to publish, Science has a one-page "brevia" format (which Rao et al. used) that allows essential conclusions to be presented on that printed page, while the substance of the paper is in supplementary material online. Rao et al. can argue, correctly, that they hid nothing in their full paper (including the supplementary material); but obviously what was shown in the main "brevia" format was selected for maximum instantaneous visual impact. And they are not the only ones to do this. I'd argue that formats like "brevia" are designed to encourage this sort of thing, and the blame goes to journals like Science. It is annoying, but to compare it with the Hauser fraud is odious.

Sproat's response doesn't improve in the subsequent pages. He distinguishes between his preferred "deductive" way of interpreting data and the "inductive" approach preferred by Rao et al; he complains that they did not clarify this in their original paper (though I would have thought the language was clear enough, that they nowhere claimed to be "deducing" anything, only offering "evidence"); he nitpicks (as I would have expected) with the Bayesian arguments. Overall, for all his combativeness, he is notably vaguer in his assertions than previously. He ends on this petulant note:

I end by noting that Rao et al., in particular, might feel grateful that they were given an opportunity to respond in this forum. My colleagues and I were not so lucky: when we wrote a letter to Science outlining our objections to the original paper, the magazine refused to publish our letter, citing “space limitations”. Fortunately Computational Linguistics is still open for the exchange of critical discussion.

The openness of CL is to be applauded, but I can think of some additional explanations for why Computational Linguistics allowed the response while Science did not. One is that the Science paper by Rao et al. was not a vicious personal attack on another set of researchers, and as such, did not merit a "rejoinder" unless it could be shown that the paper was wrong. Another may have been the quality of Rao et al's response on this occasion (Sproat could, if he liked, offer us a basis for comparison by linking his rejected letter to Science) [update 17/09/10: here].

I don't expect this exchange in a scholarly journal to end the argument, but perhaps the participants can take a break now.

(*) UPDATE 17/09/2010: Rajesh Rao writes:

By the way, the reason that Fortran was included in Fig 1B rather than 1A is quite mundane: a reviewer asked us to compare DNA, proteins, and Fortran, and we included these in a separate plot in the revised version of the paper. Just to prove we didn't have any nefarious designs, I have attached a plot that Nisha Yadav created that includes Fortran in the Science Fig 1A plot. The result is what we expect.

The plot is below (click to enlarge); the blue squares are the Fortran symbols.

Rajesh also remarks that the Bayesian posterior probability estimates -- that I derived from the bigram graph in the Science paper -- can probably be sharpened from the newer block entropy results. However, since Sproat makes it clear that he is only addressing the Science paper and is unwilling to let later work influence his perception, I think it's worth pointing out that the data in the Science paper are already rather convincing.

16 comments:

This is a case of not referring to figures. In the analyses made, a simple point is missed: the figures used on inscriptions. While it is okay to statistically analyse 'signs', it is also necessary to understand that many pictorial motifs, or what Mahadevan calls 'field symbols', occupy a dominant space on thousands of inscribed objects. The motifs + signs are part of the message.

Read Indus Script Cipher, the new book on the block. http://www.amazon.com/Indus-Script-Cipher-Hieroglyphs-linguistic/dp/0982897103/ref=sr_1_1?ie=UTF8&s=books&qid=1284721027&sr=8-1#reader_0982897103

Kalyanaraman - I don't think anyone is "missing" the importance of the figures. At the moment, they are not trying to decipher the text, only to determine its structure.

Richard -- thanks for the corrections: I have amended the former claim, and linked your letter.

You say "But I do agree with one thing here: it's time to give this a break."

and, you know, it had a break for well over a year. There were further publications by Rao et al, but no major fuss as far as I can tell. So, guess who stirred it up again -- and that without caring to analyse the more recent papers?

I wouldn't have done this at all if it had not been for the Lee et al. paper. When Lee et al.'s paper appeared, I started getting worried we were going to see a series of papers using entropic calculations to make some point or other about various random symbols, and I felt it was time to point out the issues with those approaches. So blame the Royal Society.

As for not considering the more recent papers. You want me to do something on those too? I could...

But as I said, I believe that when Rao and colleagues published their original paper in Science they intended it to stand on its own merits. As such, it has to be able to stand on its own merits. And I don't think it does.

It is indeed unfair to mention Marc Hauser in the rebuttal. At a subliminal level, it would surely influence the reader to think that there is some sort of scientific misconduct here. This is perhaps scientific-rivalry's equivalent of hitting below the belt.

Stating the obvious: the review process in any peer-reviewed journal typically involves a few reviewers. Once a manuscript passes that stage and is published, it is subject to further examination by the scientific community (a set much larger than the set of reviewers), and it is only natural that researchers who did not review the particular paper will ask questions. Peer review is not without fault: the paper may go to a reviewer who holds a view that is (a) contrary to what is claimed in the paper, (b) similar, or (c) neutral to the subject, and a sloppy choice may send the paper to reviewers who are not neutral to the issue at hand. Considering that reviewers are anonymous, is circumstantial evidence sufficient to claim that no computational linguists were among the reviewers of Rao et al.'s original paper? It may be true of one paper. But subsequently there has also been a publication in PNAS. What is the likelihood that PNAS too did not include computational linguists among the reviewers? Or is it just that they did not include reviewers who held a particular view about the Indus script?

In any case, it is natural that once a paper is published several questions are asked, thus requiring further clarifications about the work, especially considering that the original Science paper presents the views of authors who you wouldn't typically bump into (outsiders?) at meetings of computational linguists.

On a lighter note, the whole issue reminds one of the Monty Hall problem. Neither of the two parties can claim to be Marilyn vos Savant till we know beyond doubt what the symbols really stand for. Just saying- conditional probabilities cause much confusion. :-)

I'll note that Hauser has not so far been formally charged with anything.

As for what I meant by that example: I figured people would know about the Hauser case, since it's been in the news so much, and I merely wanted to point to a thus familiar example to illustrate, starkly, what is at issue here by considering the need to evaluate a paper on its own merits.

Of course people will make all sorts of guesses about my real intentions, and there is nothing I can do to stop that process.

The allegation against Marc Hauser is one of scientific misconduct. His fall from grace is what most of the scientific community is familiar with. While he has not been charged with misconduct, internal investigations have shown, quoting a news report in UK's 'The Telegraph':

"Eventually, the university confirmed – in a letter emailed by Michael Smith, the dean of arts and sciences – that 'Professor Marc Hauser was found solely responsible, after a thorough investigation by a faculty investigating committee, for eight instances of scientific misconduct.' Smith added that Harvard has moved to 'impose appropriate sanctions', without saying precisely what that meant."

As an example of scientific misconduct, Hauser's case stands irrespective of whether the conclusions were right or wrong. I understand that Hauser's case was cited because of its familiarity to the linguistics and cognition community.

The Office of Research Integrity clearly excludes honest errors and differences of opinion from its definition of scientific misconduct.

Judging conduct or ethics is different from judging the merit of a paper (or a theory). The main point of the discussion here is the merit of the arguments presented in Rao, et al.'s papers. I believe that it is inappropriate to cite an example of scientific misconduct to make a point about the merit of someone's work.

It may appear that we are splitting hairs about parts of your opinion piece that do not contain scientific argument. But I feel, the allusion to Marc Hauser is a distraction that is important enough to affect the common perception of Rao, et al.'s work.

Your objection has been noted. On Hauser's case: we shall see what transpires. I am well aware of the allegations, but as you note, that's not the same as a formal charge.

It seems in any case that opinions of who is right and wrong here, and who is making fair or unfair arguments, have largely already been formed. At least I am not aware of anyone having been persuaded to shift sides.

karatalaamalaka: you are correct to question what Richard constantly harps on, that there were no computational linguists among Science's peer reviewers. How does he know that?

And while we are on the subject, the contentious FSW paper was published not in Science or Proc Roy Soc or PNAS or Computational Linguistics, but in the "Electronic Journal of Vedic Studies", whose editor since the journal's founding in 1995 is... Michael Witzel! (He even wrote an editorial in the same issue.)

If Farmer, Sproat and Witzel believed they had a convincing case, why didn't they try to publish it on neutral territory? Is it inappropriate for us to ask what sort of peer review FSW was subjected to? (Personally I don't care about that: I think post-publication influence is more important, and Sproat's citation of a journalist's comments above speaks for itself.)

If Sproat today thinks he has a problem with the published methods of Rao et al and Lee et al, why not write a scholarly paper on it, rather than a "Last Words" which is essentially an opinion piece?

Richard: you say "You want me to do something on those [PNAS, PLoS ONE papers of Rao group] too?"

Picking up on the above comments: If you can put it through the usual peer-review process and publish it in the same places they did, yes, I'd be very interested. PLoS ONE in particular shouldn't be hard: they only look for correctness, not importance or impact on field etc.

I don't understand what you mean by Hauser being "formally charged". Do you mean by law enforcement authorities? That doesn't generally happen in science. In the worst case that I can think of, the Schön case, Bell Labs set up an enquiry committee, and based on its findings, dismissed the scientist and, over the next year, practically all his published papers were withdrawn. His scientific career was dead. But there were no legal charges against him.

Harvard has been much less transparent in Hauser's case, but it sounds like their inquiry, too, has found Hauser guilty of several counts of scientific misconduct. What happens next remains to be seen, but he's not going to jail or appearing in court, if that's what you mean.

I have submitted a grant application to collect a set of corpora of man-made non-linguistic symbol systems, investigate statistical methods that might help discriminate between them and linguistic systems, and come up with an estimate of how much weight those methods provide. In other words, in terms of the inductive approach, I want to be able to quantify, given some value V for some measure, P(V|L) versus P(V|NL).

In your original blog, Rahul, you discussed possible values for this, but this was all based on speculation. Rao et al. never even discussed this, except for presenting visually convincing plots. I made this point in my rejoinder to their reply.

If I get the grant --- I'll just have to wait and see if I get it --- then certainly I'll publish the results in a peer-reviewed venue.

I want to do something constructive. Surprised? Wish me luck.

But in any event, as I have argued all along, comparing a wide range of symbol systems (and a large set of linguistic data from various languages and genres) is the only proper way to do this. That is what I intend to do.

And who knows? Maybe I will come to the conclusion that after all the Indus symbols look more linguistic by one of those measures. Then one would have to consider that measure along with all the other kinds of evidence and see, on balance, what seems the more likely model.

But ultimately I don't care what the conclusion is. I am interested in seeing this done right.

Richard - certainly I wish you luck, and I agree that these likelihoods can only be quantified properly with a large enough corpus. In some private correspondence with Rajesh, I had suggested that P(D|S) (as defined above, for which I arbitrarily picked 0.9) could be quantified by measuring the bigram entropies (or block entropies) of a large number of languages, calculating the mean and the standard deviation, and asking what the probability is that a new language will be closer to the mean than the largest outlier in the current group. "English words" is their biggest outlier, both in the Science paper and in their rebuttal to you; eyeballing the figure in their rebuttal suggests that it is two or more standard deviations away, so, assuming a normal distribution(*), P(D|S) is probably well over 0.9, while -- from that figure, and assuming the block entropy of a generic sequence is uniformly distributed -- we should probably take P(D|NS) to be about 0.2, not 0.1 as I said above. The conclusions don't particularly change.

But of course, if you can get a large corpus of non-language sign sequences, P(D|NS) can be estimated in a similar manner to P(D|S). And one can use arbitrary statistical measures.

So I agree that we can give this a break now, and I wish you luck with your grant and look forward to the results.

(*) Of course a normal distribution is unlikely, but in the absence of better information on the actual distribution, a normal distribution is the best assumption one can make (cf. Jaynes).
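For what it's worth, under that normal assumption the "within x standard deviations" probability I am eyeballing is just erf(x/√2); a quick sketch (the 2-sigma figure is my rough reading of the plot, not a measured value):

```python
import math

def p_within(x):
    """P(|Z| < x) for a standard normal Z: the chance that a new
    language falls within x standard deviations of the mean entropy."""
    return math.erf(x / math.sqrt(2))

# If the biggest outlier sits roughly 2 sigma out, then P(D|S) is at least:
print(round(p_within(2.0), 3))  # -> 0.954
```

That is comfortably "well over 0.9", which is all the argument above needs.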

Based on the above analysis, if we start with a prior belief P(S) = 0.5, then we obtain a posterior probability P(S | D) of about 0.9. If one starts off being highly skeptical, with say P(S) = 0.2, one still obtains a P(S | D) of about 0.7. These values are similar to Rahul's estimates from our Science paper plot, but are based on block entropies for block size 6 rather than bigrams.

I agree with Rahul that these estimates could be further refined using a larger variety of linguistic and nonlinguistic systems.

So I welcome your effort in this direction and wish you good luck on your grant application. I regard our work as only an initial step in the general research area of exploring statistical measures for characterizing linguistic and nonlinguistic systems. It would certainly be interesting to see how block entropy compares with other statistical measures.

I also agree with the general sentiment expressed here regarding the Indus script debate: it is time to give it a break.

For some reason, only half of Rajesh's comment made it above. His repeated attempts to post his entire comment show up in my inbox but not on the blog. Let me try pasting it for him.

Rajesh Rao's full comment read:

Richard: Regarding your comment "In your original blog Rahul, you discussed possible values for this, but this was all based on speculation. Rao et al never even discussed this, except for presenting visually convincing plots":

Our response in Computational Linguistics did include a paragraph exploring the implications of Rahul’s suggested method, using actual values from our data. We deleted it after the editor asked us to shorten our response; we included a reference to Rahul’s blog instead. The following is a quantitative analysis based on Rahul’s method that uses values estimated from our block entropy data.

Let S and NS denote the script and non-script hypotheses for a given data sequence, and let D denote the property that the data sequence’s block entropy falls in the same band as known linguistic systems (see Figure). If one knows nothing about the origin of the sequence, and assumes that its block entropy can fall anywhere between the Max Ent and Min Ent limits, then P(D | NS) is the chance the sequence will fall in the narrow band occupied by languages in the figure (for block size 6), which can be calculated to be about 0.1. To estimate P(D | S), one could, for example, consider the distribution of linguistic systems in the figure for block size 6, approximate it as a Gaussian, and calculate how many standard deviations x away from the mean the current biggest outlier is (for the data in the figure, this is “Sumer” (not “English words” as in one of Rahul’s replies)). We could then use for P(D | S) the probability that a new linguistic system will lie less than x standard deviations away. For the data in the block entropy figure, this procedure indicates that P(D | S) is approximately 0.93.

[from here on Rajesh's comment made it above, but I repeat it for continuity - RS]

Based on the above analysis, if we start with a prior belief P(S) = 0.5, then we obtain a posterior probability P(S | D) of about 0.9. If one starts off being highly skeptical, with say P(S) = 0.2, one still obtains a P(S | D) of about 0.7. These values are similar to Rahul's estimates from our Science paper plot, but are based on block entropies for block size 6 rather than bigrams.

I agree with Rahul that these estimates could be further refined using a larger variety of linguistic and nonlinguistic systems.

So I welcome your effort in this direction and wish you good luck on your grant application. I regard our work as only an initial step in the general research area of exploring statistical measures for characterizing linguistic and nonlinguistic systems. It would certainly be interesting to see how block entropy compares with other statistical measures.

I also agree with the general sentiment expressed here regarding the Indus script debate: it is time to give it a break.

Rahul: "I don't think anyone is 'missing' the importance of the figures. At the moment, they are not trying to decipher the text, only to determine its structure."

My comment: That is precisely the point, Rahul. The structure of the writing system is dominated by 'figures'. They should be reckoned in while figuring out, mathematically, a 'structure'. Maybe both 'figures' and 'signs' are part of the structure. Why not?

Add my good wishes too for Richard's grant application. Reference to Hauser was a blow below the belt.