UNCERTAINTY IN RECONSTRUCTION.

Last month I mentioned Piotr Gąsiorowski’s series of posts on the origin of language; the latest post is so interesting I’m calling particular attention to it. Here’s the core of his conclusion:

The PIE reconstruction is a monumental intellectual achievement, and yet it isn’t “a language” that could be ascribed to any single speech community at any time. It’s a large set of coalescent reconstructions distributed in time and possibly in space as well. Other protolanguages, even relatively uncontroversial ones, are usually still more nebulous. If we ever manage to prove that the IE languages are related to some other established family, the reconstructed features of the common ancestor will naturally be even harder to constrain, and the protolanguage itself more elusive and fragmentary. It is hard to predict how far back in time our best reconstructive methods can take us before the notion of “protolanguage” becomes too vague to be meaningful. We can only resolve this question empirically, by putting our methods to extreme tests. If we consistently fail, it may mean that we have already reached the limit.

But if you’re interested in this stuff at all, you’ll want to read the whole thing. (That, by the way, is one of the main reasons I bailed out of historical linguistics; the uncertainty became too much for me.)

Comments

The trouble with historical linguistics (he said, speaking from profound ignorance) is that it’s sometimes pretty hard to imagine what sort of evidence might ever exist with which to test a hypothesis.
I suppose if a particular hypothesis is based on proposed patterns of migration then modern genetics might shed light eventually. But I assume that many hypotheses are essentially untestable and might always be so. In which case the whole thing degenerates into a game, and a pretty silly game at that.
Sometimes an intellectual adventure doesn’t work, or works up to a point and then reaches an impasse. But by then it will have accreted many vested interests.

If I remember correctly, Meillet compared the graphic symbols used to represent the phonemes of PIE to algebraic letters used to represent dummy numbers, emphasizing that the PIE symbols were nothing more than a shorthand for sound correspondences among attested Indo-European languages.

You can test the models and predictions of linguistic reconstruction if new evidence comes to light. For example, the discovery of new archaic IE languages (Mycenaean Greek) and even whole branches (Anatolian, Tocharian) in the course of the 20th century led to the falsification of some features of earlier models and spectacularly confirmed others (e.g. the laryngeal theory). Meillet was too pessimistic about the possibility of reconstructing phonetic details, in my opinion. Quite a lot of subphonemic detail can be recovered. We can’t confirm the reconstruction directly, since there are no material records of languages such as PIE, but neither can you prove “directly” that birds are descended from dinosaurs. It’s all informed guesswork and circumstantial evidence, but there’s enough of it from several independent quarters to amount to a very solid hypothesis, practically “a fact”.

You can also use the method commonly employed for training neural networks: segregate your data into two sets, use the “training set” to develop your hypotheses, and the remainder to test it. Of course, it helps to be a little bit split-brained.
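To make the hold-out idea concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the word pairs are invented (they belong to no real language), the words are taken to be already aligned segment by segment, and the function names are hypothetical. It learns segment correspondences from the "training" pairs only and then checks how many held-out pairs those correspondences predict.

```python
import random
from collections import Counter, defaultdict

# Invented toy data: (word in language A, supposed cognate in language B).
# Words are assumed to be pre-aligned, one segment per character.
TOY_PAIRS = [
    ("pata", "fasa"), ("kapa", "hafa"), ("taka", "saha"), ("piti", "fisi"),
    ("kuta", "husa"), ("tapu", "safu"), ("kiki", "hihi"), ("puka", "fuha"),
    ("tiki", "sihi"), ("kupi", "hufi"),
]

def learn_correspondences(pairs):
    """Map each segment of A to its most frequent counterpart in B."""
    counts = defaultdict(Counter)
    for a_word, b_word in pairs:
        for a_seg, b_seg in zip(a_word, b_word):
            counts[a_seg][b_seg] += 1
    return {a: c.most_common(1)[0][0] for a, c in counts.items()}

def predict(word, rules):
    """Apply the learned correspondences; unseen segments become '?'."""
    return "".join(rules.get(seg, "?") for seg in word)

def holdout_accuracy(pairs, test_fraction=0.3, seed=0):
    """Learn from the training part only, then score the held-out part."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    rules = learn_correspondences(train)
    hits = sum(predict(a, rules) == b for a, b in test)
    return hits / len(test)

if __name__ == "__main__":
    print(holdout_accuracy(TOY_PAIRS))
```

With only a handful of pairs the held-out score is very noisy, and a segment that happens to occur only in the held-out words cannot be predicted at all.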

Bill W: Meillet compared the graphic symbols used to represent the phonemes of PIE to algebraic letters used to represent dummy numbers, emphasizing that the PIE symbols were nothing more than a shorthand for sound correspondences among attested Indo-European languages.
At an early stage of reconstruction, you may have to use “dummy symbols” (eg “T” for the set of dental consonants, or “H” for the probable PIE laryngeals), but as the work progresses you should be able to get closer to more precise characterizations, just as you eventually find the values of x in the context of known values of a, b, etc in an equation. Shorthand symbols are not totally arbitrary either, and they are convertible into ordinary letters or sequences.
JC: segregate your data into two sets, use the “training set” to develop your hypotheses, and the remainder to test it.
This sounds like a good method if you have tons of data (eg comprehensive dictionaries), but if you only know a few dozen words of each language (not chosen for potential cognacy), or even a few hundred, you need to use them all.
I am not sure what you mean by “split-brained” in this context. I wonder if I am not a little bit that way myself.

I really hate the design of your website, hat.
I’ve been dealing with reconstructions of Early Chinese (for several meanings of “early”) for decades, and the field is as much of a mess as ever. At one time I owned something like 7 reconstructions of the language of 800 BC–200 BC (already too great a range). Each used a different notation, so you couldn’t tell how they were the same and how different without getting deep into the linguistics, whereas I just wanted a usable handbook.
Schuessler’s recent works have given a usable compromise version and have corrected some known errors in Karlgren, but I have no confidence that he’s solved the problem, and he introduced new problems with his reconstructed morphology.
Besides the question of dialects (which are hardly discussed) and time change 800-200, literary languages tend to be koines and also tend to use archaic forms.

I am not sure what you mean by “split-brained” in this context. I wonder if I am not a little bit that way myself.
Perhaps not the best-chosen word; I mean that you have to pretend that you don’t know what you know about anything outside the “training set”, and work on the basis of that alone. Then you look at your reconstruction and see how it fits with the rest of the evidence. That requires — a single eye? Judicial detachment? I’m not sure what term to use.
Of course, this is all being done in the context of justification (showing that your reconstruction is sound) rather than the context of discovery (actually devising the reconstruction).

I really hate the design of your website, hat.
Well, we know you’re no spammer, John; you don’t have to prove it to us.

The trouble with historical linguistics (he said, speaking from profound ignorance) is that it’s sometimes pretty hard to imagine what sort of evidence might ever exist with which to test a hypothesis.

I’m a phylogeneticist in biology. Anything can evolve any which way, so phylogenetic hypotheses ( = phylogenetic trees) are not strictly falsifiable. What do we do?
We use the principle of parsimony, Ockham’s good old razor, the other part of the scientific method. (And, actually, there’s always a lot of parsimony hidden in falsification.) We, or anyway our computer programs, count the steps (ideally the mutations) that our trees assume, and then we choose the hypothesis that requires the fewest assumptions.
The comparative method does the same, but usually it’s applied in a qualitative, eyeballing way; few historical linguists explicitly count their assumptions.

Note that this does not require that all mutations be equally probable or equally likely to survive a few rounds of selection. All the more sophisticated methods use models of evolution (the parameters of which are derived from the data!) because “nothing happens” isn’t a parsimonious assumption and because back-and-forth mutations happen.
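For readers who have never seen step-counting done explicitly, here is a minimal sketch of it using Fitch's algorithm for a rooted binary tree. The four taxa, the three characters and the two candidate trees are invented for illustration, and every change is given the same weight, which is a simplifying assumption; the model-based methods mentioned above are not implemented here.

```python
def fitch(tree, states):
    """Return (candidate state set, minimum number of changes) for a subtree.

    tree:   a leaf name (str) or a pair (left, right) of subtrees
    states: dict mapping each leaf name to its observed character state
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The two subtrees can agree on a state: no extra change needed.
        return common, left_cost + right_cost
    # No shared state: one change must have happened on this level.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, characters):
    """Total number of changes a tree requires over all characters."""
    return sum(fitch(tree, ch)[1] for ch in characters)

# Invented data: three binary characters scored for four taxa.
CHARACTERS = [
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 1, "B": 0, "C": 1, "D": 0},
]
TREE_1 = (("A", "B"), ("C", "D"))
TREE_2 = (("A", "C"), ("B", "D"))

if __name__ == "__main__":
    print(parsimony_score(TREE_1, CHARACTERS))  # 4 changes
    print(parsimony_score(TREE_2, CHARACTERS))  # 5 changes
```

On this toy data the first tree needs four changes and the second five, so parsimony prefers the first. The third character conflicts with it, but it is outvoted by the other two.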

More vaguely related to the topic, last week I read much of this paper which presents evidence that the Chukotko-Kamchatkan family is the closest known relative of the “isolate” Nivkh ( = Gilyak). The author is Michael Fortescue, who has previously connected Uralic and Eskimo-Aleut to each other and to Yukagir and Chukotko-Kamchatkan. One step closer to Proto-Nostratic? 🙂
Directly on topic, I can’t comment on Blogspot blogs that have “Name/URL” switched off, so I have to do it here:

the earliest amniotes split almost immediately into the “mammalian” and the “reptilian” branches

By definition the descendants of Bob the Basal Amniote immediately split into those two branches. The name Amniota has a node-based definition.

JC: I mean that you have to pretend that you don’t know what you know about anything outside the “training set”, and work on the basis of that alone. Then you look at your reconstruction and see how it fits with the rest of the evidence.
That certainly fits in with the way I try to work.

David: Too bad that Fortescue paper costs so much to access. I can’t even read more than a few words of the abstract before it gets hidden. I have Fortescue’s book Language relations across the Bering Strait (1998), which is very interesting and challenging but which sometimes jumps to conclusions without enough evidence. He also gives quite a number of correspondences in the Chukotko-Kamchatkan family (with resemblant data from other families), many of which I find doubtful. He also presents Proto-CK reconstructions done by a Russian linguist, without giving enough evidence for them (but presumably that evidence was given in a separate work). Since that book is now 15 years old, presumably there has been progress, including revision of the comparative data, but I don’t know any more. The trouble with the Paleo-Siberian languages (CK, Yukaghir, etc) is that they have been in contact for so long that it must be very easy to confuse contact phenomena with actual genetic relationship (a problem with “Uralic”, for instance). But since I don’t know those languages myself, I have to base my opinion on the methodology and on some extraneous relevant information, not on any personal acquaintance with the data.

I just sent it in your general direction. 🙂
Institutional access to scientific journals is sold in bundles. An institution often can’t buy access to one without having to buy access to several from the same publisher. This is why the Museum für Naturkunde in Berlin has access to linguistic and medical journals. It’s also part of why the big science publishers (Elsevier, Wiley, Springer, Informa) make profits of 35 to 42 % of their revenue per year.

a problem with “Uralic”, for instance

…Do you doubt the monophyly of Uralic? Or did you mean Kortlandt’s “Indo-Uralic” or suchlike?

Are all those R clerks really RINOs who hate America and freedom and stuff?

Well, yes… yes, spambot who just copied & pasted a comment from somewhere else and then didn’t manage to put a URL in, most of the people who still remain in the Reptilian Party really do hate America and freedom and stuff (except if the stuff belongs to them, or if they think it might one day belong to them). They know they can’t win elections anymore without rigging the whole system, so now they try to rig the whole system. Thank you for asking. 😐

OK, I know, but it sits on a very short stem, anyway.

Point is that the stem, by definition, doesn’t belong to Amniota.
(And how large the amniote stem-group is is quite a controversial topic. If my former thesis supervisor and I are right, it’s indeed small, containing nothing known but Diadectomorpha. Westlothiana is an amphibian, Casineria is… something else, I’m afraid of scooping people. If most other people are right, it’s huge, containing in addition the lepospondyls, the seymouriamorphs, Solenodonsaurus, possibly Gephyrostegus and Bruktererpeton… and maybe even the anthracosaurs, but that’s probably wrong either way.)

@David: Point taken (and any formal definition of IE would be node-based too, by the way), but my original point was simply that the two primary sister clades of Amniota are of comparable size and diversity. The asymmetry is usually greater, both in biology and in linguistics, since there is no compelling reason to expect parallel diversification on the same scale. What we see more often in the case of large language families is something like the Austronesian situation: nine or so tiny groupings in Taiwan, with the humongous Malayo-Polynesian clade originating inside one of them. And then again: inside Malayo-Polynesian we have one huge “nuclear” clade with lots of smaller “basal” groups.

dearieme, Trond: Someone linked to this article on facebook and I read it from there. Most hatters will not learn much from it, and the rest will probably be confused.
The article announces “Eurasiatic” (a hypothetical supergroup originally suggested by Joseph Greenberg of “Amerind” notoriety) as a major discovery by scientists (not linguists) but seems somewhat confused as to its relation with PIE. It also quotes a few words which the authors interpret as culturally too important to have been lost or replaced, so that they have lasted 15,000 years (one of these words is “bark” (of a tree) for which the authors propose an explanation). Like Renfrew and the Proto-World people, the authors do not seem to differentiate between the survival of a lexical item (although made unrecognizable through millennia of phonological changes) and the survival of the sounds that compose it (which are independent of the meaning of the whole word) (eg Renfrew et al thought that words for ‘nephew’ had endured almost unchanged for hundreds of years because of the importance of this concept, but actually this longevity is due to the fact that the consonants in the word had been more resistant to change than others, as shown by the behaviour of those consonants in other words totally unrelated semantically to ‘nephew’). In any case, the article does not cite any actual forms, only meanings. Read it at your own risk.

Re: Pagel, Atkinson & al. 2013. This is exactly the kind of approach which makes wishful thinking look like science and gets it past reviewers. Even if the numerical methods are basically sound, the data are garbage (obtained by the intuitive eyeballing of reconstructions from the Tower of Babel database — itself a highly questionable source — without any actual comparative analysis). No different from ordinary “mass comparison”, except perhaps for a tighter control of semantic matches.

my original point was simply that the two primary sister clades of Amniota are of comparable size and diversity. The asymmetry is usually greater, both in biology and in linguistics

Oh, yes, of course.

I’m already looking forward to the show.

I have access to PNAS in the museum, so I’ll contribute to the show!
In the meantime, do keep in mind that – as a rule – science journalists have no clue what they’re writing about; nothing they write can be taken at anywhere near face value.

as a major discovery by scientists (not linguists)

…I’ve noticed that “science” is often taken to mean just “natural sciences” in English. I find that disturbing.

The Tower of Babel project assumes the validity of long-range groupings (such as Nostratic, and larger) in advance, and their reconstructions are often tendentious. I can’t judge the quality of their Dravidian or Kartvelian stuff (other than by comparing them with the reference works I trust), but the IE reconstruction is basically Walde-Pokorny, simply unacceptable by modern standards for a variety of reasons. I realise that one has to start somewhere, but mass comparison is at best of some help in formulating preliminary hypotheses, and mass comparison based on unreliable data is no use at all.

scientists (not linguists)
David: …I’ve noticed that “science” is often taken to mean just “natural sciences” in English. I find that disturbing.
Well, I wanted to stress that (most if not quite all of) the people involved in the project were not linguists although they had other kinds of advanced qualifications. What would you suggest as an alternative?

@Marie-Lucie: It takes an interdisciplinary crew to smuggle such stuff into serious journals. You have to razzle-dazzle “the scientists” with a tantalising linguistic analysis, and to intimidate “the humanists” with the maths.

PG: You have to razzle-dazzle “the scientists” with a tantalising linguistic analysis, and to intimidate “the humanists” with the maths.
Excellent summary. Can I quote it?
I don’t know anything about their maths and statistics and therefore find the paragraphs about those things unreadable, and I can see that people equally ignorant of historical linguistics could be “razzle-dazzled” by the apparent expertise of the authors.

I’ve read half of the paper now. The only big flaw I can find is the headline: the paper does not try to establish the existence of a Eurasiatic clade, it takes it for granted. It proposes a dated phylogenetic tree for the intrarelationships of Eurasiatic; I’ll explain how it does that later (and I have yet to find out how they rooted it).
However, “garbage in, garbage out” applies to everything ever. The Starling database should clearly be vetted by more people; using Pokorný’s PIE reconstruction is a decidedly odd thing to do (maybe it was the best available when Vladimir Illich-Svitych started his work, and the rest is inertia, I don’t know). The paper itself wasn’t peer-reviewed either; it’s a “direct submission” with a prearranged editor, and the extremely short acknowledgments only mention funding sources, no reviewers. No wonder the map (fig. 1) is such facepalm-inducingly sloppy work.
On the other hand, it’s important to remember that the cognacy judgments in the Starling database are not mass comparison! Those people do propose a large set of regular sound correspondences and use them to test their cognacy judgments before they upload them. It’s possible that their morphological analyses need more work, it’s possible that the input data is partially garbage, but it’s not mass comparison.

Well, I wanted to stress that (most if not quite all of) the people involved in the project were not linguists although they had other kinds of advanced qualifications. What would you suggest as an alternative?

Just say what they are: biologists, psychologists…
My point is that the scientific method is the scientific method. It doesn’t differ between “hard” and “soft” sciences, between “natural” and “social” sciences or whatever. To imply that linguistics (for instance) is a field where hypotheses aren’t testable even against parsimony, or where the occasional untestable hypothesis is OK, disturbs me. I’ve read my sister’s thesis on comparative literature (vergleichende Literaturwissenschaft), and it tests hypotheses all the time.

It takes an interdisciplinary crew to smuggle such stuff into serious journals.

That would be true if the paper had been peer-reviewed. But this way, the authors only needed to convince Colin Renfrew.

David: To imply that linguistics (for instance) is a field where hypotheses aren’t testable even against parsimony, or where the occasional untestable hypothesis is OK, disturbs me
I see why this implication would disturb you, but I certainly don’t mean to imply that about linguistics. In my own work I think I do apply scientific principles of hypothesis, testing, supportive evidence and standards of acceptability, as opposed to some others’ opinion that “if it is not immediately obvious, it is forever unknowable”, or words to that effect.
What source would you recommend for PIE data?

I have read the paper (not very carefully yet, I admit), and I am appalled by the quality of the linguistic input. My impression is that the judgement on what is a “match” is based entirely on the ToB list of Nostratic etyma. See this entry: http://starling.rinet.ru/cgi-bin/response.cgi?single=1&basename=/data/nostr/nostret&text_number=1582&root=config
It gives the impression that the “ti”-type 2sg. pronoun has valid “cognates” in all the seven “Eurasiatic” branches in question, which is nonsense. For example, Dravidian has only a set of /n/-pronouns in the second person. Using a questionable verb ending instead is kind of cheating. It’s only one example of the tendentiousness of the database. I have to run now, so I can’t write about it at length. I’ll be back later.

Re: Eurasian ‘thou’ (continued)
The Kartvelian 2sg. pronoun is *sen (*tkwen is plural!). The Proto-Eskimo non-clitic 2sg. pronoun should more properly be something like *ǝɫ-vǝn-t, where the final dental is supposed to match other “M-T” pronouns (especially if we agree that anything that contains a dental obstruent is a match). The Proto-Chukotko-Kamchatkan 2sg. pronoun is *kǝð (the t-initial forms are plural). Uralic t-forms (*tun, *tina) are OK. Even if we accept “Altaic” *si-(/*ci-, *ti-?) as part of the trans-Eurasian M-T system, we are left with something closer to 4/7 (with complications and controversial details) than 7/7.
There are of course problems with the content words cited in the article as well, including ‘bark’ (I assume they mean the *ḳerV- set of equations, since no other ‘bark’ etymon in the ToB yields four trans-family “cognates”). Perhaps I’ll cover that on my blog one day as an example of sloppy comparative practice. The only really good match there is between IE/Uralic, but since the IE forms are quite transparently deverbal (from *(s)ker-, *(s)kert- ‘cut’), ‘skin, bark’ can hardly date back to the Palaeolithic, and the Uralic cognates may actually be loanwords (with IE-looking vowel alternations).

Historical phonology and etymology are dependent on each other and thus in constant interaction. The phonology of a proto-language is first reconstructed on the basis of the phonologically and semantically most transparent etymologies (and thus the most likely ones to be correct), a process in which some of the etymologies may have to be discarded; then, the validity of the semantically and phonologically non-transparent correspondences is reassessed and the emerging new etymologies are evaluated on the grounds of the proposed sound laws, the new cases possibly allowing the correction of sound laws or the postulation of new ones.

This is in fact an iterative process exactly like the one that goes on between publishers and lexicographers of a standard language: each group obtains the orthographies used/mentioned in their works by examining the works of the other group.
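The back-and-forth between sound laws and etymologies can be caricatured as a loop that runs until nothing changes. The sketch below is only a toy under loud assumptions: the word pairs are invented, words are pre-aligned one segment per character, a "sound law" is reduced to a segment-to-segment correspondence, and a candidate etymology is accepted only if it fits the current laws while postulating at most one new correspondence. The function names and the `max_new` threshold are hypothetical.

```python
def derive_rules(etymologies):
    """Collect the segment correspondences attested in accepted etymologies."""
    rules = {}
    for a_word, b_word in etymologies:
        for a_seg, b_seg in zip(a_word, b_word):
            rules.setdefault(a_seg, set()).add(b_seg)
    return rules

def consistent(pair, rules, max_new=1):
    """True if the pair fits the known rules and needs at most max_new new ones."""
    a_word, b_word = pair
    if len(a_word) != len(b_word):
        return False
    new = {}
    for a_seg, b_seg in zip(a_word, b_word):
        if a_seg in rules:
            if b_seg not in rules[a_seg]:
                return False          # contradicts an established correspondence
        elif new.setdefault(a_seg, b_seg) != b_seg:
            return False              # the new correspondence is not even regular
    return len(new) <= max_new

def bootstrap(seed, candidates):
    """Alternate between deriving rules and re-assessing pending etymologies."""
    accepted, pending = list(seed), list(candidates)
    changed = True
    while changed:
        changed = False
        rules = derive_rules(accepted)
        for pair in pending:
            if consistent(pair, rules):
                accepted.append(pair)
                pending.remove(pair)
                changed = True
                break                 # re-derive the rules before going on
    return accepted, pending

# Invented toy data: one transparent seed etymology and four candidates.
SEED = [("pata", "fasa")]
CANDIDATES = [("kiki", "hihi"), ("kapa", "hafa"), ("tika", "siha"), ("pita", "fika")]

if __name__ == "__main__":
    accepted, rejected = bootstrap(SEED, CANDIDATES)
    print(accepted)
    print(rejected)
```

In this toy run the pair ("kiki", "hihi") is rejected on the first pass because it would need two new correspondences at once, and is accepted later, once ("kapa", "hafa") has established one of them. The pair ("pita", "fika") stays rejected because it contradicts an established correspondence.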

Trond, then you would love Native American languages, for which there is a lot of basic sorting out to do! (but unfortunately, for most of them the word supply is limited, and there are no new words being coined).

LH: one of the main reasons I bailed out of historical linguistics; the uncertainty became too much for me

The challenge, and the fun, of historical linguistics are the same as with dealing with any complex puzzle. You start with uncertainty, you try hypotheses, and little by little you try to build on positive results. When two or three kinds of independent hypotheses confirm each other, you are on your way to certainty. When hypotheses clash, you are on the wrong track with at least one of them, and should try something else. Sometimes what seems obvious at the beginning turns out to be spectacularly wrong (word similarity is often in that category), and something else which seemed very unlikely turns out to be a step in the right direction or even the key to the solution to (at least a piece of) the puzzle. But you can’t be dogmatic and assume that this or that is either possible or impossible.
Another problem is that many people expect instant, large-scale results, such as “proving Nostratic” and the like. Instead, we are more likely to spend a long time chipping away, like an archeologist in a trench, using a toothbrush to separate artifacts from the surrounding soil.

Unfortunately, as with other kinds of science, textbooks tell you the results (such as the “laws” of physics or Indo-European), not the ways those results were arrived at, much less the combination of grinding work, disappointments and excitement that led to those results.

More from Ante Aikio (who is now publishing under the name of Luobbal Sámmol Sámmol Ánte):

This small study illustrates how historical phonology and etymology depend on one another and are in constant interaction. In comparative work over a great time depth, such as between Hungarian and Proto-Uralic, an apparently circular problem is commonly encountered: on the one hand, we cannot know which etymologies are correct if we do not also know what the regular phonological developments of the language in question are; on the other hand, we cannot solve which developments are regular if we cannot decide which etymological comparisons to trust. However, there is a way out of this problem: hypotheses of historical phonology can be tested through etymological research, and vice versa. […]

In this paper three new sound laws regarding the development of consonant clusters in Hungarian were proposed, and each of them was corroborated by evidence from new Uralic etymologies for Hungarian words. This shows that our understanding of how Hungarian developed from Proto-Uralic can still be improved, as long as research is carried out in a theoretically and methodologically sound comparative framework. The new sound laws and etymologies proposed here could be detected by a rigorous application of the comparative method, with the underlying assumption that sound change operates in a regular manner. Undoubtedly, meticulous research in this direction by specialists in Hungarian historical linguistics could produce many more findings in the future.

Note that he discounts the generally accepted close relationship between Hungarian and Ob-Ugric, and so reconstructs Hungarian directly to Proto-Uralic.
