Thursday, December 11, 2014

In the latest issue of the International Journal of American Linguistics, Cecil Brown, Soren Wichmann, and David Beck announce a rather interesting finding: that Chitimacha [is] A Mesoamerican Language in the Lower Mississippi Valley. I don't know much about any of the languages involved, but insofar as I can judge it, it strikes me as quite convincing. They find 91 cognates between Chitimacha, a language of southern Louisiana, and Totozoquean, a language family of southern Mexico consisting of Totonacan and Mixe-Zoquean. Most of these cognates are very straightforward, with identical meanings and obviously similar, regularly corresponding sounds, and 36 of them involve words basic enough to be on the 100-word Swadesh or Leipzig-Jakarta lists. The grammatical similarities are rather less extensive, but there are a few. So, pending other specialists' comments, it looks like Chitimacha was brought to Louisiana by a migration across the Gulf of Mexico, from somewhere around the Isthmus area.

There is some useful shared cultural vocabulary, including "paper", "to write", "lime", "maize (corn)", "leached corn", and "to shell corn", and it looks like Caddo - spoken just upriver - in turn borrowed much of its maize-related vocabulary from Chitimacha. In combination with archeological evidence, this leads the authors to favour a migration date either some time around 850 AD, when the Caddo began low-level maize cultivation, or sometime around 1200-1450 AD, when they intensified it. Such a late date seems a little troubling, given how few cognates are to be found; Korandje separated from Songhay around 1200 AD, and there are well over 200 shared items there, mostly belonging to basic vocabulary. The ancestor of Chitimacha would have to have already been rather different from any other Totozoquean language even before they reached Louisiana; but then why did they apparently leave no trace in Mexico itself? Perhaps a study of southern Mexican place names could shed some light on the question.

This looks like historical linguistics at its best: a surprising long-distance connection affecting both language and culture. Now it's up to the historians and archeologists to fill in the gaps: why did southern Mexicans find it worth while to cross the Gulf to Louisiana in significant numbers?

Sunday, November 30, 2014

People tend to enter their first linguistics classes with a vague but strongly felt idea, instilled by English teachers or by society at large, that some ways of speaking are bad, illogical, sloppy, rule-breaking, etc. One of our first tasks is thus to explain to them that, actually, such ways of speaking are just as logical and law-governed as standard English; they're simply obeying a different set of rules. Not infrequently, we follow that up by telling them everything that's wrong with the prescriptive rules of Standard English, based ironically on a very similar set of tropes: they're illogical (stop splitting infinitives because you can't do that in Latin), they're historically inaccurate (don't use singular they even though the King James Bible does), they're incompatible with the rules of modern spoken English (eg "it is I") to the point of confusing them into gross solecisms ("they gave it to John and I"). Unless we're careful, the students end up walking away from all that with the impression that linguists think prescriptivism is bad, full stop. That, however, would be a mistake. As irritating as these problems and misconceptions are, they don't affect the case for having a prescriptive standard language - just the extent of its ambitions and the details of its usage.

Prescriptivism, of course, is all about power: who gets to talk how where, and who gets to say how they should talk. As good libertarians, our first reflex might be to say that this is all unnecessary: let everyone decide for themselves! That has two different problems. The first is that, when people decide for themselves, what they end up with is in fact a set of implicit rules for what's appropriate in which circumstances, and if you want to make life easier for visitors from other cultures, the least you can do is make those rules explicit somewhere. The other is that, in the event of any clashes, it's the more powerful individual that gets to decide, which is a particular problem in the case of public services. You want a driver's license, and you only speak English? Sorry, our local transport officials aren't really comfortable with English, so you'd better brush up on your Russian.

The latter example may sound like fantasy to American or English readers (not so much to the Irish or Welsh), but it's rather close to reality in a lot of the world. If you understand Arabic, have a look at this video of Moncef Marzouki, one of the two current presidential candidates in Tunisia, having a go at his Tunisian interviewer for using too many French words:
"Respect the Arabic language! Plutôt, what does plutôt mean? You say plutôt, what's that? My sister in Douz won't understand plutôt. [...] [Interviewer: It's a chance for her to learn...] No, she needn't learn - you learn the language of Tunisians!"

It's populism, of course - but, like a lot of populism, it makes a good point. Why the heck should the average citizen have to speak a foreign language to deal with officials and other elites in his/her own country? (Especially in one as close to monolingual as Tunisia?) In such a situation, if the populace doesn't prescriptively impose their language preferences through concerted action, the bureaucracy will simply impose their own in one-to-one interactions.

Thursday, November 27, 2014

Kamal Nait-Zerrad's 2001 article "Esquisse d'une classification linguistique des parlers berbères" presents a good deal of useful data, but does so in a manner that I find makes it rather difficult to figure out what's going on without plenty of pencil work. In case anyone else has the same experience, here's my take on it. I will not focus on, or even necessarily present, his interpretation here - read the article for that; rather, I'm more interested in figuring out the implications of the data he presents in the light of other work before and since, and in the light of accepted principles of historical-comparative linguistics.

First, he looks at a number of morphological and phonetic isoglosses:

1. The 3rd person singular preterite of CC verbs: yərra vs. yərru. Following Kossmann (2001), we now know that these are actually CC+glottal stop, so the data exemplifies two different sound changes: the relatively trivial *-aʔ > -a, and the more surprising *-aʔ > o > u. The former is the commonest outcome; the latter is exemplified by: Ait Seghrouchen, Figuig, Beni Snous, Bissa, Timimoun, Mzab, Ouargla, Nefusa. (Ghadames still has o).

2. The proximal demonstrative suffix: -a vs. -u. Again, -a is the default, but -u appears in the same set of varieties as seen in 1, plus one more: Iznasen.

3. The 3rd person singular aorist of CCV verbs: ad yəbḍu vs. ad yəbḍa. Here, -u is the default, and is closer to the original, while -a has spread from the preterite. This applies to the same set of varieties as 2 (excluding Nefusi), plus several more: Rif, Metmata, Chaoui, Jerba.

4. Initial vowel dropping: a- vs. 0-. A number of *(t)a-CV-initial nouns drop the original vowel of the prefix in the same set of varieties as 3, plus Nefusi, Chenoua, and Siwa.

5. Velar softening: in many varieties, in many words, what would elsewhere be k/kk/g/gg corresponds to c/čč/j/ǧǧ. The latter outcome is observed in the same set of varieties as 4, minus Nefusi.

6. Final *-əv: this is retained as such in Ghadames and Awjila, and as contrastive length in Zenaga. Otherwise, it becomes -u in most varieties, but -i in the same varieties as listed in 4, plus El-Fogaha (with a few question marks where the author had insufficient data). Cf. Kossmann (1995).

All of 1-6 pick out Zenati varieties, but the exact set differs: 1-2 pick out a core Zenati consisting almost entirely of northern Saharan varieties, while 3-6 pick out a broader Zenati including the semi-arid mountainous lands stretching from the Rif to southern Tunisia, and vary in their inclusion of varieties further east (Nefusi, El-Fogaha, Siwi). Chaker (1972) cites 1-2 and 5 as possibly justifying a Zenati subgrouping, while Kossmann (1999) defines Zenati in terms of 3, 4, and one other morphological innovation, and then cites 5 and 6 as common phonological innovations.
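For anyone who would rather not do the pencil work by hand, the set relations in 1-6 can be sketched in a few lines of Python. This is only a bookkeeping aid under stated assumptions: the variety lists are copied from the descriptions above, "Nefusa" in 1 and "Nefusi" elsewhere are treated as the same variety, and the variable names are mine, not the article's.

```python
# Varieties showing the Zenati-type outcome for each isogloss (1-6),
# built up step by step as described in the text above.
# Assumption: "Nefusa" (item 1) and "Nefusi" (items 3-5) are the same variety.
s1 = {"Ait Seghrouchen", "Figuig", "Beni Snous", "Bissa",
      "Timimoun", "Mzab", "Ouargla", "Nefusi"}
s2 = s1 | {"Iznasen"}                                        # same as 1, plus Iznasen
s3 = (s2 - {"Nefusi"}) | {"Rif", "Metmata", "Chaoui", "Jerba"}
s4 = s3 | {"Nefusi", "Chenoua", "Siwa"}
s5 = s4 - {"Nefusi"}
s6 = s4 | {"El-Fogaha"}

isoglosses = {1: s1, 2: s2, 3: s3, 4: s4, 5: s5, 6: s6}

# Varieties that take part in every one of the six isoglosses.
core = set.intersection(*isoglosses.values())

# For each variety, the number of isoglosses (out of six) it takes part in.
all_varieties = set.union(*isoglosses.values())
counts = {v: sum(v in s for s in isoglosses.values()) for v in all_varieties}

for variety, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{n}/6  {variety}")
```

On this reading, seven varieties take part in all six isoglosses, while Nefusi joins only 1, 2, 4 and 6, and El-Fogaha only 6.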

7. Negative intensive theme: retention/loss. The negative intensive is retained in northwestern Morocco (Rif, Iznasen, Senhaja, Ait Seghrouchen, Figuig); in Bissa; in Tuareg and in the nearby oases of Mzab, Ouargla, and Ghadames; and in Jerba. Its loss everywhere else (according to his data, which should be re-checked) shows no prominent genetic patterning, and hence is probably relatively recent.

Then, he moves on to vocabulary, examining 11 lexical variables which I would summarise as follows:

Several forms appear specifically Zenati: irəḍ in the sense of "be dressed" (though it is more widespread in other senses), igur for "go", əɣs for "want", azəgrar for "long", anilti for "shepherd". Of these, El-Fogaha and Siwa share only əɣs for "want", whereas Nefusi shares all except "go in". adəf "go in" is Zenati-specific in the west, but more confusing in the east, being attested in Ghadames and (as an alternative to əggəz) in Air Tuareg.

A couple of forms unite Libyan varieties with Tuareg, contrasting with Algerian and Moroccan varieties, in defiance of any plausible genetic classification, reminding us that a tree does not tell the whole story here:

Apparently, to get this he operated by successively applying at each stage the criterion from his list that divided the data into the lowest number of groups possible, without attempting to distinguish innovations from retentions, much less judge the relative likelihood of independent innovation. The fact that even such a crude method was still able to produce a recognisable Zenati subgroup either says something about the robustness of this distinction or about the selection of features. What this data set actually tells us, bearing in mind that shared retentions have no implications for subgrouping and that Zenaga fails to participate in a number of innovations that otherwise seem pan-Berber or nearly pan-Berber, is something quite different:

There is definitely a Zenati subgroup, as has been known at least since Destaing (1915), but its boundaries are a bit fuzzy. (If this reminds you of the situation of "Hilalian" g-dialects, that's probably not a coincidence.)

Transitional (the High Plateau and its edges): Rif, Metmata, Chaoui, Jerba

Peripheral:

Chenoua (north-central Algeria)

Nefusi (northwestern Libya)

Eastern Zenati (Libya/Egypt): El-Fogaha, Siwa

There is definitely a Tuareg subgroup, as has always been known: Ahaggar, Iwellemmeden, Air, Taneslemt.

There just might be a subgroup combining Kabyle with Senhaja, Central Morocco and Shilha: they share the innovation *-əv > -u, and the word awtul "hare". The evidence for it is very weak, though, especially since *-əv > -u is also found in some Tuareg varieties.

The rest of the common features almost all look like shared retentions.
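To make concrete why the procedure criticised above is so crude, here is a minimal Python sketch of it as I understand it: at each stage, pick the unused criterion that splits the current group into the fewest groups, then recurse. This is my reconstruction, not the article's stated algorithm; the varieties A-D, the features f1-f3 and their values are invented placeholders, and ties are simply broken by list order.

```python
from collections import defaultdict

def split_by(varieties, feature, data):
    """Partition varieties according to their value for one feature."""
    groups = defaultdict(list)
    for v in varieties:
        groups[data[v][feature]].append(v)
    return list(groups.values())

def classify(varieties, features, data):
    """Greedy divisive classification: always apply the feature that
    yields the fewest groups (but more than one). No attempt is made
    to tell innovations from retentions, which is exactly the weakness."""
    candidates = []
    for f in features:
        parts = split_by(varieties, f, data)
        if len(parts) > 1:
            candidates.append((len(parts), f, parts))
    if not candidates:
        return sorted(varieties)          # nothing left to split on
    _, best, parts = min(candidates, key=lambda c: c[0])
    remaining = [f for f in features if f != best]
    return {best: [classify(p, remaining, data) for p in parts]}

# Invented toy data: four varieties, three features.
data = {
    "A": {"f1": "a", "f2": "x", "f3": "p"},
    "B": {"f1": "a", "f2": "x", "f3": "q"},
    "C": {"f1": "u", "f2": "y", "f3": "p"},
    "D": {"f1": "u", "f2": "z", "f3": "q"},
}
tree = classify(["A", "B", "C", "D"], ["f1", "f2", "f3"], data)
print(tree)
```

Note that the function treats every feature value alike: a group held together only by shared retentions is produced just as readily as one defined by a shared innovation.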

Diachronically, agreement commonly emerges from clitic doubling, which in turn derives from topic shift constructions (Givón 1976) – a grammaticalisation pathway termed the Agreement Cycle. For accusatives, at the intermediate stages of this development, doubling constitutes a form of Differential Object Marking, and passes towards agreement as the conditions for its use are relaxed to cover larger sections of the Definiteness and Animacy Scales. Berber, a subfamily of Afroasiatic spoken in North Africa, shows widespread dative doubling with substantial variation across languages in the conditioning factors, which in one case has developed into inflectional dative agreement. Examination of a corpus covering eighteen Berber varieties suggests that low Definiteness/Animacy datives are less likely to be doubled. However, since most datives are both definite and animate, these factors account for very little of the observed variation. Much more can be accounted for by an unexpected factor: the choice of verb. “Say” consistently shows much higher frequencies of doubling, usually nearly 100 per cent. This observation can be explained on the hypothesis that doubling derives from afterthoughts, not from topic dislocation.

Sunday, November 02, 2014

Today, just for fun, I'd like to invite you to discuss a topic a little off the beaten track for this blog: how much linguistics should a high school graduate know? The question may seem bizarre - there have been occasional efforts to introduce linguistics courses into high schools (MIT, Milwaukee), but you don't expect to see "linguistics" on a high school curriculum. Still, let's not get confused by labels. Linguistics is inextricably woven into language teaching, and even the most resolutely monolingual curriculum includes at least the school's own language. (I recently happened to come across an 8th grade final exam from 1895 from Kansas; no foreign languages were featured, but no less than two out of the six subjects tested, Grammar and Orthography, rely heavily on linguistic concepts.)

One useful way of separating linguistic education from language education is to look at universality. Some of what you learn in English class is useful across practically all languages, like the idea of a verb or of a vowel. Some of it is much more parochial; the fact that the plural of "child" is "children" is a historical accident relevant only to English and, at best, its closest relatives. Such parochial facts can be vital, of course; if you're going to grow up in an English-speaking country, you'd better be able to form your English irregular plurals correctly. But the more general concepts have a deeper interest; they help you analyse what you're saying, and make it easier to learn new languages. Unfortunately, those concepts are precisely the ones that have suffered most in recent decades. In the UK, at least, my own experience suggests that most high school graduates can't even reliably tell a noun from a verb. In theory, the latest changes to the English syllabus should change that - but given that many of the teachers were hardly taught any grammar either, one wonders how successful the reform will be.

In any case, if I were designing a syllabus, here is what I would suggest to start with. I'd be interested to see what other linguistically oriented people think:

Phonetics has never been a focus of early education, apart from the minimum necessary for teaching a child to read and write (and even that gets de-emphasised in some approaches). This is a shame, because the younger you are, the easier it is to learn to hear and pronounce unfamiliar sounds. Why not learn:
- The IPA, or at least the most commonly used symbols in it; be able to pronounce and recognise them. This should include tone if at all possible.
- Basic articulatory phonetics: how the configuration of your vocal organs relates to the sound produced, and how to use this knowledge to pronounce unfamiliar sounds. (If your language uses Devanagari, you should have an advantage, as this is practically built in to the alphabet anyway; students of tajweed too will come across this issue at some point.)
- Phonology: the concepts of the phoneme and of conditioned allophones. That way when you learn another language you'll at least know why some sounds give you so much more trouble than others.
- Metric structure: syllable, foot, etc. (Yes, I know the concept of syllable is controversial, but you'll need this to be able to study poetry anyway.)

Morphology is a lot more language-specific than the other topics here, but one should at least know:
- How to decompose a word into its component morphemes (prefixes, suffixes, templates, roots...), and guess its meaning from them if necessary.

Syntax: Unlike phonology, this has traditionally been deliberately taught, and you should certainly know:
- The parts of speech: noun, verb, adjective, preposition, etc... and how to tell them apart.
- Argument structure and case: subject, direct object, nominative, accusative, etc.
- How to break down a sentence into its phrase structure: what modifies what? What is a phrase, and what is its head? For best results, try being able to diagram it.

Unfortunately, it's not quite so simple: all three of those - especially the latter - are the subject of major controversies between different syntactic theories... (Two good Language Log posts on this issue: parts of speech and sentence diagramming.) If you teach whatever theory happens to be traditional where you're from, you may not make any friends in academia, and you risk perpetuating some old misconceptions; but you will certainly leave your students much better prepared to learn any more current theory - or any language - than if they had studied no grammar at all.

Historical linguistics and sociolinguistics: The language you speak most likely has relatives, and certainly contains words borrowed from other languages. You should understand:
- That there is normally variation inside a single language, which people often use to signal their social position and to identify the social position of others, and over which people's control is limited.
- That languages change over time as some variants become obsolete and others emerge, and in what ways they change - sound shift, semantic shift, borrowing, morphological and syntactic change...
- That different changes accumulating in different areas can split what used to be one language into several, and that people can abandon one language and start speaking another one instead.
- That sound shifts are usually regular, and that this regularity can be used to identify potential cognates (making it easier to learn languages related to ones you know.)

There should certainly also be some semantics and pragmatics in this list, but I'm not feeling especially inspired on either subject at the moment - any thoughts?

Thursday, October 30, 2014

I'm almost three-quarters of the way through Heath's Grammar of Tamashek (Tuareg of Mali). The main interest lies in its efforts to reduce the bewildering complexity of Tuareg morphology to some sort of order, an impossible task which it accomplishes more successfully than any other Tuareg grammar I've looked at so far. Aside from this, however, it's raised some interesting etymological issues.

I've wondered for years where the Korandjé verb wəy "gather (firewood)" comes from. It normally appears in the idiom a-wwəy-ts skudzi [3Sg-gather-hither wood] "she gathered in firewood". On p. 333 of Heath's grammar, I found the explanation, in the following example:

The Tamasheq verb in question, awəy in the imperative, is simply the normal Berber word for "take, bring" (which in Korandjé is expressed with a Songhay verb, zəw), so I would have hesitated to connect them based on a dictionary entry alone. But given this attested usage with "firewood", the semantic specialisation poses no problems. What does surprise me is that it was borrowed as a bare stem, rather than with a fossilised 3rd person prefix y/i - contrast yəf (Tashelhiyt y-arf "roast", not attested in Tamasheq), ikna "make" (Tamasheq i-kna). Usually, only stems that start with a syllabic onset are borrowed into Korandjé without the y/i.

Another probable loan into Korandjé that I noticed going through the grammar is Korandjé ləwləw "shine, gleam" - cp. Tamasheq m̀ələwləw "shine".

However, a number of words have gone the other way - from Songhay into Tuareg. Heath comments on many of these in his dictionary (eg kə̀rikəw "practice sorcery"), but not all. One that struck me is the verb ḍùkr-æt "become angry at", obviously related to Gao Songhay dukur "be angry"; I don't recall seeing this verb elsewhere in Berber (not even in Alojaly's dictionary of Tamajeq), whereas it's widespread in Songhay.

Obviously cognate are Tamasheq é-tæqq "male ostrich" and widespread Songhay forms such as Gao taatagey, Fulan Kirya taataɣey "ostrich" (the shift of g to ɣ next to non-high back vowels is regular in several Songhay varieties, and in Tamasheq qq is the geminate equivalent of ɣ). The word is generic in Songhay but specific in Tuareg - the opposite of what we saw with "bring" - which suggests to me that it was borrowed into the latter, as does the fact that I don't find the term in Alojaly's Tamajeq dictionary. However, since ostriches are extinct in most Berber-speaking areas, it's difficult to prove the direction of borrowing.

By and large, this appears very plausible, although it should be noted that Tunisian Berber and Zuwara are already somewhat peripheral to Zenati, not sharing western Zenati's innovative distribution of initial vowel dropping, and El-Fogaha is even more so than Siwa or Sokna. (As he notes, the much greater homogeneity and clearer boundaries of Zenati in the west imply that this group arrived in Algeria and Morocco from the east.) But, in principle, it is still necessary to identify specific innovations characteristic of each of these groups. It is also clear that the Zenaga block is by far the first split on the tree, and the list ought ideally to reflect that. But the moderately high degree of mutual intelligibility poses serious obstacles to applying the family tree model to Berber, as he discusses.

The most interesting Kabyle varieties for historical reconstruction are the little-known ones of the extreme east, "Tasahlit". As it happens, Abdelaziz Berkai has just uploaded his recent thesis, a dictionary and sketch grammar of the Tasahlit of Aokas: Essai d’élaboration d’un dictionnaire Tasaḥlit (parler d’Aokas)-français. The quality of his work appears excellent, and this will no doubt be a very useful resource. The choice of dialect, however, is not entirely ideal. It is clear from Basset's dialect atlas, and from the all too rare comments in Rabdi's grammar on neighbouring varieties, that the vocabulary of Aokas is still quite close to that of Bejaia; the really divergent varieties seem to be those of the Babor Mountains and Oued el Bared, approaching Jijel, and those are the ones most likely to give an insight into the dialect of the now largely Arabised Kutama.

Tuesday, October 21, 2014

Going through Brahim and Bekir Abdessalam's brief grammar of Tumzabt Berber (الوجيز في قواعد الكتابة والنحو الأمازيغية "المزابية": الجزء الأول) recently, I was struck by their discussion of the problem of subject-verb order. Berber in general allows both verb-subject and subject-verb order, with the case ("state") of the subject depending on which order is used. Determining which order is used under which circumstances, however, poses some difficulties; the same language may be described as VSO or SVO, depending on who you ask, and the determining factors certainly differ from one variety to another (cf. eg Mettouchi fc for Kabyle). Their take on the problem combines information structure with pragmatics and verbal mood. The latter two factors can very likely be reduced to information structure too, but that would require testing; in any case, the observation that VS order is required for serialization is interesting. Here's what they had to say, translated into English (pp. 129-130):

We observe that in the first set of examples, the subject precedes the verb; this is the usual form in an Amazigh clause consisting of a verb and a subject.

In the second set of examples, the subject follows the verb. This happens in the following cases:

The subject may follow the verb when it is specific and known to the speaker and listener because there is a connection between speaking of it and a previous expression involving speaking of the same subject. For instance:

twelleh! afunas-nni yetthaḍa - Watch out, that bull rampages.

After the two parties have parted, they meet again the next day, and one says to the other:
yak yhaḍ ufunas ay-tessečned asennaṭṭ! - Indeed that bull you showed me yesterday really did rampage!

Here, the subject - the bull - is specific for both parties to the conversation in the second usage, since it had been spoken of earlier.

For the sake of irony, which can only be deduced from the context surrounding this expression and from the circumstances of discourse, eg if we say:

They follow this up with an observation that seems quite astonishing from a comparative Berber perspective (p. 131):

A subject following the verb is put in the construct state if definite, this being the normal case for the postverbal subject, and is put in the free state if indefinite without any need for the [indefinite] article iggen / igget ["one"].

Sunday, September 14, 2014

Similarities between different languages are data. It's easy to come up with any of several wildly different measures of such similarities, typically by applying edit distances to wordlists (as in the ASJP*) or texts, but the result should not be mistaken for an analysis - it's just a measurement, a compression of the data. It doesn't tell you anything about the causes of these similarities on its own. Historical linguistics is not the measurement of similarities, but the effort to find the hypothesis about past events that best explains them. Your H0, of course, is always "coincidence". Once you've rejected that, you're left with the trickier task of disentangling contact from common ancestry - trickier because, quite often, they partially overlap.
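To show how little is involved in such a measurement, here is a minimal Python sketch of one: plain Levenshtein edit distance, normalised by word length and averaged over a wordlist. It is a simplification, not the ASJP's actual procedure (which works on transcriptions and applies a further normalisation); the function names are mine, and the two toy wordlists use ordinary English and German spellings purely for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def normalised_distance(a, b):
    """Edit distance divided by the length of the longer word (0 to 1)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def wordlist_distance(list_a, list_b):
    """Mean normalised distance over the meanings both lists contain."""
    shared = sorted(set(list_a) & set(list_b))
    if not shared:
        raise ValueError("the two wordlists share no meanings")
    total = sum(normalised_distance(list_a[m], list_b[m]) for m in shared)
    return total / len(shared)

# Toy wordlists keyed by meaning (spellings, not phonetic transcriptions).
english = {"night": "night", "water": "water", "hand": "hand"}
german = {"night": "nacht", "water": "wasser", "hand": "hand"}

print(wordlist_distance(english, german))
```

The output is a single number. Nothing in it says whether the resemblance comes from inheritance, borrowing, or chance, which is the point of the paragraph above.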

To understand linguistic causation in the past, an essential starting point is to look at it in the present. Suppose that you are a native speaker of English:

If you say "football" or "garage" to your child while speaking English, it's because you grew up speaking English, and you know that this is what other English speakers say. The fact that French speakers happen to call it "football" too, if you're even aware of it, has nothing to do with your choice of words.

If you say "football" or "garage" to your child while speaking French, it's because you later studied French, and you know that this is what French speakers say. The fact that it's also what English speakers say no doubt made it easier to memorise, but if French speakers had named them something else, you would be doing the same.

We thus see that, for shared words, inheritance from either of two radically different languages can yield precisely the same outcome. The fact that English and French share these words in the first place is obviously due to contact (in each direction). The fact that your child is growing up with them, however, is because you're faithfully passing on the existing norms of one or the other language, not because you're combining them. In historical linguistic jargon, the use of the word "football" is at this point being inherited, not borrowed. Thus, if an English-monolingual Cajun says "stupid", it's not because he's managed to hold on to his ancestors' French word "stupide", it's because that happens to be the English word for it.

So, if we have a word in language A, and find the same word in two potential source languages B and C, we can't determine which it came from by looking at which language was spoken in the area earlier, or which was spoken by the speakers' ancestors. We can only determine which it came from by determining which language (if either) was transmitted as a whole, and the evidence for that can only come from forms that aren't shared between B and C. I leave the application of this to Levantine ʕāmmiyya as an exercise for the reader.
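The logic of that last step can be sketched in Python. This is a toy illustration only: the three wordlists and their forms are invented placeholders, the function name is hypothetical, and counting exact matches is a crude stand-in for the real comparative work on sound correspondences and morphology.

```python
def diagnostic_matches(lang_a, lang_b, lang_c):
    """Sort the words of A by what they can tell us about A's source.
    Meanings where B and C have the same form are set aside as
    uninformative: A's form would look the same whichever was the source."""
    result = {"uninformative": [], "points_to_B": [],
              "points_to_C": [], "neither": []}
    for meaning, form in lang_a.items():
        form_b = lang_b.get(meaning)
        form_c = lang_c.get(meaning)
        if form_b == form_c:
            result["uninformative"].append(meaning)
        elif form == form_b:
            result["points_to_B"].append(meaning)
        elif form == form_c:
            result["points_to_C"].append(meaning)
        else:
            result["neither"].append(meaning)
    return result

# Invented forms for four meanings m1-m4 (not real language data).
lang_a = {"m1": "tata", "m2": "kolu", "m3": "sibe", "m4": "nemo"}
lang_b = {"m1": "tata", "m2": "kolu", "m3": "sibe", "m4": "raki"}
lang_c = {"m1": "tata", "m2": "pune", "m3": "dova", "m4": "lusa"}

result = diagnostic_matches(lang_a, lang_b, lang_c)
for label, meanings in result.items():
    print(label, meanings)
```

Here m1 is shared by all three and so decides nothing; only m2 and m3, where B and C differ, carry any weight, and both point to B.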

* It's beating a dead horse at this point, but: this Automated Similarity Judgement Program? It, too, finds that Levantine is way closer to Standard Arabic than to Aramaic, just like any historical linguist could have told you from the start.

Saturday, September 13, 2014

Everything I've been saying for the past 3 posts is basic textbook stuff, reflecting a stable consensus among Semitic historical linguists over, oh, the past two centuries or so. Why, then, is this zombie hypothesis that Levantine Arabic comes from Aramaic still popular in parts of the Levant? That's no great mystery: it comes from a more general movement to emphasise Levantine (and especially Lebanese) culture's continuity with the pre-Islamic Levant, and downplay the influence of Arabs. (Similar efforts have been made in North Africa, notably by Abdou Elimam.) As far as I can tell, the unstated reasoning goes something like this:

Levantines are descended from the Aramaic-speaking natives of the land, not from Arab immigrants.

Levantines' language contains a lot that sounds like Aramaic.

Therefore, Levantine is a continuation of Aramaic, not of Arabic.

Step 3, of course, does not follow from Steps 1 and 2. Step 1 is irrelevant to the whole question; the language of your ancestors is very often not the ancestor of your language (ask any Irishman, or any Egyptian). Step 2 is necessary but insufficient for getting to Step 3, since the statement is just as true of Classical Arabic - or of Akkadian, or Ethiopic - as it is of Levantine; we've already seen that deciding linguistic ancestry requires a more sophisticated toolkit.

Nevertheless, this impulse to emphasise continuity and downplay movement deserves more attention. In the Arabic-speaking world, the conspicuous problems with the existing political and economic order, and the humiliating contrasts between the ideals of pan-Arabism and the reality of closed borders and unchallenged occupations, provide an obvious local motivation to downplay Arab identity, and language is so central to pan-Arab identity that it could hardly be left unchallenged. But the impulse is not unique to the region; in some respects, it faithfully reflects wider intellectual trends of the late 20th/early 21st century.

During this era, immediately following some of the largest migrations and invasions in human history, many archeologists and historians have come to feel more and more uncomfortable with the very idea of either. Changes in material culture previously seen as the result of migration were re-explained as diffusion or independent innovation, and reports of barbarian invasions were reinterpreted or dismissed. In some ways, this has been a useful corrective to a previous era's overemphasis on migration; it has arguably made linguists more conscious of the familiar fact that language shift does not necessarily imply invasion, much less population replacement. In others, its influence has been rather less helpful. Linguists reached the late 20th century with a well-tested toolkit for studying the origins of basic vocabulary and morphology, its predictions spectacularly confirmed by such discoveries as laryngeals in Hittite and labiovelars in Mycenaean Greek. Applying this to most Old World languages, and many American or Australian ones, yields a story of discontinuity (be it through language shift or population replacement) that would be familiar to any 19th-century philologist, but that grates somewhat on postmodern ears. Of course, the same toolkit often allows us to detect substrata - elements left over from the population's previous language after they shifted to another one - but that's not enough to satisfy everybody.

A few linguists have responded by trying to change the rules of the game, insisting that the origins of a language should be determined not by vocabulary and morphology, as is normally done, but by purely structural features. This is an important component of Wexler's generally rejected claims that Yiddish is non-Germanic (and that Modern Hebrew is non-Semitic), and is the very essence of Lefebvre's somewhat more popular claims that Haitian Creole is just relexified Fongbe (and almost anything else with "relexification" in the title.) This approach runs into severe problems almost instantly - establishing the history of syntactic or semantic patterns is far more difficult than establishing the history of vocabulary or morphology, simply because the former are far less arbitrary and are chosen from a far smaller set of possibilities. To make matters worse, we also find major discontinuities in such patterns in cases where both the population and the vocabulary were relatively stable, such as the transition from Old English to Modern English. Johanna Nichols' efforts point towards the possibility of getting around this by identifying highly time-stable typological features, but the results, at their best, are not nearly fine-grained enough to support narratives of continuity in any specific location. "Continuitarians" in the Arab world apparently haven't gotten around to adopting this approach yet, except occasionally in Morocco, where academic linguistics is unusually advanced for the region; they surely will, however, when they realise that it could be extended to cases like Egypt, rather than being limited to the Fertile Crescent.

For much of the world, especially Europe, a complete lack of ancient written documentation makes another response available: simply argue that the language currently spoken there must have been spoken far earlier than previously assumed, and hence got there not through invasion but through some more peaceful process. This yields the various Paleolithic Continuity Hypotheses. The main problem with this for linguists is that it forces us to postulate a much lower rate of linguistic change for the past than is observed for languages with a long written history, or even for unwritten languages that happen to have been recorded at long intervals; as a result, these hypotheses have remained fairly unpopular. For the Middle East, however, the point is moot: writing has a longer history there than anywhere else on the planet, and that history reveals regular episodes of language extinction, language shift, invasion, migration, exile, and everything else that we're supposed to be de-emphasising.

So if you really want to emphasise your languages' continuity with your ancestors', these are two more promising ways to do it. But I would suggest that there's no reason to bother. If your current identity isn't working out for you, and you don't think you can reform it, why not work on creating a genuinely new one, rather than perpetuating the obsession with heritage by digging around in history for an even older one? It worked out pretty well for America, after all.

Thursday, September 11, 2014

We've seen that historical linguists decide which languages share a more recent common ancestor on the basis of shared innovations (or their absence). But if you're paying attention, you may have noticed a potential problem here: innovations can be shared for at least three reasons:

Common ancestry - the reason why, for example, Proto-Indo-European intervocalic *s has changed to r both in Spanish and in French.

Contact - for example, the change of r (the rolled r you get in Spanish) to R (the uvular r you get in French) started in French, but spread to other European languages such as German, probably due to the prestige of French among the upper classes (actually there's some debate about the direction of spread - see eg this paper by Kostakis - but either way it spread through contact)

Chance - for example, θ (th) has changed to t both in Jamaican English and in Levantine, but not because they share any common history or close ties.

So, when it comes to shared innovations, what can we do to distinguish the "confounding factors" of chance and contact from common ancestry? There are two obvious general approaches. The most securely reliable is to establish relative chronology: if change A was applied to the outputs of change B, then obviously change B is the older. Unfortunately, many pairs of changes are commutative - the relative order makes no difference to the output. That often forces us to resort to the more probabilistic criterion of number of changes: if language A shares a lot of common innovations with language B to the exclusion of C, and only a couple with language C to the exclusion of B, then it's more parsimonious to group A with B and find some other explanation for those shared with C. For better results, we can weight the innovations according to the chances of them occurring independently: for example, a change of ð > d is rather common worldwide, whereas a change of ɬʼ > ʕ is rather unusual.
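To make the relative-chronology criterion concrete, here is a minimal Python sketch. The rules are modelled loosely on changes discussed in this series (mergers of θ > t and ð > d, and a shift of t > θ after vowels), but the input strings are invented toy forms rather than real reconstructions, and all the function names are my own.

```python
import re

def theta_to_t(word):
    """Unconditioned merger: every θ becomes t."""
    return word.replace("θ", "t")

def dh_to_d(word):
    """Unconditioned merger: every ð becomes d."""
    return word.replace("ð", "d")

def spirantize_t(word):
    """t becomes θ after a vowel, unless it is part of a doubled tt."""
    return re.sub(r"(?<=[aeiou])t(?!t)", "θ", word)

def apply_in_order(word, rules):
    """Apply each sound change in turn to the output of the previous one."""
    for rule in rules:
        word = rule(word)
    return word

def commute(rule_a, rule_b, words):
    """True if the order of the two rules makes no difference for these words."""
    return all(
        apply_in_order(w, [rule_a, rule_b]) == apply_in_order(w, [rule_b, rule_a])
        for w in words
    )

TOY_WORDS = ["θata", "atta", "ðaθ"]  # invented forms, for illustration only

print(apply_in_order("θata", [theta_to_t, spirantize_t]))  # merger first
print(apply_in_order("θata", [spirantize_t, theta_to_t]))  # merger last
print(commute(theta_to_t, dh_to_d, TOY_WORDS))       # the two mergers
print(commute(theta_to_t, spirantize_t, TOY_WORDS))  # merger vs. spirantisation
```

With the merger applied first, a θ survives after the vowel; with the merger applied last, no θ can survive at all. The two mergers, by contrast, give the same output in either order, so the forms alone say nothing about which of them came first.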

Levantine Arabic provides a useful case study: as NNT correctly pointed out, it shares a couple of innovative sound changes with Aramaic, in particular θ (th) > t, ð (dh) > d. (The hamza-y correspondence is a different issue - there's massive variation within Classical Arabic on where and whether hamza is realised, as can be seen from the different Qur'an reading traditions, and the consonantal orthography of Classical Arabic obviously reflects a dialect in which, like the majority of present-day dialects but unlike Modern Standard, hamza was hardly ever pronounced). Yet we have seen that Levantine Arabic does not share most of Aramaic's defining innovations, and does share important innovations of Arabic, such as the reflexes of proto-Semitic *g, *θʼ, *ɬʼ, and (depending on reconstruction) *š, the replacement of "say" (originally 'amar-) with qāl-, the metathesis of ʕam- "with" to maʕ-, or almost every detail of the extremely intricate broken plural system. How can this be explained?

If the explanation is common ancestry, then we should find the changes θ > t, ð > d only in Levantine words that are not Arabic innovations. In fact, however, we find them in words such as itnēn "two", in which the i- is an Arabic innovation - cp. Arabic iθnayni (acc/gen), Aramaic trēn, proto-Semitic *θn-ay-n(a). This hypothesis would also fail to account for the rest of the observations; if Levantine shares a more recent common ancestry with Aramaic than with Arabic, and is spoken exclusively in an area once dominated by Aramaic, then why on earth did it pick up so many innovations from Arabic while remaining immune to practically all the innovations Aramaic went through except these two? Both the criteria given above therefore point away from common ancestry as an explanation.

This suggests that we should consider contact. At first sight, you might think the answer is simple: Aramaic speakers couldn't pronounce interdentals, so they left them out of their Aramaic-accented Arabic. But that hypothesis would be absurd. By the late pre-Islamic era, all known varieties of Aramaic did in fact have the sounds θ and ð, due to a later development of t > θ, d > ð after vowels (except when doubled). We find these sounds alive and well in the only surviving Levantine Aramaic dialect, that of Maaloula: eg xoθla "wall", ḳrīθa "village", eḥða "one (f.)". Why, then, would Aramaic speakers change these sounds to t, d in Arabic?

How about the opposite contact situation: Arabic speakers living on the fringes of the Aramaic-speaking world copied the shift θ > t, ð > d from their neighbours, while those living further inland stuck with the traditional pronunciation? That is more plausible, but still a bit problematic. The development of t > θ, d > ð had already happened by 250 BC in Aramaic, so the shift would have to have been borrowed before that; but Arabic-speaking groups which used Aramaic as their high language, such as the Nabataeans of Petra, are only well attested later than that.

A third, more subtle contact explanation seems preferable. Aramaic speakers would certainly have taken advantage of the many similarities between Aramaic and Arabic to reduce the burden on their memories. But, whereas θ and ð are extremely common in Aramaic, in Arabic they are quite rare: in the Qur'ān, t is ten times commoner than θ, and while ð is about as common as d overall, practically all of its occurrences are limited to demonstratives. A good rule of thumb for the Aramaic learner of Arabic to apply would therefore be "replace Aramaic θ, ð with t, d except in demonstratives"; 9 times out of 10, the result would be correct Arabic, and the 10th time it would still be comprehensible. In such an environment, where Aramaic-speaking learners of Arabic outnumbered native speakers, it's not hard to imagine the distinction disappearing. If so, the loss of interdentals in Levantine would indeed reflect Aramaic influence - as a result of Aramaic speakers' effort to avoid Aramaic forms!

Tuesday, September 09, 2014

Last time, I promised to look at the "ratio of content ⊂ Arabic & ⊄ Aramaic". To do that, we need two things: data on the frequency of different words and morphemes, and etymologies for each word and morpheme. If this were English, I could offer you a 450-million-word online digital corpus for the former, and the OED for the latter. For Levantine Arabic the pickings are a bit scantier. There are indeed several digital corpora of Levantine Arabic, but none of them are publicly available, and none have published any frequency data that I can find offhand; and for etymologies, you have to consult, by hand, as many dictionaries (of several languages) as it takes.

So for present purposes, I will use a much smaller substitute, which can hardly be accused of any partiality to Standard Arabic: namely, a selection from Said Akl's Roomyo w Julyeet (CORRECTION: introduced by Said Akl), which I was lucky enough to run into at an Oxfam a few years ago. I picked a well-known section of the play whose language seemed relatively simple, with little or no visible Standard Arabic influence - the lines starting from "Romeo, Romeo, wherefore art thou Romeo?" (p. 62), including Romeo's reply and Juliet's reply to him (finishing on the second line of p. 63) - and counted morpheme frequencies (retaining his eccentric orthography). The 26 morphemes that occurred more than once account for about two-thirds of the selection, so looking at their etymologies gives us the maximum of information for the minimum of effort - and here they are. Only those that are unambiguously Arabic or unambiguously Aramaic are relevant to our purpose; the rest may be dismissed as "confounding factors":

b(e)- / m- بـ٬ مـ [marker of the indicative imperfect] (10 occurrences): Innovative. This form is found as such neither in Classical Arabic nor in Aramaic, and its etymology poses some difficulties; if you know of any convincing work on this, let me know in the comments.

-aq ـك "you m. sg. oblique" (9 occurrences): Arabic. Both Aramaic and Arabic have cognates of this, but in Aramaic the consonant has changed to kh, whereas Levantine - like Arabic - has kept the original k.

¢esm اسم "name" (6 occurrences): Arabic. Both Aramaic and Arabic have cognates of this, but in Aramaic the consonant is sh, whereas in Levantine - as in Arabic - it's s. (There is controversy over which value is original.)

la "no, not, neither... nor" (5 occurrences): "Confounding". The form is shared identically by Arabic and Aramaic; the usage is actually closer to Arabic (where it negates verbs only in the imperfect and the negative imperative) than to Aramaic (where it negates verbs in all tenses), but we'll score it as shared.

-u / -h / -vowel length (depending on context) ـه "him, his" (5 occurrences): Arabic. Aramaic -eh could explain the h form and the vowel length form, but the -u can be satisfactorily derived only from Arabic -hu.

quun كون "be" (4 occurrences): "Confounding". In reality this is much more likely to be Arabic, since the normal Aramaic root for "be" is hwy, but kwn is attested in this sense in Aramaic too.

"the" (4 occurrences): Arabic. (Aramaic originally used suffixed -aa, which later lost its definite sense.)

[relative marker] (3 occurrences): Innovative, but based on extending the functions of the Arabic definite article, and probably on shortening a form similar to Classical Arabic alladhii, which it resembles rather more than the Aramaic relative marker dh-.

ma ما "not" (3 occurrences): Arabic. In Aramaic, maa is never used as a negator.

law لو "if" (3 occurrences): Arabic. (Aramaic does not generally use this, but where traces of a cognate are found, as in some frozen combinations, it takes the form luu, not law.)

cu شو "what?" (3 occurrences): Innovative, from Arabic. Found as such neither in Arabic nor in Aramaic, but its generally accepted etymology is Arabic, from a contraction of أي شي هو "what thing is it?".

sammi "name (v.)" (3 occurrences): Arabic, for the same reason as esm above.

-a ـا "her" (2 occurrences): "Confounding". At first sight the loss of the h makes it appear closer to Aramaic than to Classical Arabic - but the h was also lost in -u "him", which cannot be explained as Aramaic.

-t ـت [feminine singular construct state marker]: "Confounding". The form is compatible with Arabic or Aramaic origins (Aramaic had th, but we would expect that to be turned back into t, since Levantine has no interdentals.) The function straightforwardly existed in Aramaic; in Classical Arabic, it did not, but the pre-pausal pronunciation of -at- as -ah provides an obvious source for it to develop from, and indeed it exists in practically all modern dialects (including those of the Arabian peninsula). If you're feeling really generous, though, you might ignore the latter fact and award this one to Aramaic.

hu هو "he" (2 occurrences): "Confounding". At first sight the Aramaic form huu is closer than Classical Arabic huwa, but loss of final vowels is regular in Levantine Arabic, so you would expect huwa to become hu anyway.

jez¢ جزء "part" (2 occurrences): Arabic. I haven't noticed an Aramaic cognate, but even if there is one, the palatalisation of the j (from original g) marks it as Arabic.

So, out of these 26 items - which together account for 107 out of the 161 morphemes in this selection - 10 are unambiguously Arabic (accounting for 46 morphemes), and none are unambiguously Aramaic. 15 items (accounting for 51 of the morphemes) could equally well be Arabic or Aramaic, and as such are irrelevant to determining which one predominates within Lebanese Arabic. (If you decide to be really generous to Aramaic, you might shift -a, hu, and -t to the Aramaic column, accounting for a grand total of 6 morphemes versus Arabic's 46.) The remaining single item, the imperfect prefix b-, is a later innovation whose history is unclear; even if someone found an Aramaic etymology for it and added it to all the unlikely cases mentioned, the ratio of "content ⊂ Arabic & ⊄ Aramaic" to "content ⊂ Aramaic & ⊄ Arabic" for this list would still be about 3:1. On a less generous and more plausible calculation, it's infinite (46:0). Either way, by this criterion, too, Levantine is Arabic, not Aramaic.
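For anyone who wants to check the arithmetic, here is a small Python sketch of the tally. The totals are the ones quoted in this post; the size of the ambiguous column is simply what is left after subtracting the other two, and the variable names are mine.

```python
TOTAL_MORPHEMES = 161        # all morpheme tokens in the selection
REPEATED_TOKENS = 107        # tokens of the 26 items occurring more than once
ARABIC_TOKENS = 46           # tokens of the 10 unambiguously Arabic items
ARAMAIC_TOKENS = 0           # no unambiguously Aramaic items
INNOVATIVE_B_TOKENS = 10     # the imperfect prefix b(e)- / m-
GENEROUS_ARAMAIC_TOKENS = 6  # -a, hu and -t, if awarded to Aramaic

# Whatever is neither Arabic, Aramaic, nor the b- prefix is ambiguous.
ambiguous_tokens = (
    REPEATED_TOKENS - ARABIC_TOKENS - ARAMAIC_TOKENS - INNOVATIVE_B_TOKENS
)

# Share of the selection covered by the repeated items.
coverage = REPEATED_TOKENS / TOTAL_MORPHEMES

# Most generous case for Aramaic: the three doubtful items plus b-.
generous_aramaic = GENEROUS_ARAMAIC_TOKENS + INNOVATIVE_B_TOKENS
generous_ratio = ARABIC_TOKENS / generous_aramaic

print(f"ambiguous tokens: {ambiguous_tokens}")
print(f"coverage of selection: {coverage:.1%}")
print(f"Arabic : Aramaic (generous): {generous_ratio:.2f} : 1")
```

The generous case comes out at 46:16, a little under 3:1; the stricter count has nothing at all in the Aramaic column, so the ratio is undefined rather than merely large.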

If you pick a long enough text, of course, you will eventually find an Aramaic loan or two. There are quite a few Aramaic loans in Levantine Arabic, depending on the dialect, and they must really stand out to a Levantine speaker studying Aramaic. But even in the most heavily Aramaic-influenced dialects, they occur far less frequently than unambiguously Arabic forms. While historical linguists' usual definition of language origin does not rely on any explicit frequency criteria, in all the cases I've seen, the most frequent source of vocabulary by token count for a sufficiently large text turns out to be what historical linguists would consider as that language's parent. In Levantine Arabic the effect is even stronger, since not only is the basic vocabulary of Arabic origin, so is most of the learned vocabulary.

Now, after all those calculations, I'm sure you're eager to read the lovers' dialogue, so here it is:

Sunday, September 07, 2014

Following in a long tradition of people imagining that knowing a few languages or a bit of mathematics implies they already know linguistics better than any self-styled specialist, the quasi-celebrity author Nassim Nicholas Taleb recently decided to claim that "Levantine is modernized Aramaic". (Let's not comment further on the attached table, whose attempt at Standard Arabic is painfully bad, and which omits the whole Aramaic column except for the title. Also, let's not confuse it with the separate question of how distant Levantine is from Standard Arabic.) The ensuing Twitter "debate", while of little value in itself, nicely illustrates a number of common misconceptions, some of them worth responding to in a less cramped medium. I'll start with the most explicitly political one, since it's bound to colour responses to any purely academic argument:

Less than 90 km from NNT's hometown is a village where they do in fact still speak Aramaic, while of course still being diglossic in Arabic: Maaloula, in Syria. Despite heavy Arabic influence, this village's language has never once been mistaken for Arabic; its own people call it siryêni, and European Semitists recognised it as Aramaic as soon as even simple wordlists became available. If you happen to be Levantine, try listening to some of it (eg here) - how much of that do you understand? The same is true of other relict Semitic languages within the Arabic-speaking world, such as Mehri or Jibbali or Soqotri or Neo-Mandaic. I have more than one book in which Soqotri or Jibbali speakers attempt to prove that their languages are really Arabic, for much the same reasons that NNT wants his language not to be Arabic - but, notwithstanding the speakers' desires, Semitists had no trouble proving that these languages were not descended from Arabic. Conversely, the "high" languages of Malta have always been English and Italian, yet, despite Maltese nationalists' best efforts to show that Maltese was really Punic, European Semitists had no difficulty in identifying it as descended from Arabic. So, no, Semitic historical linguists do not base their decisions on what kind of diglossia happens to be around, nor were all those 19th-century German Orientalists secret agents sent back in time by the Baath Party. To the contrary, almost all Semitists I've known would be far more excited to discover that some undocumented variety was a new Semitic language than to find out that it was "just" another dialect of Arabic.

How do linguists know that Spanish is descended directly from Latin, not from Italian? Simple: we look for cases in which Italian has made a change - innovated - and Spanish hasn't. Such cases are easy to find: for example, in Italian original *fl has become fi (thus fiore "flower") and original long *e in open syllables has become i (thus di "of"), whereas in Spanish original *fl remains fl, and *e remains e (thus flor, de). If Spanish were descended from Italian, then these changes would all have had to happen and then reverse themselves in Spain, which is very unlikely. We can know which form was original not just because in this case we have copious ancient data, but also by using comparative-historical reconstruction. The full toolkit would take too long to explain here (my favourite textbook is Lyle Campbell's Historical Linguistics), but basically, we:

establish sets of sounds corresponding systematically to one another;

figure out whether these correspondence sets systematically occur only in certain environments, and, if so, see whether there are any other correspondence sets occurring only in non-overlapping environments that they can be unified with.

This procedure allows us to prove that the ancestor language must have distinguished at least as many phonemes as members of the resulting set of correspondence sets, and - combined with a large body of knowledge about likely and unlikely sound changes - gives us a good chance of determining what the actual sounds of those phonemes were. This technique was, of course, developed mainly for reconstructing unattested languages, but way back in the 1950s, Robert Hall decided to test it by applying it to Romance. The result was, as you might hope, Vulgar Latin.
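The two steps listed above can be sketched in a few lines of Python. This is only an illustration: the alignments are done by hand (an assumption on my part, since automatic alignment is a problem in its own right), the pair fiacco ~ flaco is an extra example I have added alongside the two from this post, and the function names are hypothetical.

```python
from collections import Counter

# Hand-aligned cognate pairs (Italian, Spanish); "-" marks a gap.
# The alignments are my own assumption, made by eye.
ALIGNED = [
    (["f", "i", "o", "r", "e"], ["f", "l", "o", "r", "-"]),   # fiore ~ flor
    (["f", "i", "a", "cc", "o"], ["f", "l", "a", "c", "o"]),  # fiacco ~ flaco
    (["d", "i"], ["d", "e"]),                                  # di ~ de
]

def correspondence_sets(aligned_pairs):
    """Count how often each (Italian, Spanish) segment pair lines up."""
    counts = Counter()
    for left, right in aligned_pairs:
        if len(left) != len(right):
            raise ValueError("aligned words must have equal length")
        counts.update(zip(left, right))
    return counts

def environments(aligned_pairs, pair):
    """Collect the Italian segment preceding each occurrence of `pair`.

    '#' stands for the start of the word.
    """
    found = set()
    for left, right in aligned_pairs:
        for i, segs in enumerate(zip(left, right)):
            if segs == pair:
                found.add(left[i - 1] if i > 0 else "#")
    return found

sets = correspondence_sets(ALIGNED)
print(sets[("i", "l")], environments(ALIGNED, ("i", "l")))
print(sets[("i", "e")], environments(ALIGNED, ("i", "e")))
```

In this tiny sample, Italian i lines up with Spanish l only after f, and with Spanish e only elsewhere: two correspondence sets in non-overlapping environments, which is exactly the kind of pattern the second step looks for. A real application would need hundreds of cognates before any such conclusion could be trusted.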

Now, let us apply this to Levantine, Arabic, and Aramaic. Reconstructing the common ancestor of Aramaic and Arabic (see eg here or even just here) shows that Aramaic features a number of innovations not shared with Arabic; conveniently, many of these are mergers. In particular, in Aramaic *`, *ʁ (gh), and *ɬʼ (lh) all merge to ` (ayin); *x (kh) and *ħ merge to ħ (heth); initial *w and *y merge to *y. In Arabic, all of these distinctions are maintained. Now, the nice thing about mergers is that they can't be reversed; once two formerly distinct word classes feature the same phoneme, there's no way for the ordinary speaker to recover the distinction. A monolingual Aramaic speaker has no way of telling that the ` in 'ar`ā "earth" (< *'arɬʼ- + -ā) used to be pronounced differently from the ` in ṭar`ā "door", or in `aynā "eye". In Levantine, all of these distinctions are normally maintained, just as they are in Arabic; أرض has none of the consonants of عين. QED. (In fact, historical linguists have also succeeded in identifying some Aramaic loans into Levantine Arabic by finding the small minority of words in which these distinctions were lost.) In fact, you don't even need to look at phonology to figure this out; the grammar provides plenty of clues. In Aramaic, for example, almost every noun ends in -ā, except in a few specific contexts. This is an innovation specific to Aramaic, accomplished by gluing a former demonstrative on to the end of the noun, and preserved in every modern spoken Aramaic variety. In Arabic, it never happened - nor, obviously, in Levantine.
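The point about mergers being irreversible can be put in terms of simple mappings. The Python sketch below encodes only the five consonants mentioned above, writing ayin as ʕ; the Aramaic column follows the mergers just described, while the Arabic reflex ḍ for *ɬʼ (the ض of أرض) is the one detail I am supplying myself. The function names are hypothetical.

```python
# Reflexes of five proto-Semitic consonants in each language.
# Aramaic: the mergers described in the paragraph above.
ARAMAIC = {"ʕ": "ʕ", "ʁ": "ʕ", "ɬʼ": "ʕ", "x": "ħ", "ħ": "ħ"}
# Arabic keeps all five apart; ḍ is the Arabic outcome of *ɬʼ (as in أرض).
ARABIC = {"ʕ": "ʕ", "ʁ": "ʁ", "ɬʼ": "ḍ", "x": "x", "ħ": "ħ"}

def sources(reflexes):
    """Group proto-phonemes by the sound they end up as."""
    groups = {}
    for proto, reflex in reflexes.items():
        groups.setdefault(reflex, set()).add(proto)
    return groups

def is_reversible(reflexes):
    """True if every modern sound points back to exactly one proto-phoneme."""
    return all(len(protos) == 1 for protos in sources(reflexes).values())

print(is_reversible(ARAMAIC))   # can an Aramaic speaker undo the changes?
print(is_reversible(ARABIC))    # can an Arabic speaker?
print(sources(ARAMAIC)["ʕ"])    # everything that Aramaic ʕ might go back to
```

From the Aramaic side, a given ʕ could go back to any of three proto-phonemes, so no rule stated over Aramaic sounds alone could restore the three-way distinction. A variety that keeps the three apart, as Levantine does, cannot have passed through the merged stage.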

Of course, NNT shows no signs of even being aware of the relevance of regular sound correspondences, mergers, or any of the other elements in a historical linguist's toolkit, much less of accepting them as definitive criteria for language classification. At one point, however, he vaguely expresses the criterion he thinks should be definitive:

Now that we've seen a little bit of how linguists determine what comes from Arabic and what comes from Aramaic, we're ready to look at the results of this criterion in the next post. You should be able to guess the answer already...

Monday, August 18, 2014

From Morocco to Oman, there is a long tradition of imagining that the Berbers of North Africa and the Mehris of South Arabia speak the same language. This is by no means confined to pan-Arab nationalists - Siwis have told me more than once that some friend of a friend had met non-Arabic-speaking Yemenis and understood their language, and I'm told many Mehris have the same belief. I've previously discussed some possible reasons for this belief, as well as the more obviously propagandistic claim that Berber descends from Arabic; both are false.

Nevertheless, it is true that significant numbers of Yemenis participated in the Arab migrations to North Africa during the Islamic era, and it's not inherently implausible that some should have brought their languages with them. In fact, I just came across what looks very much like a South Arabian loan into the northwestern Libyan Berber variety of Zuwara (At Willul).

In Zuwara, the usual word for "father" is baba, as in many other Berber varieties, but in a few collocations such as əg tíddart n ḥíbi-s "in her father's house", a different term ḥibi is substituted (Mitchell 2009:303, 341). This word is unlikely to be proto-Berber, since proto-Berber did not have a phoneme /ḥ/ and since it is quite unusual within Berber. And as far as I know, it is not used anywhere in Arabic (although Libyan dialects are not that well documented). One could try to link it to ḥabīb-ī "my beloved", but that would be phonetically irregular and semantically unlikely, since this term is normally used in the context of romantic love or of a child by their parents.

However, the normal word for "father" in Mehri is ḥīb "father" - ḥayb-ī "my father", ḥīb-as "his father" (Watson 2012:149). In fact, Mehri adds this ḥ prefix to a number of kinship terms: ḥāmē "mother", ḥabrē "son", ḥabrīt "daughter" (ibid), as well as a number of other common nouns. Its function is to mark definiteness (ibid:64). But no such definite article has ever existed in Arabic or in Berber, so the only possible explanations for the similarity of Zuwara ḥibi are pure coincidence or borrowing from Mehri into Berber (perhaps via an Arabic dialect?). It will be interesting to see if other cases turn up.

And as long as I'm talking about Libyan Berber, I really ought to mention Marijn van Putten's new book A Grammar of Awjila Berber (see his announcement at Oriental Berber). This careful analysis of all the unfortunately limited data available on the very unusual Berber variety of Awjila, in the far east of Libya, is an important resource for Berber historical linguistics. I hope that things settle down in Libya soon enough to make a fuller description possible, but for the moment, this work appears unlikely to be superseded.

Saturday, August 09, 2014

For most of the past decade, while first the rest of Iraq and then Syria (150,000 dead, 2.5 million refugees) have burned, Northern Iraq has seemed like a relative oasis of calm. That has changed rather suddenly: with ISIS' religious persecution, and now American airstrikes, Northern Iraq and its minorities are suddenly prominent in the headlines. The headlines throw into sharp relief the region's status as perhaps the most religiously diverse place in the Middle East - but what they may not show is that this region is also a small-scale "residual zone" preserving rather more linguistic diversity than is typical for such a small area in the modern Fertile Crescent (not just Arabic and Kurdish!)

The most endangered language of the region is certainly Northeastern Neo-Aramaic (NENA), or Sûreth (ܣܘܪܝܬ). Once, Aramaic was the lingua franca of the Middle East, spoken in various dialects from Gaza to Basra, and written as far afield as China and India. By the early 20th century, it was restricted to a few hundred far-flung mountain villages; the largest dialect group, NENA, was centered on the Christian (Assyrian and Chaldean) villages of the Mosul Plain, such as Tel Kef (Telkepe) and Qaraqosh, and across the border in Iran and Turkey; a detailed map is available at Cambridge's NENA Database. Today, those who have stayed behind in ever harder conditions are substantially outnumbered by their diaspora in cities such as Detroit or Sydney, whose children increasingly just speak English - and, as of the past couple of days, media accounts suggest that fleeing refugees have left the Mosul Plain villages practically empty. Their exodus is rather reminiscent of what happened about a century ago: during the Armenian/Assyrian Genocide, the NENA-speaking Assyrians of Hakkari fled from Turkey never to return, taking refuge in Iraq and finally in Syria. It remains to be seen whether this exile will be as lasting as the previous one. If you're wondering how the language sounds, the NENA Database site has a number of recordings, some transcribed, such as The Story of the Cobbler; others can be heard at Semitisches Tonarchiv.

While Kurds prefer to consider Kurdish as one language, the two main Kurdish varieties of northern Iraq - Sorani and Kurmanji - are strikingly different from one another, and are usually considered as separate languages by academics. The smaller Gurani language (see DOBES), spoken in northwestern Iraq and also commonly labelled Kurdish, doesn't even belong to the same branch of Iranian as Sorani and Kurmanji. Many of its speakers belong to loosely Shia-affiliated minority religions, such as the Ahl-i Haqq and the Shabak, considered by ISIS as beyond the pale.

The other minority group unfortunate enough to have been pitched into the headlines, Yezidis, do not have a language of their own; they speak Kurmanji Kurdish. However, the Yezidis are associated with a unique writing system. In the early 20th century, manuscripts summarising Yezidi beliefs written in a unique alphabet (such as the Meshefa Resh "Black Scripture") came into the possession of Western researchers, and the alphabet in question duly found its way into compendia such as Diringer (1968). Later research, though, suggests that both these manuscripts and the alphabet they were written in were created for Western consumption, likely by a non-Yezidi bookseller, rather than representing a Yezidi tradition (Kreyenbroek and Rashow 2005, EI).

The region's Turkmen, many of whom have also apparently been persecuted by ISIS for their Shiism, speak a Turkic variety close to Turkish and Azeri. From what little information I've seen, it seems unlikely to qualify as a separate language, but does not seem to have attracted much research.

The Arabic dialects of northern Iraq - the so-called qeltu dialects, for their unique pronunciation of the word "I said" - are also quite interesting in their own right; the spoken Arabic dialect of Abbasid Baghdad seems likely to have belonged to this group. However, that is another story for another day...

Monday, July 14, 2014

Linguistically, the northern and southern shores of the Sahara have remained surprisingly distinct, and most Saharan groups are easily identifiable as outposts of one or the other. Occasionally, however, a greater degree of language mixture is found. Nowhere is trans-Saharan language mixture more prominent than in Northern Songhay, a group of languages spoken in Niger, Mali, and Algeria combining a Songhay base with an enormous Berber superstratum, including Korandjé, a southwestern Algerian language I've been working on for a few years now.

Counting cognates makes it very clear that Korandjé is the outlier, as might be expected based on geography:

              Korandjé   Tadaksahak   Tagdal   Tabarog   Tasawaq
Korandjé         –          139         140      141       152
Tadaksahak      139          –          242      238       214
Tagdal          140         242          –       304       237
Tabarog         141         238         304       –        229
Tasawaq         152         214         237      229        –
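The outlier status can be checked directly from the counts above. The following Python sketch just averages each variety's shared-item counts with the other four; averaging is my own choice of summary, not a standard subgrouping method, and the names are hypothetical.

```python
VARIETIES = ["Korandjé", "Tadaksahak", "Tagdal", "Tabarog", "Tasawaq"]

# Shared-item counts from the table, one entry per unordered pair.
SHARED = {
    ("Korandjé", "Tadaksahak"): 139,
    ("Korandjé", "Tagdal"): 140,
    ("Korandjé", "Tabarog"): 141,
    ("Korandjé", "Tasawaq"): 152,
    ("Tadaksahak", "Tagdal"): 242,
    ("Tadaksahak", "Tabarog"): 238,
    ("Tadaksahak", "Tasawaq"): 214,
    ("Tagdal", "Tabarog"): 304,
    ("Tagdal", "Tasawaq"): 237,
    ("Tabarog", "Tasawaq"): 229,
}

def shared(a, b):
    """Look up the count for a pair regardless of the order given."""
    return SHARED[(a, b)] if (a, b) in SHARED else SHARED[(b, a)]

def mean_shared(variety):
    """Average number of items this variety shares with each of the others."""
    others = [v for v in VARIETIES if v != variety]
    return sum(shared(variety, o) for o in others) / len(others)

means = {v: mean_shared(v) for v in VARIETIES}
outlier = min(means, key=means.get)
closest_pair = max(SHARED, key=SHARED.get)

for variety, mean in means.items():
    print(f"{variety:<11} {mean:.2f}")
print("outlier:", outlier)
print("closest pair:", closest_pair)
```

Korandjé averages well under 150 shared items with the rest, while every other variety averages over 200. Raw counts like these mix retentions with borrowings, though, so they show distance rather than subgrouping.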

The other three Northern Songhay varieties (treating Tagdal+Tabarog as one variety) form a linkage, which, following Wolff and Alidou's suggestion, we might label Azawagh Songhay - from west to east: Tadaksahak, Tagdal+Tabarog, then Tasawaq. On this wordlist Korandjé is clearly closest to Tasawaq, but that's only because Korandjé and Tasawaq have both kept more Songhay vocabulary, a fact irrelevant for subgrouping. The only innovation in vocabulary that Korandjé and Tasawaq share to the exclusion of the rest is the borrowing of numerals from 5 up from Arabic, and if you look at the sound correspondences it's clear that Tasawaq and Korandjé each borrowed their current numerals separately from different dialects of Arabic. Tadaksahak, Tagdal, and Tabarog all show almost the same number of items shared with Korandjé due to common borrowing from Berber, and most of that is due to shared borrowings of widespread Berber words that could easily have happened independently. The use of a Berber form originally meaning "weaver" for "spider" in Korandjé and Tadaksahak alone is striking, but very likely coincidental.

Another way to look at this is to note that 188 of the 332 items are shared across all of Azawagh Songhay, whereas only 108 are shared across all of Azawagh Songhay plus Korandjé. Of the latter, only 9 are Berber or Arabic loans, while 99 are Songhay retentions:

This list is dominated by basic, rarely loaned words: nearly half of it overlaps with the Leipzig-Jakarta list. However, more culturally specific shared retentions such as "iron", "owner", "cow", "donkey", "horse", "pot", "sew", and "sandals" remind us that the split of Northern Songhay is after all rather recent (much more so, in fact, than these words alone might suggest).

These pan-Northern retentions, however, by no means exhaust the Songhay lexicon of Northern Songhay. Korandjé alone retains some 183 list items of Songhay origin, at least 135 of them shared with Tasawaq, while for many words (eg "four", "green"), only Tasawaq has kept Songhay forms. Well over 227 items have Songhay equivalents in at least one Azawagh Songhay variety, and more than 241 have equivalents either in the Azawagh or in Korandje. If the even more conservative (but extinct) Emghedesie variety were added to the list, that number would no doubt be even larger. Proto-Northern Songhay certainly had a significantly larger Songhay lexicon than any of its descendants does.

[Later addendum]: Removing all words with Arabic-derived Korandje forms from the list makes no difference to the classification; the table ends up like this:

Saturday, June 28, 2014

Sahha Ramdankoum صحّة رمضانكم! This Darja phrase, which might be rendered as "happy Ramadan!", is familiar to any Algerian. It groups with a few others - notably Sahha Ftourkoum صحة فطوركم "happy fast-breaking dinner!" and Sahha Eidkoum صحة عيدكم "happy Eid!" - as an example of a not very productive template "Sahha X+2nd person possessive" expressing good wishes on the occasion of X. But what is "sahha" doing in such forms?

In many contexts, "sahha" is a noun meaning "health"; we can be sure it is a noun, since it can be the object of a preposition and take personal possessive endings, as in b-sahht-ek بصحتك "good for you" (with your health). But there is also a defective verb, taking 2nd person perfective endings: sahhit صحيت (to a man), sahhiti صحيتي (to a woman), sahhitou صحيتو (to a group) "thanks / well done" (a little stronger than sahha "thanks"). The expected 3rd person masculine singular form of this verb would be sahh صح or sahha صحى; sahh actually is attested as an impersonal verb (ysahh-lek يصحلك "it is appropriate for you"), but its meaning is sufficiently distant that it's not necessarily part of the same paradigm. So in principle, "sahha" in "Sahha Ramdanek" could be interpreted as a noun, or a verb. Is there any way to decide which?

If it's a noun, then the phrase's syntax is bizarre - the literal interpretation would then be "Health is your Ramadan", whereas to make it fit the actual meaning we want at least something like "Your Ramadan is health", which would be the opposite order (?Ramdanek Sahha رمضانك صحة). If it's a verb, on the other hand, the syntax is fine - subjects in Algerian Arabic routinely follow the verb, and perfective verbs are routinely used to express states, so we could interpret it as something like "Healthy is your Ramadan!" or even, if we allow the perfective to be optative as in Classical Arabic, "May your Ramadan be healthy!"

On the other hand, if it's a verb, then it should agree in gender and number with what follows it, with feminine "sahhat" صحات and plural "sahhaw" صحاو. This can't actually be tested directly: in all such expressions that I can think of, the noun happens to be masculine and singular, and this expression cannot normally be extended to congratulate people on other occasions. But if we imagine using this formula to congratulate someone on their happiness, I for one would much sooner say "Sahha Farhatkoum" صحة فرحتكم than "Sahhat Farhatkoum" صحات فرحتكم, which suggests that my mind, at least, is not analysing it as a verb.

Perhaps it's neither noun nor verb, then? There are a few words in Algerian Arabic that form predicates and come at the start of the clause, but do not take verbal morphology - for instance, makash ماكاش "there is no" or oulah ولاه "no need (for)". Putting it in this class would take care of the problem, but just leads us to a different one: can this class of non-verbal predicators be given a coherent positive definition, or is it just whatever happens to be left over from defining the major word classes?

Be that as it may, best wishes to all readers for this coming month, and, for those fasting it, Sahha Ramdankoum!

Tuesday, June 24, 2014

The number of good Berber descriptive dictionaries has been slowly but steadily increasing in recent years, but Hassane Benamara's new Dictionnaire amazigh-français : Parler de Figuig et ses régions (Rabat: IRCAM, 2013), which I was lucky enough to be lent a copy of lately, is surely one of the best. Apart from being quite unusually large (800 pages), it incorporates examples, multiple senses, pictures of items difficult to describe, and an appendix with encyclopedic information on culturally specific words such as festivals and children's games. It incorporates a few neologisms useful for schooling, but takes a fairly inclusive attitude towards Arabic loanwords. There are barely 15,000 people in Figuig, but, astonishingly enough, this is actually the second dictionary of Figuig Berber published by a native speaker; the first, Ali Sahli's معجم أمازيغي-عربي (خاص بلهجة أهالي فجيج) (Oujda: Al Anwar Al Maghribia, 2008), was a good effort, but is substantially shorter and uses a less accurate transcription. (There's even another linguist from Figuig, Mohamed Yeou, threatening to make a third dictionary – if he goes ahead with the project, he'll have a high hurdle to clear.)

Across the border in Algeria, the situation is rather different. A number of towns across a wide area around Bechar and Ain Sefra speak Berber varieties closely related to that of Figuig, collectively and imprecisely termed "Shelha". Some of them seem to be shifting to Arabic (on my latest trip, I was told that in Lahmar they had stopped speaking Berber with their children, and for Igli I had heard the same much earlier). But little effort – and no official effort, as far as I know – is being made to document them. The only (very) partial exceptions of which I am aware are Igli and Boussemghoun.

For Igli (population 7000), I have already described the local Scouts' efforts to put together an online dictionary. More recently, however, I came across a laudable local attempt at approaching the problem academically: Fatima Mouili's The Berber Speech of Igli, Language towards Extinction. After a very brief summary of Igli grammar and phonology, unfortunately made frequently illegible by font problems, the author discusses the reasons for language shift. In line with my impressions for the region, including Tabelbala, she cites emigration and the desire to ensure educational success as important drivers; others are more surprising, including the immigration of refugees expelled by the French from a nearby village during the Algerian War of Independence. Apparently, her thesis discusses similar issues, for those with 59€ to spare...

For Boussemghoun (population 4000), a few articles and a book by Mohamed Benali may be cited, all focusing – as far as I can see – exclusively on the sociolinguistic situation of Berber in the town. A local Berber-language poet billed as "the Ait Menguellet of Boussemghoun", Bashir Oulhaj, has a considerable presence on YouTube, eg here; he's even been interviewed by Figuig News. The town seems to be treated as the centre for Amazigh identity in the region; the HCA has even organised a symposium there. Nevertheless, little if any descriptive work has been published on its variety of Berber.

Taken together, there are probably more speakers of Berber in southwestern Algeria than in and around Figuig. Why the difference, then? Is it because linguistics is better represented in Moroccan universities than in Algerian ones? (Notwithstanding some interesting work coming out of Algeria, I think that is fair – it would be hard to think of any linguist working in Algeria with a profile comparable to Abdelkader Fassi Fehri, for example.) Or is it because the Amazigh movement in Morocco is less closely associated with one side in the "culture war"? (Benali observes that, while most Semghounis wanted Berber to be taught in schools, they rejected the installation of an HCA office because they distrusted its politics.) Or are there more specific, purely local factors explaining the difference? That would be worth a study in itself – though perhaps not as much so as the Berber varieties in question!

Wednesday, June 18, 2014

Recently I came across a popular article, Where Did Yiddish Come From?, discussing Paul Wexler's eccentric claim that Yiddish is a "relexified" Slavic language (and Modern Hebrew, in turn, "relexified" Yiddish). To make any sense of this claim, we have to stop and consider what historical linguists mean when they talk about language origins.

If you want to learn a language perfectly, the best way to start is to pick it up as a child from your family and the community they're part of. That way, you and your generation end up speaking the same language as your parents and their generation, modulo a few little innovations you threw in just to annoy them. As those little innovations pile up, generation on generation, sooner or later you end up speaking something that the first generation wouldn't have been able to understand. In such a scenario, everyone agrees, the latest generation's language – let's call it B – is descended from the first generation's (A). If some of the children of that first generation moved far away early on and went through the same process of gradual change, their descendants speak another language, C, which speakers of B can't understand, but which is also descended from A. So we say that B and C belong to the same language family, just as their speakers belong at some remove to the same extended family.

If you're reading this, it's probably too late to learn a language that way. (Sorry.) You can still learn another language, say B, but the odds are that, at best, you'll always speak it with a bit of a foreign accent, and keep using expressions that make sense in English but sound weird to native speakers. If you're just an individual migrant learning it to fit in, that won't matter in the long run – your kids will learn the language in the playground and come back speaking it better than you do. But what if it's not just you that's learning it, but also your spouse, and your brothers, and almost everyone you know? What if your whole community is starting to prefer to speak this language with their kids, instead of the one they grew up with? In that case, the kids will still end up speaking it – but instead of speaking it like natives, they'll probably end up speaking it with your foreign accent and all those expressions of yours that native speakers laugh at. In that scenario, does the kids' language (let's call it D) belong to the same language family as B and C, or not? That's the ambiguity that Wexler is playing with.

The obvious answer – and the one most linguists would give – is yes*. For one thing, assuming you did a half-decent job of learning B, it's the same language – speakers of D can understand speakers of B, and vice versa, even if they laugh at each other's crazy accents. The influence of Gaelic may pervade Irish English, but Irish English is still English, not some Celtic language. It's the vocabulary and the morphology that really make English understandable – a weird accent or a funny way of putting things is just not that big an obstacle on its own. Wexler proposes exactly the opposite criterion: "Yiddish – in contrast to its massive German vocabulary – has a native Slavic syntax and sound system – and thus must be classified as a Slavic language" (1993:5). The origins of Yiddish syntax and phonology I can't comment on, but there's a good reason why historical linguists normally prioritise the vocabulary and the morphology over the syntax and phonology, even apart from the one just given. Vocabulary and morphology are eminently reconstructible, using the comparative method. Phonology, on the other hand, can only be reconstructed from vocabulary, and syntax is notoriously hard to reconstruct at all. If language families were to be defined based on phonology and syntax, it would hardly be possible to define them, much less reconstruct them or state regular correspondences between them.

In short, saying that Yiddish (much less Modern Hebrew) belongs to the Slavic language family is just a word game – in the sense that historical linguists normally use the concept of "language family", it doesn't, and wouldn't even if every last Yiddish speaker happened to be of Slavic ancestry and to speak Yiddish with a heavy Slavic accent. But such word games do not vitiate Wexler's work. After a large enough community has shifted to a different language, it is usually possible to find traces of their former language – although identifying them as such, rather than as later borrowings, may be hard. That's what Wexler is trying to do for Yiddish, and that's how he supports his claim that Yiddish speakers' ancestors used to speak a Slavic language.

* However, the question can easily be made more controversial. Suppose you and your community didn't learn it that well to start with, and aren't trying to imitate native speakers anyway? In that case, the kids will end up speaking something that sounds utterly ridiculous to native speakers; the basic words are recognisable, but the way they're put together seems all wrong. Whatever Tok Pisin is, most people would agree that it's not English. A few people would defend the claim that Tok Pisin belongs to the same family as English, on the basis that that's where the vocabulary comes from, but most would say that it doesn't belong to a language family. The language family model presupposes that the language is being passed on reasonably well as a whole, including not just vocabulary but also some amount of grammar; if all that's learned is a bunch of words, the model breaks down. The border must be drawn somewhere between the extremes of Irish English and Tok Pisin, but linguists can and do disagree on where exactly to draw it.