I just completed my first guest blog post over at mind x the + gap, where I talked about the mutual history of language and commerce, as well as some thoughts on how that will continue into the future. Since the focus of Mil Joshi’s blog is more towards psychology and economics, the following is a slight adaptation more in line with my normal content.

Commerce is a human convention deeply entwined with language. Economic motivations were among the many reasons ancient (and modern) empires conquered other lands, spreading their languages beyond their natural range. Traders would travel to distant lands, encountering speakers of exotic languages. And where two languages meet, words begin to exchange back and forth. In cases where bilingual speakers were few to none, Pidgin languages developed. Pidgins are languages with simplified grammar and vocabulary, and are never spoken as a first language. They come about as a means of communicating between speakers of different languages for the purpose of trade. When a Pidgin is spoken widely enough that children in the community grow up learning it as a first language, the language changes into a Creole. Creoles have many fascinating characteristics, but the point here is, commerce is a driving factor in their creation. When a conquering empire brings its own language, it either supplants the native language or influences it heavily. Pidgins, on the other hand, develop because speakers are motivated to communicate in order to trade.

Groups of speakers who remain in constant contact tend to speak the same dialect of a language. When a group breaks off and becomes isolated (contact with the original group is infrequent or not widespread), their dialects begin to diverge. Mass communication is changing this landscape, allowing larger and larger people groups to remain in constant contact. As a result, minority languages are being spoken even less in favor of popular languages. This process is called linguistic homogenization. If we follow the slippery slope to the extreme, eventually there will be a single language spoken by all people. This eventuality isn’t likely to happen in our lifetimes, and not just because it requires almost all native speakers of a language to die out. A far more likely scenario is that a handful of commerce languages will be spoken by the vast majority of people. Commerce languages are popular languages people speak to do business in (English, Mandarin, etc).

There are many factors driving linguistic homogenization. Commerce is certainly one of them. In the modern world of the internet and mass media, attention is the scarce resource people are competing for. If you want to capture the attention of others, you need to maximize your reach, and doing so typically means choosing a language of commerce. Minority languages present a barrier to the widest possible dissemination of information (except when the only intended audience is speakers of that language). The attention economy promotes linguistic homogenization.

Machine translation services, such as Google Translate, potentially have the power to change this. As the quality of these services improves, it becomes less and less necessary to publish exclusively in commerce languages. Linguistic homogenization may not be the inexorable force it appears to be today. Of course, the output of machine translation can be pretty abysmal. Will the quality of machine translation improve fast enough, and will the business case for it be strong enough to turn the tide of linguistic homogenization? Those betting on machine translation services surely hope so. But there is a dueling problem here. In order for machine translation to truly counteract linguistic homogenization, it has to be freely available (or ridiculously cheap). These systems are difficult to build and require great computational resources. The outcome will almost certainly be a matter of economics as well as science.

While the future progress of commerce and language may be uncertain, what is certain is that they will continue to heavily influence each other. And there’s nothing new about that.

I hereby declare that the word literally has not lost its meaning, despite a rash of rumors to the contrary.

What would it even mean for a word to lose its meaning? A word can change from one meaning to another, certainly. Maybe you could argue that a word that has dropped out of usage has lost its meaning.

You hear complaints of that sort all the time, but what is being missed is the fact that language is fluid. Meanings evolve as the need arises (and there are many kinds of needs). Speakers each carry a somewhat different representation of the language in their heads, and once like-minded speakers agree on a novel usage and adopt it into their own representations, language evolves.

The debate over literally is literally nothing new. Turning to old faithful, the American Heritage dictionary:

Usage Note: For more than a hundred years, critics have remarked on the incoherency of using literally in a way that suggests the exact opposite of its primary sense of “in a manner that accords with the literal sense of the words.” In 1926, for example, H.W. Fowler cited the example “The 300,000 Unionists … will be literally thrown to the wolves.” The practice does not stem from a change in the meaning of literally itself—if it did, the word would long since have come to mean “virtually” or “figuratively”—but from a natural tendency to use the word as a general intensive, as in They had literally no help from the government on the project, where no contrast with the figurative sense of the words is intended.

So literally has been known to be a general intensive for quite some time. Why the fuss now?

Twitter is my new linguistic data collection engine, btw. Just some of the multitude of great results:

ipodrulz: My dog is whining because I’m keeping her up! She hates it when she’s asleep and I’m not… fucking bitch – literally.

Whenever I hear the word enormity used to describe how gi-freakin-normous something is, I always willfully misinterpret it to mean an act of extreme evil or extreme wickedness. Now before you start screaming prescriptivist and throwing Kleenexes drenched in the snot of sociolinguistics at me — I’m not being a prescriptivist. Of course people have the right to use enormity that way. It is certainly the trend for that word and it probably will be within my generation that almost everyone forgets its original meaning. I just so like the meaning of extreme wickedness that I want to be able to use it to mean that without being misinterpreted. And a lot of people only know that word to mean gigantic.

So I was listening to a promo video (below) by Richard Branson of Virgin Galactic. Branson opens up with this line:

“Astronauts of the past 45 years have all returned to Earth struggling to convey the enormity of what they have discovered and with their perceptions clearly changed.”

And quite frankly, the sinister music blends with my interpretation of enormity far better. Astronauts have all returned overwhelmed by the vast wickedness they encountered in space. Awesome! I totally wanna go now. Actually, I’ve always wanted to go and probably would go even if I was told I had a 50/50 chance of making it back alive, so enormity just ups the thrill level.

In previous posts on cognate identification, I discussed the difference between strict and loose cognates. Loose cognates are words in two languages that have the same or similar written forms. I also described how approaches to cognate identification tend to differ based on whether the data being used is plain text or phonetic transcriptions. The type of data informs the methods. With plain text data, it is difficult to extract phonological information about the language so approaches in the past have largely been about string matching. I will discuss some of the approaches that have been taken below the jump. In my next posting, when I get around to it, I will begin looking at some of the phonetic methods that have been applied to the task.

In my previous post on cognate identification, I gave two definitions for cognates: strict and loose (orthographic). Strict cognates are words in two related languages that descended from the same word in the ancestor language. Loose cognates are words in two languages that are spelled or pronounced similarly (depending on whether the data consists of plain text or phonetic transcriptions). These two definitions help form the basis for how I choose to classify approaches to doing cognate identification, but the source of data is the bigger factor, in my opinion. The orthographic approach looks at plain text and attempts to do some sort of string matching or statistical correlation based on the written (typeset) characters of the language. The phonetic approach relies on phonetic transcriptions of words in the language. Phonetic transcriptions are usually done in the International Phonetic Alphabet (IPA) but any standard form of representing sounds will work. One such example is the Carnegie Mellon Pronouncing Dictionary. Phonetic approaches may use string matching techniques, but there are also a number of inductive methods based on phonology that have been tried to good effect.
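To make "string matching" a little more concrete, here is a minimal sketch of one well-known string-similarity score, the longest common subsequence ratio (LCSR). This is only an illustration of the general idea, not the method of any particular paper I reviewed; the function names `lcs_length` and `lcsr` are my own, and the score is simply the LCS length divided by the length of the longer word.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Characters match: extend the best subsequence ending before both.
                curr.append(prev[j - 1] + 1)
            else:
                # No match: carry over the better of the two neighbouring cells.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer word's length."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


print(lcsr("night", "nacht"))  # 0.6 (shared subsequence: n, h, t)
```

A score near 1 means the two written forms are nearly identical. Where to draw the line between "similar" and "different" is a judgment call that depends on the language pair and the data.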

So a good question might be why does the data being used matter so much to these techniques? Why not classify the two approaches as to whether they look for loose or strict cognates? Might there not be another way of classifying the approaches to cognate identification beyond these two? Or is there an entirely different set of classes that would better describe them? To answer the last two questions, I will say that there very well may be better ways of classifying these algorithms. As Anil pointed out in the comments to my last post, the two definitions lend themselves to different applications. From the papers that I read, it seemed that when researchers looked at plain text data, there was a completely different mindset than in papers where researchers used phonetic transcriptions. For the former, the goal was usually finding translational equivalences in bitext, and for the latter, the goal was more to aid linguists attempting to reconstruct dead languages or establish relationships between languages.

With plain text, it is very difficult to infer sound correspondences between two languages. In Old English, the orthography developed by scribes corresponded directly to the spoken form. As English changed over the 1000+ years since then, the orthographic forms of words have frozen in some cases and not in others. For example, the word knight was originally spelled cniht and the c and h were both pronounced. The divergence of orthographic and phonetic forms can result in any number of problems and so it influences the ways of thinking about the task. On the other hand, phonetic approaches suffer due to data scarcity. Obtaining phonetic transcriptions is expensive as it requires the effort of linguists or individuals with specific, extensive training in the area. There are ways of obtaining phonetic transcriptions automatically, but these methods are not perfect and so result in noisy data, making this data practically useless for historical linguists.

In my next post, I will go into orthographic approaches in more detail, describing some of the papers I looked at and the methods they used. After that, I will begin discussing phonetic approaches, which are more numerous. I will also begin to look at how machine learning is being used to tackle cognate identification.

I recently finished a literature review for my Language & Statistics 2 class. The topic was computational models of historical linguistics and my partner and I focused on cognate identification and phylogenetic inference. We split the work and my part was cognate identification. So I decided to blog about it for a bit and maybe someone out there will have something to offer. Granted, that won’t help my grade, but improving my understanding is more important. You can also check out our presentation.

First of all, to frame the problem, historical linguistics is a branch of linguistics that studies language change. Language can change in many ways, but the methods we looked at pretty much solely focused on phonological and semantic changes, with a few brief nods to syntactic change (on the phylogenetic inference side). The main tool used by historical linguists in reconstructing dead languages is the comparative method. This method looks at two languages suspected of being related and tries to infer the regular sound changes that led to the divergence. By examining lists of suspected cognates, they find sound correspondences — sounds that appear in similar contexts in both languages, but which aren’t necessarily the same phoneme. For example, the word for beaver in English and German derives from the Proto-Germanic word *bebru. In Old English, this became beofor (the f sounds like a /v/). In modern German, the word is Biber, with the /b/ phoneme preserved as it was in Proto-Germanic. So we could infer a sound correspondence between English /v/ and German /b/ in this context.
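The counting step behind sound correspondences can be sketched in a few lines. Everything in this example is an assumption for illustration: the segment lists are rough, simplified transcriptions that I aligned by hand (with "-" marking a gap), and `count_correspondences` is a hypothetical helper, not a tool historical linguists actually use. Real work would start from careful IPA data and would also look at the context each sound appears in.

```python
from collections import Counter

# Toy English/German cognate pairs, hand-aligned segment by segment.
# The transcriptions are deliberately simplified; "-" marks a gap.
ALIGNED_PAIRS = [
    (["b", "i", "v", "ə", "r"], ["b", "i", "b", "ə", "r"]),  # beaver / Biber
    (["s", "ɛ", "v", "ə", "n"], ["z", "i", "b", "ə", "n"]),  # seven / sieben
    (["g", "ɪ", "v", "-", "-"], ["g", "e", "b", "ə", "n"]),  # give / geben
]


def count_correspondences(aligned_pairs):
    """Count how often each pair of differing segments lines up."""
    counts = Counter()
    for left, right in aligned_pairs:
        if len(left) != len(right):
            raise ValueError("aligned sequences must have equal length")
        for seg_l, seg_r in zip(left, right):
            # Skip identical segments and gaps; only mismatches are interesting.
            if seg_l != seg_r and "-" not in (seg_l, seg_r):
                counts[(seg_l, seg_r)] += 1
    return counts


print(count_correspondences(ALIGNED_PAIRS).most_common(1))  # [(('v', 'b'), 3)]
```

A pairing that keeps recurring across many suspected cognates, like English /v/ against German /b/ here, is the kind of regularity the comparative method looks for. Three words is of course far too few to conclude anything.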

So what are cognates? If you have studied a second language, you no doubt have heard this term. I propose the following two classifications for cognates. A loose cognate will be a pair of words in two languages that is spelled or pronounced the same, with some minor variations. In this way, French résumé and English resumé would be considered cognates. Loose cognates have also been called orthographic cognates. A strict cognate is a pair of words in two related languages that descended from the same word in the ancestor language. Loan words are words that come into a language directly from another language, such as resumé. These words do not undergo the regular sound changes that are observed in strict cognates and so they are not considered cognates at all by historical linguists.
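As a rough illustration of "spelled the same, with some minor variations", here is a small sketch of a loose-cognate test that ignores case and diacritics before comparing spellings. The helper names and the 0.8 threshold are my own arbitrary assumptions, and Python's `difflib.SequenceMatcher` is just a convenient stand-in for whatever similarity measure one prefers.

```python
import unicodedata
from difflib import SequenceMatcher


def strip_accents(word: str) -> str:
    """Remove combining marks, so that e.g. 'é' is compared as 'e'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_loose_cognate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two words as loose cognates if their normalized spellings are similar.

    The threshold is an arbitrary illustrative cutoff, not an established value.
    """
    a_norm = strip_accents(a).lower()
    b_norm = strip_accents(b).lower()
    return SequenceMatcher(None, a_norm, b_norm).ratio() >= threshold


print(is_loose_cognate("résumé", "resumé"))  # True
```

Note that a test like this says nothing about where a word came from. It accepts loan words just as readily as inherited ones, which is exactly why it cannot stand in for the strict definition.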

What is the effect the distinction between these two definitions would have on computational approaches to this task? I will look at this further in a future post, but feel free to post your thoughts in the comments.

Language Log brought up the usage of the phrase another thing coming today. This is the only way I’ve ever heard it or seen it used. But it turns out, the original is another think coming. The thing version is winning out on the interwebs, but the post on Language Log indicates that the two phrases may have been warring since their (mutual?) inceptions. It’s no surprise to me that thing would replace think in this case, for simple phonological reasons. The [k] in think is preceded by a voiced nasal sound (the vocal cords are vibrating) and then followed by an unvoiced velar stop (aka plosive, but essentially another [k] sound). The phenomenon of assimilation occurs when a phoneme changes to reflect the surrounding phoneme(s). In this case, the [k] probably originally became voiced, which would make it a [g] sound. The [k] and [g] sounds are essentially the same; the only difference is whether your vocal cords are vibrating. So, assimilation generated thing instead of think in regular speech, and since that is a well-known word, people interpreted it as thing instead of think when they were first exposed to it. From there it has been gaining steam.

Another interesting example of a similar nature is hone in on versus the original home in on.