I just completed my first guest blog post over at mind x the + gap, where I talked about the mutual history of language and commerce, as well as some thoughts on how that will continue into the future. Since the focus of Mil Joshi’s blog is more towards psychology and economics, the following is a slight adaptation more in line with my normal content.

Commerce is a human convention deeply entwined with language. Economic motivations were among the many reasons ancient (and modern) empires conquered other lands, spreading their languages beyond their natural range. Traders would travel to distant lands, encountering speakers of exotic languages. And where two languages meet, words begin to exchange back and forth. In cases where bilingual speakers were few to none, Pidgin languages developed. Pidgins are languages with simplified grammar and vocabulary, and are never spoken as a first language. They come about as a means of communicating between speakers of different languages for the purpose of trade. When a Pidgin is spoken widely enough that children in the community grow up learning it as a first language, the language changes into a Creole. Creoles have many fascinating characteristics, but the point here is, commerce is a driving factor in their creation. When a conquering empire brings its own language, it either supplants the native language or influences it heavily. Pidgins, on the other hand, develop because speakers are motivated to communicate in order to trade.

Groups of speakers who remain in constant contact tend to speak the same dialect of a language. When a group breaks off and becomes isolated (contact with the original group is infrequent or not widespread), their dialects begin to diverge. Mass communication is changing this landscape, allowing larger and larger groups of people to remain in constant contact. As a result, minority languages are being spoken even less in favor of popular languages. This process is called linguistic homogenization. If we follow the slippery slope to the extreme, eventually there will be a single language spoken by all people. This eventuality isn’t likely to happen in our lifetimes, and not just because it requires almost all native speakers of a language to die out. A far more likely scenario is that a handful of commerce languages will be spoken by the vast majority of people. Commerce languages are popular languages people speak to do business in (English, Mandarin, etc.).

There are many factors driving linguistic homogenization. Commerce is certainly one of them. In the modern world of the internet and mass media, attention is the scarce resource people are competing for. If you want to capture the attention of others, you need to maximize your reach, and doing so typically means choosing a language of commerce. Minority languages present a barrier to the widest possible dissemination of information (except when the only intended audience is speakers of that language). The attention economy promotes linguistic homogenization.

Machine translation services, such as Google Translate, potentially have the power to change this. As the quality of these services improves, it becomes less and less necessary to publish exclusively in commerce languages. Linguistic homogenization may not be the inexorable force it appears to be today. Of course, the output of machine translation can be pretty abysmal. Will the quality of machine translation improve fast enough, and will the business case for it be strong enough to turn the tide of linguistic homogenization? Those betting on machine translation services surely hope so. But there is a dueling problem here. In order for machine translation to truly counteract linguistic homogenization, it has to be freely available (or ridiculously cheap). These systems are difficult to build and require great computational resources. The outcome will almost certainly be a matter of economics as well as science.

While the future progress of commerce and language may be uncertain, what is certain is that they will continue to heavily influence each other. And there’s nothing new about that.

The standard way of doing human evaluations of machine translation (MT) quality for the past few years has been to have human judges grade each sentence of MT output against a reference translation on measures of adequacy and fluency. Adequacy is the level at which the translation conveys the information contained in the original (source language) sentence. Fluency is the level at which the translation conforms to the standards of the target language (in most cases, English). The judges give each sentence a score for both in the range of 1-5, similar to a movie rating. It became apparent early on that not even humans correlate well with each other. One judge may be sparing with the number of 5’s he gives out, while another may give them freely. The same problem crops up in recommender systems, which I have talked about in the past.

It matters how well judges can score MT output, because that is the evaluation standard by which automatic metrics for MT evaluation are judged. The better an MT metric correlates with how human judges would rate sentences, the better. This not only helps properly gauge the quality of one MT system over another, it drives improvements in MT systems. If judges don’t correlate well with each other, how can we expect automatic methods to correlate well with them? The standard practice now is to normalize the judges’ scores in order to help remove some of the bias in the way each judge uses the rating scale.
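To make the normalization step concrete, here is a minimal sketch in Python. It assumes per-judge z-scoring (subtract the judge's mean, divide by the judge's standard deviation), which is one common choice and not necessarily what any particular evaluation campaign uses. The function name and the toy ratings are made up for illustration.

```python
from statistics import mean, pstdev

def normalize_by_judge(ratings):
    """Z-score each judge's ratings so that strict and generous judges
    end up on a comparable scale.

    ratings: dict mapping a judge name to that judge's list of raw 1-5 scores.
    Returns a dict with the same keys and lists of normalized scores.
    """
    normalized = {}
    for judge, scores in ratings.items():
        mu = mean(scores)
        sigma = pstdev(scores)
        if sigma == 0:
            # A judge who gives every sentence the same score carries no
            # ranking information; map everything to 0.
            normalized[judge] = [0.0 for _ in scores]
        else:
            normalized[judge] = [(s - mu) / sigma for s in scores]
    return normalized

# Hypothetical data: two judges who agree on the ordering of four
# sentences but use different parts of the 1-5 scale.
ratings = {
    "strict_judge": [1, 2, 2, 3],
    "generous_judge": [3, 4, 4, 5],
}
normalized = normalize_by_judge(ratings)
```

After normalization the two judges produce the same scores, even though their raw ratings never overlap at the extremes. This only removes differences in offset and spread; it does nothing for judges who genuinely disagree about which sentence is better.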

Vilar et al. (2007) propose a new way of handling human assessments of MT quality: binary system comparisons. Instead of giving a rating on a scale of 1-5, they propose that judges compare the output from two MT systems and simply state which is better. The definition of what constitutes “better” is left vague, but judges are instructed not to specifically look for adequacy or fluency. By mixing up the sentences so that one judge is not judging the output of the same system (which could introduce additional bias), this method should simplify the task of evaluating MT quality while leading to better intercoder agreement.

The results were favorable and the advantages of this method seem to outweigh the fact that it requires more comparisons than the previous method required ratings. The total number of ratings for the previous method was two per sentence: O(n), where n is the number of systems (the number of sentences is constant). Binary system comparisons require more ratings because the systems have to be ordered: O(log n!). In most MT comparison campaigns the difference is negligible, but it becomes increasingly pronounced as n increases.
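To get a feel for how the two costs grow, here is a small Python sketch that counts judgments for a single sentence. It assumes two ratings (adequacy and fluency) per system output for the old method, and it uses the ceiling of log2(n!) as the number of comparisons, which is the information-theoretic minimum for ordering n items by pairwise comparison. A real campaign may need more than that minimum. The function names are mine.

```python
import math

def ratings_needed(num_systems, ratings_per_output=2):
    """Adequacy + fluency ratings for one sentence across all systems."""
    return ratings_per_output * num_systems

def comparisons_needed(num_systems):
    """Minimum pairwise comparisons needed to fully order the systems
    on one sentence: ceil(log2(n!))."""
    return math.ceil(math.log2(math.factorial(num_systems)))

# Print the two counts side by side for a range of campaign sizes.
for n in (2, 4, 8, 16, 32):
    print(n, ratings_needed(n), comparisons_needed(n))
```

Since log n! grows like n log n, the comparison count eventually pulls away from the linear rating count as more systems are added.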

What would be interesting to me is a movie recommendation system that asks you a similar question: which do you like better? Of course, this means more work for you. The standard approaches for collaborative filtering would have to change. For example, doing singular value decomposition on a matrix of ratings would no longer be possible when all you have are comparisons between movies. Also, people will still disagree with themselves (in theory). You might say National Treasure was better than Star Trek VI, which was better than Indiana Jones and the Last Crusade, which was better than National Treasure. You’d have to find some way to deal with cycles like this (ignoring it is one way).
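Finding such cycles is at least easy. Here is a sketch in Python that treats each "A is better than B" answer as a directed edge and runs a depth-first search to find a cycle. The function name and data layout are hypothetical, and what to do with a cycle once you find one (ignore it, drop the weakest edge, ask again) is left open.

```python
def find_cycle(preferences):
    """Return one preference cycle as a list of items (first == last),
    or None if the preferences are consistent.

    preferences: iterable of (better, worse) pairs.
    """
    graph = {}
    for better, worse in preferences:
        graph.setdefault(better, []).append(worse)
        graph.setdefault(worse, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                # Reached a node already on the current path: a cycle.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

prefs = [
    ("National Treasure", "Star Trek VI"),
    ("Star Trek VI", "Indiana Jones and the Last Crusade"),
    ("Indiana Jones and the Last Crusade", "National Treasure"),
]
cycle = find_cycle(prefs)
```

For the three movies above, the search returns a path that starts and ends at the same title. The recursion is fine for a toy example; a large preference graph would want an iterative search.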

I think numbers 1-3, 5, and 6 are almost certainly doable (though they all lie outside of my expertise). Number four will at least make very long strides towards being widespread and easy to use. I seriously doubt it will be perfect (and by perfect I mean as good as a trained translator). Number 7 I have no idea about, but getting management to understand the exact benefits of IT has been elusive for the past twenty-five years. I doubt IT managers even have that kind of understanding about it. There are so many variables. As humans are made to become more and more slaves to their corporate overlords (I mean protectors), perhaps productivity will become more predictable.

Systran is one of the oldest companies around that provide machine translation software. They power some language-pairs of Microsoft’s translation service, Altavista’s Babelfish, and quite a few others (including, until recently, Google). In the past, their software has been rule-based, so translation is done with a bilingual dictionary and a set of rules of how to change text from one language into another. Based on a recent bevy of job postings on Linguist List, it appears they are going statistical. Maybe they have been for a while, I don’t know, since I don’t actually follow what they do, but this piqued my interest.

Stepping back in time in MT Eval from my last post, Liu and Gildea (2005) were among the first to really bring syntactic information to evaluating machine translation output. They proposed three metrics for evaluating machine hypotheses: the subtree metric (STM), the tree kernel metric (TKM), and the headword chain metric (HWCM). STM and TKM also had variants for dependency trees, which HWCM relies on. Owczarzak et al. (2007) extended HWCM from dependency parses to LFG parses. HWCM has attracted more attention since it showed better correlation at the sentence level than either STM or TKM (both versions) and outperformed BLEU on longer n-grams. It’s interesting to note, though, that the dependency-based tree kernel metric performed best of all at the corpus level. Sentence level granularity is typically more important for helping you tune your MT system.

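The subtree metric is a fairly straightforward idea. You begin by parsing both the hypothesis and the reference sentences using a parser like Charniak or Collins to get a Penn TreeBank style phrase structure tree. You then compare all subtrees in the hypothesis to the reference trees, clipping the count of each matching subtree at the largest number of times it appears in any one reference tree. The formula is given below:

$$\mathrm{STM} = \frac{1}{D} \sum_{n=1}^{D} \frac{\sum_{t \in \mathrm{subtrees}_n(\mathrm{hyp})} \mathrm{count}_{\mathrm{clip}}(t)}{\sum_{t \in \mathrm{subtrees}_n(\mathrm{hyp})} \mathrm{count}(t)}$$

Here D is the maximum subtree depth considered, count(t) is the number of times subtree t appears in the hypothesis tree, and count_clip(t) is that count clipped by the reference trees.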
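Here is a rough Python sketch of the subtree comparison. It averages, over subtree depths 1 to D, the fraction of hypothesis subtrees that also show up in a reference, with counts clipped at the best single reference. Several things here are my own simplifications rather than the paper's setup: trees are nested tuples like ("S", ("NP", "John"), ("VP", "resigned")) instead of real parser output, words count as depth-1 subtrees, and a "subtree of depth n" is a node together with everything below it cut off at n levels. All function names are made up.

```python
from collections import Counter

def height(tree):
    """Number of levels in the tree; a bare word has height 1."""
    if isinstance(tree, str):
        return 1
    return 1 + max((height(child) for child in tree[1:]), default=0)

def truncate(tree, depth):
    """Keep only the top `depth` levels of the tree (hashable result)."""
    if isinstance(tree, str):
        return tree
    if depth == 1:
        return tree[0]
    return (tree[0],) + tuple(truncate(c, depth - 1) for c in tree[1:])

def nodes(tree):
    """Yield every node of the tree, words included."""
    yield tree
    if not isinstance(tree, str):
        for child in tree[1:]:
            yield from nodes(child)

def subtrees_of_depth(tree, depth):
    """Count the depth-limited subtrees rooted at nodes that are deep enough."""
    return Counter(truncate(n, depth) for n in nodes(tree) if height(n) >= depth)

def stm(hypothesis, references, max_depth=3):
    """Average clipped subtree precision over depths 1..max_depth."""
    total = 0.0
    for depth in range(1, max_depth + 1):
        hyp_counts = subtrees_of_depth(hypothesis, depth)
        ref_counts = [subtrees_of_depth(r, depth) for r in references]
        denom = sum(hyp_counts.values())
        if denom == 0:
            continue  # hypothesis has no subtree this deep; contributes 0
        clipped = sum(
            min(count, max(rc[t] for rc in ref_counts))
            for t, count in hyp_counts.items()
        )
        total += clipped / denom
    return total / max_depth

hyp = ("S", ("NP", "John"), ("VP", "resigned"))
ref = ("S", ("NP", "John"), ("VP", "quit"))
score = stm(hyp, [ref])
```

A hypothesis scored against itself gets 1.0 as long as max_depth does not exceed the height of the tree. In the toy pair above the trees share most of their shallow subtrees but nothing at depth 3, so the score lands well below 1.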

The tree kernel metric uses convolution kernels discussed by Collins and Duffy (2001). For the specifics of this method, I refer you to the respective papers (and I may post on it at a later date), but the general idea is that you can transform structured data (a tree) into a feature vector by using the kernel trick. Finding all subtrees of a tree can be exponential in the size of the sentence, which would make computation infeasible for large sentences. The kernel trick lets us operate in this exponentially-high-dimensional space with a polynomial time algorithm. Once we have constructed the feature vectors for the hypothesis and reference trees, we can score them with their cosine similarity:

$$\cos\big(H(T_1), H(T_2)\big) = \frac{H(T_1) \cdot H(T_2)}{\lVert H(T_1) \rVert \, \lVert H(T_2) \rVert}$$

H(T1) and H(T2) are vectors with non-zero values for subtrees (dimensions) that appear in each tree, so the dot product of the two is the number of subtrees in common. The score is computed as the maximum cosine similarity between the hypothesis and the references.
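For small trees you can skip the kernel trick and compute the same cosine directly from explicit subtree counts. The Python sketch below does that, taking the maximum over the references. It is only meant to show what the score measures: the explicit feature vectors are exactly what the kernel computation avoids building, and the string feature names in the example are placeholders I made up, not real subtree encodings.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(count * b.get(feat, 0) for feat, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def tkm_score(hyp_vec, ref_vecs):
    """Best cosine similarity between the hypothesis and any reference."""
    return max(cosine(hyp_vec, ref) for ref in ref_vecs)

# Hypothetical subtree counts for one hypothesis and two references.
hyp_vec = Counter({"S->NP VP": 1, "NP->John": 1, "VP->resigned": 1})
ref_vecs = [
    Counter({"S->NP VP": 1, "NP->John": 1, "VP->quit": 1}),
    Counter({"S->NP VP": 1, "NP->he": 1, "VP->quit": 1}),
]
best = tkm_score(hyp_vec, ref_vecs)
```

The first reference shares two of three subtrees with the hypothesis and the second shares one, so the first reference determines the score.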

Finally, the headword chain metric (HWCM) relies on dependency parses, which I touched on in my previous post.

In dependency grammars, a tree is built by linking a word to its head. So a determiner would be linked to the noun it modifies, the direct object would be linked to the verb, etc. Each link of this sort is a headword chain of length 2. As you build up the tree, you can construct longer and longer headword chains.

The HWCM score is calculated just like the STM except by comparing headword chains. The difference between the HWCM and the dependency version of the STM is that STM considers all subtrees whereas HWCM only looks at direct mother-daughter relations (no cousins or sisters).
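Headword chains are easy to extract once you have the dependency links. The Python sketch below walks from each word up through its heads, recording every chain along the way, and then scores a hypothesis against references by averaging the clipped match rate over chain lengths (the same shape as STM). The input format (a word list plus a parallel list of head indices), the treatment of single words as chains of length 1, and the function names are my assumptions, not the paper's.

```python
from collections import Counter

def headword_chains(words, heads, max_length=3):
    """Count headword chains of length 1..max_length.

    words: list of tokens.
    heads: heads[i] is the index of the head of words[i], or None for the root.
    A chain is a tuple of words read from the topmost head down.
    """
    chains = Counter()
    for i in range(len(words)):
        chain = [words[i]]
        chains[tuple(chain)] += 1
        node = i
        while heads[node] is not None and len(chain) < max_length:
            node = heads[node]
            chain.insert(0, words[node])
            chains[tuple(chain)] += 1
    return chains

def hwcm(hyp_chains, ref_chains_list, max_length=3):
    """Average clipped chain precision over chain lengths 1..max_length."""
    total = 0.0
    for n in range(1, max_length + 1):
        hyp_n = {c: k for c, k in hyp_chains.items() if len(c) == n}
        denom = sum(hyp_n.values())
        if denom == 0:
            continue  # no chains of this length in the hypothesis
        clipped = sum(
            min(k, max(ref[c] for ref in ref_chains_list))
            for c, k in hyp_n.items()
        )
        total += clipped / denom
    return total / max_length

# "the dog barked": the -> dog -> barked (root)
hyp = headword_chains(["the", "dog", "barked"], [1, 2, None])
ref = headword_chains(["a", "dog", "barked"], [1, 2, None])
score = hwcm(hyp, [ref])
```

For "the dog barked" this produces three single words, two chains of length 2, and one chain of length 3. Swapping the determiner breaks every chain that passes through it, which is why the score drops faster than a plain word overlap would.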

References

Ding Liu and Daniel Gildea. 2005. Syntactic Features for Evaluation of Machine Translation. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization at the Association for Computational Linguistics Conference 2005, Ann Arbor, Michigan.

Since Papineni et al. (2002) introduced the BLEU metric for machine translation evaluation, string matching functions have dominated the field. These metrics work well enough, but there are cases where they break down and more and more research is revealing their biases. Also, BLEU does not correlate especially well with human judgments, so the quality of MT would benefit from a metric that better captures what makes a good translation.

A recent trend in this direction has been to introduce linguistic information in MT eval. Liu and Gildea (2005) used unlabeled dependency trees to extract headword chains from machine and reference translations to evaluate MT output. To define a few terms, reference translations are human translations that machine translations are compared to during evaluation. In dependency grammars, a tree is built by linking a word to its head. So a determiner would be linked to the noun it modifies, the direct object would be linked to the verb, etc. Each link of this sort is a headword chain of length 2. As you build up the tree, you can construct longer and longer headword chains. Liu and Gildea compared the headword chains constructed for both machine and reference translations and produced a metric based on comparing the two sets of headword chains. These chains were not annotated with any sort of grammatical relation (subject, object, etc.), so they are unlabeled dependencies.

Owczarzak et al. (2007) have extended the work by Liu and Gildea (2005) using labeled dependencies. They parsed the pairs of sentences with a Lexical Functional Grammar (LFG) parser by Cahill et al. (2004). In LFG, there are two components of every parse: a c-structure (i.e. a parse tree) and an f-structure, which describes the features of the lexical items. An example of an LFG parse from their paper is given below. F-structures are recursive structures with a head containing all of its constituents. From the f-structure it is easy to construct dependency trees. The bonus is that the f-structure provides the grammatical relations between items in the dependency trees. In the example below, the dependency subj(resign, john) has the grammatical relation of subject. That is, John is the subject of the sentence headed by the verb resigned.

Their metric is then simply a comparison of these labeled dependency headword chains using precision and recall to compute the f-score (harmonic mean). One of the coolest things in the paper is how they handle parser noise. Statistical parsers are not perfect. They estimate probabilities for rules from labeled data. In natural language, variation is pretty much unlimited, so no matter how big the training corpus, there will always be things the parser has never seen before. Also, we are dealing with imperfect input (from the MT systems or humans), so the problem of noise could be even greater. They address this by running 100 sentences through the various MT metrics they are comparing (including their own) as both the reference and the machine translation. This produces the “perfect score” for each metric, since the two are identical. Next, adjuncts are rearranged in the sentence so that the resulting meaning has not been changed, but the structure has. Each MT metric now evaluates the new sentence compared to the original and computes a score. For the LFG parse, the f-structure should remain the same in both cases, so any divergence can be attributed to parser noise. In order to reduce this noise, they used the n-best parses and were able to increase the f-score, bringing it closer to the baseline (ideal). So instead of just comparing the best parse for the reference and machine translation, they combine the n-best parses to compute the f-score.
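The core comparison is simple enough to sketch in Python. The version below treats each labeled dependency as a (relation, head, dependent) triple and computes precision, recall, and the balanced f-score over the two multisets of triples. The triple format, the function name, and the example dependencies are my own stand-ins. It works from a single parse per sentence and leaves out the n-best combination entirely.

```python
from collections import Counter

def dependency_f_score(hyp_deps, ref_deps):
    """Precision, recall, and F1 over labeled dependency triples.

    hyp_deps, ref_deps: iterables of (relation, head, dependent) tuples.
    """
    hyp = Counter(hyp_deps)
    ref = Counter(ref_deps)
    # Multiset intersection: each triple matches at most as often as it
    # occurs on both sides.
    matched = sum((hyp & ref).values())
    if matched == 0:
        return 0.0, 0.0, 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical triples for a machine translation and its reference.
hyp_deps = [("subj", "resign", "john"), ("adjunct", "resign", "today")]
ref_deps = [
    ("subj", "resign", "john"),
    ("adjunct", "resign", "yesterday"),
    ("tense", "resign", "past"),
]
precision, recall, f1 = dependency_f_score(hyp_deps, ref_deps)
```

Only the subject dependency matches here, so precision is one out of two and recall is one out of three. Because the label is part of the triple, getting the right words in the wrong grammatical relation earns no credit.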

The result is that they get correlations with human judgments competitive with the best system they compare themselves to (METEOR, Banerjee and Lavie, 2005), beating it for fluency and coming in a close second overall. As far as future work goes, there are quite a few extensions they mention in the paper. The LFG parser produces 32 different types of grammatical relations. In the current setup, they are all weighted the same, but they would like to try tuning the weights to see how that affects the score. Another extension they propose is using paraphrases derived from a parallel corpus. There has been other work done on paraphrasing for MT evaluation (notably Russo-Lassner et al., 2005). One thing I am curious about is whether changing the weight on the harmonic mean would have an impact on correlation. METEOR uses the F9-score while the typical thing to do is F1. It’s not clear that weighting precision and recall equally is the best thing to do.
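For reference, here is what changing the weight on the harmonic mean looks like in Python. This is the standard F-beta formula, where beta greater than 1 favors recall. As I understand METEOR's harmonic mean, 10PR / (R + 9P), it corresponds to beta = 3 (so beta squared is the 9) in this parameterization. The function name is mine.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1 weights them equally (F1); beta > 1 weights recall more.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A system with high precision but low recall looks noticeably worse
# once recall is weighted more heavily.
balanced = f_beta(0.8, 0.4, beta=1.0)
recall_heavy = f_beta(0.8, 0.4, beta=3.0)
```

With precision 0.8 and recall 0.4, the recall-heavy score comes out lower than the balanced one, so which weighting you pick can change how systems rank against each other.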

Interesting stuff, though. I hope they continue the work and maybe we’ll see something in this year’s ACL.

Update

Karolina Owczarzak has confirmed they were using the F1 score and that different F-scores did not lead to significant improvements. I also added the image I forgot to include in the original post.


At ACL this year, the Third Workshop on Statistical Machine Translation will be held and they are featuring a shared task on MT evaluation. The shared task will involve evaluating output from the shared translation task, which will be released on March 24th, with short papers and rankings due on April 4th. I created an MT evaluation system (pdf) last year for a class (on MT, no less), though I doubt it would do particularly well. I outperformed BLEU, but fell short of METEOR. In any case, it might be interesting to play with the data and certainly will be interesting to read the papers. My system does perform sentence-level ranking as one of its primary goals, which is also a goal stated by the shared task.