Monday, May 12, 2014

Automated natural language processing

Natural language processing is all about getting computers to automatically extract information from natural (human) languages, rather than from specially designed computer languages, or even from mathematical datasets.

Each year the Conference on Computational Natural Language Learning (CoNLL) features a practical task, in which participants train and test their own language-parsing systems on exactly the same natural-language datasets. For the tenth CoNLL (CoNLL-X), in 2006, the task was Dependency Parsing. (Previous tasks had included chunking, clause identification, named entity recognition, and semantic role labeling.)

Parsing refers to identifying the words, their associated parts of speech (noun, verb, etc.) and their syntactic relations (subject, predicate, etc.) based on the formal rules of grammar. In computational linguistics the result is often represented as a tree diagram showing the relationships among the words. From this tree we can try to understand the exact meaning of the text. Wikipedia, of course, has an article with more details, if you are interested.
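To make the idea of a parse tree concrete, here is a minimal sketch of one way to store such a tree, in which each word records the position of the word that governs it. The Token class, the tag names and the example sentence are all hypothetical illustrations of mine, not the format of any particular treebank.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int      # 1-based position of the word in the sentence
    form: str       # the word itself
    pos: str        # part-of-speech tag (hypothetical tag names)
    head: int       # index of the governing word; 0 marks the root
    relation: str   # syntactic relation to the head (hypothetical labels)


def children(sentence, head_index):
    """Return the tokens directly governed by the token at head_index."""
    return [t for t in sentence if t.head == head_index]


def root(sentence):
    """Return the single token attached to the artificial root (head 0)."""
    roots = children(sentence, 0)
    if len(roots) != 1:
        raise ValueError("expected exactly one root token")
    return roots[0]


# Hypothetical hand-parsed sentence: "The dog chased cats"
sentence = [
    Token(1, "The", "DET", 2, "det"),
    Token(2, "dog", "NOUN", 3, "subject"),
    Token(3, "chased", "VERB", 0, "root"),
    Token(4, "cats", "NOUN", 3, "object"),
]

print(root(sentence).form)
print([t.form for t in children(sentence, root(sentence).index)])
```

Storing only a head index per word is enough to recover the whole tree, which is why this kind of flat, one-word-per-row layout is convenient for comparing the output of different parsers.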

For the 2006 CoNLL, the 18 parsing algorithms were tested using treebanks for 12 different languages. In linguistics, a treebank is a previously parsed body of text with the syntactic or semantic sentence structure annotated. So, the idea is to use some existing treebanks (produced by hand) to train the parsers, and then test them on some new treebanks, to see if they can produce the correct tree. In particular, the testing in 2006 involved what is called dependency grammar, which gives primacy to the verb as the structural center of a clause.
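Testing a parser against a hand-made treebank comes down to counting how many words were attached correctly. The sketch below assumes a simple attachment-score style of measurement (the proportion of words given the correct head, with or without the correct relation label); the function name and the tiny gold and predicted parses are hypothetical, and this is not the official evaluation script.

```python
def attachment_scores(gold, predicted):
    """Compare two parses of the same sentence.

    gold, predicted: lists of (head, relation) pairs, one pair per word.
    Returns (unlabeled, labeled) scores as proportions between 0 and 1.
    """
    if len(gold) != len(predicted):
        raise ValueError("parses must cover the same number of words")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    total = len(gold)
    # Unlabeled: only the head has to match.
    unlabeled = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # Labeled: both the head and the relation have to match.
    labeled = sum(g == p for g, p in zip(gold, predicted))
    return unlabeled / total, labeled / total


# Hypothetical gold-standard parse and parser output for a four-word sentence.
gold = [(2, "det"), (3, "subject"), (0, "root"), (3, "object")]
predicted = [(2, "det"), (3, "object"), (0, "root"), (2, "object")]

uas, las = attachment_scores(gold, predicted)
print(uas, las)
```

Here the parser finds the right head for three of the four words but the right head and label for only two of them, so the labeled score is the stricter of the two.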

The paper by Buchholz and Marsi (2006) discusses the treebanks for the 12 languages, describes how they were converted into the same dependency format, and provides an overview of the parsing approaches taken by the 18 participants. The methods are named after the first author of the associated paper.

I analyzed the results using a couple of phylogenetic networks. As usual, I used the Manhattan distance to evaluate the multivariate relationships in the data, and displayed this using a NeighborNet.
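For readers unfamiliar with it, the Manhattan distance between two methods is simply the sum of the absolute differences in their scores across the languages. The sketch below shows the calculation on made-up numbers; the method names and scores are hypothetical, not the CoNLL-X results, and the NeighborNet step itself (which takes the resulting distance matrix as input) is not shown.

```python
from itertools import combinations


def manhattan(a, b):
    """Sum of absolute differences between two equal-length score lists."""
    if len(a) != len(b):
        raise ValueError("score lists must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical percentage scores for three methods on three languages.
scores = {
    "method_a": [80.0, 70.0, 60.0],
    "method_b": [78.0, 69.0, 57.0],
    "method_c": [60.0, 50.0, 40.0],
}

# Pairwise distances between every pair of methods.
distances = {
    (m, n): manhattan(scores[m], scores[n])
    for m, n in combinations(scores, 2)
}

for pair, d in distances.items():
    print(pair, d)
```

In this toy example method_a and method_b would sit close together in a network, while method_c would be placed far from both. To compare languages rather than methods, the same calculation is applied to the columns instead of the rows.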

The first graph shows the relationships among the different parsing methods. Methods near each other in the network have a similar parsing success, while methods further apart are progressively more different from each other.

The methods form a simple gradient of increasing average success, from top-left to bottom-right. This means that the methods do not vary much in their success from language to language — if they are successful at parsing one language then they are successful on the other languages as well, and if not then not.

Perhaps this is not unexpected. However, the two most successful methods, by McDonald and Nivre, have quite different approaches to parsing — they differ on nine of the ten characteristics listed by Buchholz and Marsi (2006). Their very similar success is therefore noteworthy — there is apparently more than one way of skinning this particular cat.

The second graph shows the relationships among the different languages used. Languages near each other in the network have a similar parsing success, while languages further apart are progressively more different from each other.

The languages also form a simple gradient of increasing average success, from top-right to bottom-left. The average success at parsing Japanese was 86% (range 65-92%) and the average success at parsing Turkish was 56% (range 38-66%). This does not necessarily mean that Japanese is generally easier to parse than Arabic, Slovene and Turkish, because the datasets themselves varied considerably in the type of text contained in their treebanks. Nevertheless, Arabic, Slovene and Turkish are all "morphologically rich" languages, and parsing them is expected to be hard. It is interesting to note that Dutch is different from the other Germanic languages (Danish, German and Swedish), and Spanish is different from Portuguese.

The practical task for the 2014 conference will be Grammatical Error Correction, which was also the task for 2013. The participating systems will be given short English texts written by non-native speakers of English, and they will be evaluated on their ability to detect the grammatical errors and provide corrected texts. English is an ideal language for this task, as it is often suggested that for every native speaker of English there are 4–5 non-native speakers, and therefore automated correction of text would be of enormous practical value. (Mandarin Chinese has more speakers in total, but most of these are native speakers.)