Cheap language detection using NLTK

Some months ago I had to deal with large amounts of textual data from an external source. One problem was that I wanted only the English elements but was getting tons of non-English ones, so I needed a quick way of getting rid of the non-English texts. A few days later, while in the shower, the idea came to me: use NLTK stopwords!

What I did was, for each language in NLTK, count the number of stopwords from that language in the given text. The nice thing about this is that it usually gives a pretty strong signal about the language of the text. Originally I used it only for English/non-English detection, but after a little more work I made it report which language it detected. I needed a quick hack for my issue, so this code is not rigorously tested, but I figure it's still interesting. Without further ado, here's the code:

Nimrod:
1. Thanks for the link!
2. I did a quick evaluation on my dataset, saw that it was reasonable, and left it at that. For the English/non-English bit it had a few false negatives, but mostly for very short texts; for longer texts it had very few errors. As I wrote, the code is not too rigorously tested :)

Short texts are the hardest. A naive Bayes character n-gram model works relatively well if the text is proper English and not full of names or internet abbreviations like 'lol' or 'omg'. It's very easy to code, but there are also ready-made libraries for this. See for instance LingPipe's tutorial at http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html
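For the curious, the character n-gram idea mentioned above really is easy to code. Here is a toy sketch in Python (not LingPipe's implementation): naive Bayes over character trigram counts with add-one smoothing, trained on two illustrative one-sentence corpora of my own.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    text = f"  {text.lower()}  "  # pad so word edges produce n-grams too
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramLanguageID:
    """Naive Bayes over character n-gram counts, with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}  # language -> Counter of n-grams seen in training
        self.totals = {}  # language -> total n-gram count

    def train(self, language, text):
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(language, Counter()).update(grams)
        self.totals[language] = self.totals.get(language, 0) + len(grams)

    def classify(self, text):
        vocab = len(set().union(*self.counts.values()))
        best, best_score = None, float("-inf")
        for lang, counts in self.counts.items():
            total = self.totals[lang]
            # Sum of log P(gram | lang) with add-one smoothing.
            score = sum(
                math.log((counts[g] + 1) / (total + vocab))
                for g in char_ngrams(text, self.n)
            )
            if score > best_score:
                best, best_score = lang, score
        return best

clf = NgramLanguageID()
clf.train("english", "the quick brown fox jumps over the lazy dog and runs away")
clf.train("spanish", "el rápido zorro marrón salta sobre el perro perezoso y huye")
print(clf.classify("the dog runs over the fox"))
```

With corpora this tiny the scores are noisy, so treat it strictly as a demonstration of the mechanics; a usable model needs a few kilobytes of training text per language.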