Friday, October 21, 2011

Language detection with Google's Compact Language Detector

Google's Chrome browser has a useful translate feature: it detects the language of the page you're visiting and, if it differs from your local language, offers to translate it.

Wonderfully, Google has open-sourced most of Chrome's source code, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content. It looks like CLD was extracted from the language detection library used in Google's toolbar.

I also added a basic initial Python binding (one method!), and ported the small C++ unit test (verifying detection of known strings for 64 different languages) to Python (it passes!).

So detecting language is now very simple from Python:

import cld
topLanguageName = cld.detect(bytes)[0]

The detect method returns a tuple, including the language name and code (such as RUSSIAN, ru), an isReliable boolean (True if CLD is quite sure of itself), the number of actual text bytes processed, and then details for each of the top languages (up to 3) that were identified.
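For example (a minimal sketch; the (name, code, percent, score) ordering of the entries in the details list is an assumption based on the description above, and utf8Bytes is assumed to hold clean UTF-8):

import cld

name, code, isReliable, textBytesFound, details = cld.detect(utf8Bytes)
print('%s (%s), reliable=%s, bytes=%d' % (name, code, isReliable, textBytesFound))
for detailName, detailCode, percent, score in details:
    # percent is the "percent likelihood"; score is CLD's normalized score
    print('  %s (%s): %d%% (score %s)' % (detailName, detailCode, percent, score))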

You must provide it clean (interchange-valid) UTF-8, so any encoding issues must be sorted out beforehand.
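One simple way to do that (just a sketch; the helper below is illustrative and not part of the binding) is to decode using the declared or guessed encoding, drop anything invalid, and re-encode as UTF-8:

def toCleanUTF8(raw, declaredEncoding='utf-8'):
    # Decode with the declared (or guessed) encoding, replacing invalid byte
    # sequences, then strip the replacement chars and re-encode as UTF-8.
    text = raw.decode(declaredEncoding, 'replace')
    return text.replace(u'\ufffd', u' ').encode('utf-8')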

You can also optionally provide hints to the detect method, including the declared encoding and language (for example, from an HTTP header or an embedded META http-equiv tag in the HTML), as well as the domain name suffix (so the top level domain suffix es would boost the chances for detecting Spanish). CLD uses these hints to boost the priors for certain languages; a quick usage sketch follows below. There is this fun comment in the code in front of the tables holding the per-language prior boosts:

Generated by dsites 2008.07.07 from 10% of Base

How I wish I too could build tables off of 10% of Base!
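Back to the hints, here is a rough sketch of passing them (the keyword names below are illustrative, so check the README for the exact signature; htmlBytes is assumed to hold the page's UTF-8 bytes):

name, code, isReliable, textBytesFound, details = cld.detect(
    htmlBytes,
    isPlainText=False,          # input is HTML, so markup and entities get stripped
    hintTopLevelDomain='es',    # a .es domain boosts the prior for Spanish
    hintLanguageCode='es')      # declared language, e.g. from a META http-equiv tag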

The code itself looks very cool and I suspect (but haven't formally verified!) it's quite accurate. I only understand bits and pieces about how it works; you can read some details here and here.

It's also not clear just how many languages it can detect; I see there are 161 "base" languages plus 44 "extended" languages, but then I see many test cases (102 out of 166!) commented out. This was likely done to reduce the size of the ngram tables; possibly Google could provide the full original set of tables for users wanting to spend more RAM in exchange for detecting the long tail.

This port is all still very new, and I extracted CLD quickly, so likely there are some problems still to work out, but the fact that it passes the Python unit test is encouraging. The README.txt has some more details.

@Mike: regarding the missing encoding implementation in the Python binding: I provide a class Encoding with all the defined encoding integers as class constants. Check regenerate-encoding-table.sh and cld_encodings.h, maybe that’s something for you too.

I was not aware that Chrome had a language identifier built into it; I had always assumed that it was done via queries to Google's Language Identification AJAX API. I have been developing my own approach to open-web language identification; it is available at https://github.com/saffsd/langid.py and is based on my research that will be presented at IJCNLP2011 (http://www.ijcnlp2011.org/). I will compare my system to CLD when I can find some time to do so!

I just wanted to add that my paper on cross-domain feature selection for language identification has been published. It is available at http://www.ijcnlp2011.org/proceeding/IJCNLP2011-MAIN/pdf/IJCNLP-2011062.pdf

I ran langid.py on 18 of the Europarl languages and it performs very well! 99.20% (17856 / 18000) vs the best (Java language-detection Google code) at 99.26% (17866 / 18000). Impressive! Especially considering how small the overall model is (and I love how it's just packed into a big Python string!).

da (95.4%) and sl (97.3%) are the two most challenging languages.

Also, this brings the "majority rules" (across all 4 detectors) accuracy up to 99.73% (17952 / 18000), which is awesome (it means langid.py is drawing on relatively independent "signals" compared to the others).
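For reference, a small sketch of that majority vote (the detector wrappers here are hypothetical callables that each return a language code for a given text):

from collections import Counter

def majorityVote(text, detectors):
    # Each detector is a callable returning a language code for the text.
    votes = Counter(detect(text) for detect in detectors)
    code, count = votes.most_common(1)[0]
    # With no strict majority, fall back to the first detector's answer
    # (just one possible tie-break).
    return code if count > len(detectors) // 2 else detectors[0](text)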

Hi! I have problems with building and installing this module on Windows 7 (I needed to install gcc with MinGW, but even after that there were still many errors while compiling) and on Linux Ubuntu (a console full of errors like "./ceval.h:125: error: expected constructor, destructor, or type conversion before '(' token"). Any suggestions on what else I need to build your code?

I was able to build on Windows, using the checked in build.win.cmd (I'm using Visual Studio 8), but I don't have the older Visual Studio installed to compile the Python bindings.

On Linux (Fedora 13) I compiled fine with build.sh, using gcc 4.4.4, and then built the Python bindings using setup.py.

Hi, I'm currently trying to follow the steps to build the Python bindings for the CLD under Windows 7. I (successfully) built the library using 'build.win.cmd', but when I try to run the Python setup, I run into an error I don't know what to do with:

Glad to hear that langid.py is working well for you. I'm continuing to develop it, trying to get it to work well with even shorter strings in even more languages.

@Anonymous RE: TextCat, I compared my tool langid.py extensively to TextCat in my recently published paper; a copy is available at http://www.ijcnlp2011.org/proceeding/IJCNLP2011-MAIN/pdf/IJCNLP-2011062.pdf . Our findings were quite straightforward: the performance of TextCat really starts to fall off as more languages and more domains are considered.

I've been experimenting with Chrome CLD and langid.py, analysing 31,160 tweets containing "S4C" (the name of Wales' Welsh language TV channel - though not all references to S4C will have been to the channel). Most of the tweets are in English or Welsh: according to Chrome CLD there were 16,339 in English, 9,464 in Welsh. langid.py made it 18,219 English, 10,303 Welsh. Chrome CLD left 4,138 as "unknown". 8,981 were categorised as Welsh by both of them. Chrome CLD left 1,108 of langid.py's Welsh categorised tweets as unknown. You might be interested to see how the percentage agreement between Chrome CLD and langid.py varied by length of the tweet in this chart: http://dl.dropbox.com/u/15813120/Chrome_CLD_v_langid.py_Saes.png

(I know tweets are only meant to be 140 chars and my chart shows tweets longer than that. I reckon that must be because of encoding problems which I must have failed to cope with somewhere).

Thanks for sharing! The comparison is cool. Let me check that I understand 'agreement' correctly: it basically means that the two systems produced the same output for a given message?

So if, for a given message, langid.py output 'en' and cld output 'UNKNOWN', then this would be considered disagreement, correct? My guess (not based on evidence!) is that cld will tend to output 'UNKNOWN' more often for shorter messages, and that this may account for some of the difference. I would be curious to see a comparison of messages where neither system labels the message 'unknown'. Also, both systems provide a measure of confidence, so you could also consider the correlation between confidence and accuracy.
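Something like this is what I have in mind (a rough sketch; cldLabels and langidLabels are assumed to be parallel lists of language codes, one per tweet):

def agreementExcludingUnknown(cldLabels, langidLabels):
    # Keep only the messages where neither system answered 'UNKNOWN'.
    pairs = [(a, b) for a, b in zip(cldLabels, langidLabels)
             if a != 'UNKNOWN' and b != 'UNKNOWN']
    if not pairs:
        return 0.0
    agree = sum(1 for a, b in pairs if a == b)
    return float(agree) / len(pairs)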

On the message length issue, I believe Twitter allows for 140 Unicode codepoints. UTF-8 uses at most 4 bytes per valid codepoint (the original design allowed up to 6), which gives a theoretical upper bound of 560 bytes.

Sorry for taking so long to reply! Yes, "agreement" means the same language was detected.

I didn't get any 'unknowns' from langid.py. Excluding the 4,138 'unknowns' produced by Chrome CLD gave me this: https://dl.dropbox.com/u/15813120/no_unknowns_Chrome_CLD_v_langid.py_Saes.png i.e. much higher proportions in agreement, with the proportion dropping off when the tweet gets shorter than 70 characters. Such short tweets are less common though, as shown in this density plot: https://dl.dropbox.com/u/15813120/density_no_unknowns_Chrome_CLD_v_langid.py_Saes.png

Thank you for providing a Python wrapper for the CLD. I compiled the latest version with MinGW and it works. One thing however: You write in the README that you made no changes to the original Chromium source but at least encodings/compact_lang_det/compact_lang_det.h differs.

Aha! You are right David; I actually did modify compact_lang_det.{h,cc}, and compact_lang_det_impl.{h,cc}. These files provide the "entry points" to the core CLD library... and my changes were minor: I removed a few entry points that were likely backwards-compatible layers for within Chrome (not important here), and I also opened up control over previously hardwired functionality for removing weak matches and picking the summary language.

Also, cld_encodings.h is new (I copied this from the PHP port), and it just provides mappings from the encoding constants to their string names...

Actually, 78 and 22 are the "percent likelihood" values for the matches, and the number after that is called "normalized_score" in the code. I'm really not sure exactly how to interpret these numbers, except to say that higher numbers mean stronger matches...

The net number of bytes matched is returned at the top (i.e., not per matched language); in your case it's 264.

It's great to find your tool online! I was trying to run it to test; however, in the new package I am missing the 'bindings' dir. I see it in the sources, though. How do I get the bindings dir without actually checking out the source code?

I've just downloaded and compiled/installed from the .tar.gz source file and the 'bindings' dir is actually missing (as mentioned by Liolik):
http://code.google.com/p/chromium-compact-language-detector/downloads/detail?name=compact-language-detector-0.1.tar.gz&can=2&q=

Hi Mike, I am new to Python; how do I install the cld module? I don't know how to install new modules to Python. I tried looking at many blogs; most say to use the setup.py file with the command: python setup.py install. But I don't find such a thing in the cld files. Please help me; I need language detection even if it's not the Pythonic way, so suggestions other than Python are also more than welcome. Thanks a lot, Manoj

I have installed the module and bindings, apparently correctly. I have a chromium_compact_language_detector-0.1.1-py2.7.egg-info and cld.so in /usr/lib/python2.7/site-packages, which I take as rather positive signs!

However, in the Python interpreter, when I try to import cld I get

Working on the issue on SE here: http://stackoverflow.com/questions/13473861/encoding-issues-with-cld
Likely not related to CLD, but rather my rookie Python skills in getting the required string across to CLD?

One thing you might be able to help me with (BTW, your name will appear in the credits on the final map I am making) is whether my CLD parameters are accurate. At the moment, I am seeing some strings being misinterpreted.

That first value is in fact the predicted language, and it's clearly wrong in your example! Urgh. I confirmed I get the same results on Linux ...

Maybe try passing pickSummaryLanguage=False? There is some "smarts" in CLD that sometimes picks a weaker matching language as the choice, as happened in this example. When I pass pickSummaryLanguage=False, it gets the correct answer for your example ... and I think when I ran my benchmark I passed False.
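In other words, something like this (just a sketch; the other arguments are left at their defaults):

name, code, isReliable, textBytesFound, details = cld.detect(
    utf8Bytes, pickSummaryLanguage=False)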

The 26/8 are what CLD calls the "normalized score"; I'm not sure how it's computed ...

I also find that confusing ... somehow the "smarts" behind pickSummaryLanguage=True can take a worse-scoring language and pick it ... I'm not sure why :) And I think in my original tests I saw worse accuracy when I passed True.

Thanks Mike. I'm getting better results than before with psl=t, but I'm thinking that I should perhaps add logic to manually loop through the list of results and pick the one with the highest score, as opposed to relying on the first result returned. Where do you think would be the best place to query this further?
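Something along these lines (just a sketch, assuming each detail entry is a (name, code, percent, score) tuple as described earlier):

name, code, isReliable, textBytesFound, details = cld.detect(utf8Bytes)
if details:
    # Ignore CLD's summary choice and take the strongest-scoring match directly.
    bestName, bestCode, bestPercent, bestScore = max(details, key=lambda d: d[3])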

You're right, sticking with False. One thing I am pondering is why I'm not getting a lot of Greek results coming through. If I go to Google Translate, I can type a basic English sentence, grab the Greek, and paste it into CLD, and it returns the wrong language pretty much every time. Any ideas here?

I did a comparison of CLD with our own language detection API web service (WhatLanguage.net). You can read the full comparison of WhatLanguage.net, CLD, Tika, language-detection and langid.py at http://www.whatlanguage.net/en/api/accuracy_language_detection

I don't know anything about that port ... and I don't know of any other Java ports.

However, there is a new java port of langid.py (https://github.com/saffsd/langid.py) at https://github.com/carrotsearch/langid-java ... I did some simple tests and it gets the same results as langid.py and is quite a bit faster.

There is also the language detection library https://code.google.com/p/language-detection/

Hi Mike, just wanted to say thanks! Definitely appreciate all the initiative you took on this project. We're finding the CLD Python binding really useful.

We did come across a strange problem where CLD fails to detect the correct language when an '&' character is part of the text. I wondered if anybody else had encountered this (or maybe I've missed something obvious).

My guess is it's trying to parse an escaped HTML character. Are you passing isPlainText=False (the default)? If so, can you open an issue with the CLD2 project? A lone & should be untouched ... but maybe CLD2 is doing something silly like throwing out the rest of the input after the &.
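For plain text you could try something like this (a sketch; the lone & should then survive untouched):

name, code, isReliable, textBytesFound, details = cld.detect(
    'Fish & chips, anyone?', isPlainText=True)  # don't treat '&' as the start of an HTML entity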
