Introduction

Quite some time ago, I published an article on how to detect the encoding of a given text. In this article, I describe the next step on the long way to text classification: the detection of the language.

The given solution is based on n-gram and word occurrence comparison.

It is suitable for any language that uses words (which is actually not true for all languages).

Depending on the model and the length of the input text, the accuracy is between 70% (short Norwegian, Swedish, and Danish texts classified by the "all" model) and 99.8% (using the "default" model).

Background

The language detection of a written text is probably one of the most basic tasks in natural language processing (NLP). For any language-dependent processing of an unknown text, the first thing to know is which language the text is written in. Luckily, it is one of the easier challenges that NLP has to offer. The approach I have chosen to implement is widely known and pretty straightforward. The idea is that every language has a unique set of character (co-)occurrences.

The first step is to collect those statistics for all languages that should be detectable. This is not as easy as it may sound at first. The problem is to collect a large set of test data (plain text) that contains only one language and that is not domain specific. (Newspaper articles alone may lack the use of the word “I” and direct speech. Using Shakespeare plays will not be the best approach to detect contemporary texts. Medical articles tend to contain too many domain-specific terms which are not even language specific (major, minor, arteria, etc.).) And as if that were not hard enough, the texts should not be copyrighted. (I am not sure if this is a true requirement. Are the results of statistical analysis of copyrighted texts also copyrighted?) I have chosen to use Wikipedia as my primary source. I had to do some filtering to "clean" the sources of the ever-present English phrases that occur in almost any article – no matter what language it is written in (I actually used Babel itself to detect the English phrases). The cleanup was in no way perfect. Wikipedia contains a lot of proper names (e.g., band names) that often contain a “the” or an “and”. This is why those words occur in many languages even if they are not part of the language. This is not necessarily a disadvantage, because Anglicisms are widespread across many languages. I created three statistics for each language:

Character set

Some languages have a very specific character set (e.g., Chinese, Japanese, and Russian); for others, certain characters give a good hint of which languages come into question (e.g., the German umlauts).
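To give an impression of how the character set alone can narrow down the candidate languages, here is a minimal sketch. It is illustrative only and not Babel's CharsetTable; the Unicode ranges and the Hint method are my own simplification.

    // Illustrative only - not Babel's CharsetTable. It merely checks a few
    // Unicode ranges to hint at the script a text is written in.
    using System;
    using System.Linq;

    static class CharsetHint
    {
        public static string Hint(string text)
        {
            int cjk      = text.Count(c => c >= '\u4E00' && c <= '\u9FFF');
            int cyrillic = text.Count(c => c >= '\u0400' && c <= '\u04FF');
            int umlauts  = text.Count(c => "äöüÄÖÜß".IndexOf(c) >= 0);

            if (cjk > 0)      return "CJK script (e.g., Chinese or Japanese)";
            if (cyrillic > 0) return "Cyrillic script (e.g., Russian)";
            if (umlauts > 0)  return "Latin script with umlauts (e.g., German)";
            return "plain Latin script - no hint";
        }
    }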

N-Grams

After tokenizing the text into words (where applicable), the occurrences of all 1-, 2-, and 3-grams are counted. Some n-grams are very language specific (e.g., the "TH" in English).
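The following sketch shows what counting the 1-, 2-, and 3-grams of tokenized words can look like. It is a simplified illustration, not the code Babel uses to build its n-gram tables.

    // Simplified n-gram counting - not the actual Babel implementation.
    using System.Collections.Generic;

    static class NGramCounter
    {
        public static Dictionary<string, int> Count(IEnumerable<string> words)
        {
            var counts = new Dictionary<string, int>();
            foreach (string word in words)
            {
                for (int n = 1; n <= 3; n++)                    // 1-, 2- and 3-grams
                {
                    for (int i = 0; i + n <= word.Length; i++)
                    {
                        string gram = word.Substring(i, n);
                        int seen;
                        counts.TryGetValue(gram, out seen);     // 0 if not yet counted
                        counts[gram] = seen + 1;
                    }
                }
            }
            return counts;
        }
    }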

Word list

The last source of disambiguation is the actual words used. Some languages (like Portuguese and Spanish) are almost identical in the characters used and also in the occurrences of the specific n-grams. Still, different words are used with different frequencies.

A set of statistics is called a model. I have created some subsets of the "all" model that meet my needs best (see table below). The "common" model contains the 10 most spoken languages in the world. The “small” and “default” models are based on my usage scenarios. If you are from another part of the world, your preferences might be different, so please take no offence at my choice of which languages are contained in which model.

All statistics are ordered and ranked by their occurrences. Within the demo application, all models can be studied in detail. Classification of an unknown text is straightforward: the text is tokenized and the three statistics tables are generated. These tables are compared to all tables in the model, and a distance is calculated. The language whose tables have the smallest distance to the unknown text is most likely the language of the text.
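As an illustration of such a distance between ranked tables, the sketch below uses a simple "out-of-place" measure: for every entry of the text's table, it adds how far away that entry's rank is in the model's table (or a fixed penalty if the entry is missing). This is only one possible distance and not necessarily the one Babel actually computes.

    // Illustrative rank-based distance - not necessarily the distance Babel uses.
    using System;
    using System.Collections.Generic;

    static class RankDistance
    {
        // Both dictionaries map an n-gram (or word) to its rank; 0 = most frequent.
        public static int Distance(IDictionary<string, int> textRanks,
                                   IDictionary<string, int> modelRanks,
                                   int missingPenalty)
        {
            int distance = 0;
            foreach (KeyValuePair<string, int> entry in textRanks)
            {
                int modelRank;
                distance += modelRanks.TryGetValue(entry.Key, out modelRank)
                    ? Math.Abs(entry.Value - modelRank)   // how far "out of place" it is
                    : missingPenalty;                     // not present in the model at all
            }
            return distance;
        }
    }

The model with the smallest total distance would then be reported as the most likely language.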

Sample model

Using the code

Quick word about the code

Babel is part of a larger project. I wanted the Babel assembly to work stand-alone. Since some of the classes used were originally scattered across many assemblies, I used the define "_DIALOGUEMASTER" to indicate whether to use the DialogueMaster™ assemblies or to implement a (probably simpler) version in place.

Every important DialogueMaster™ class is remotable. The clients need only one assembly containing all the interface definitions. This is why Babel uses so many interfaces, even where they might seem to bloat the code at first glance. Additionally, DialogueMaster™ offers lots of PerformanceCounters. I chose to omit them for easier usage of the assembly (no installation and no admin rights needed).

What I actually want to say is: the code is not as readable and clean as it could (and should) be.

Classify text

Usage of the code is straightforward. First, you must choose a model (or create your own). The ClassifyText method returns an ICategoryList, which is a list of ICategory (name-score pair) items sorted descending by their score.
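The snippet below sketches what a call could look like. ClassifyText, ICategoryList, and ICategory are taken from the text above; the classifier and model-loading names (BabelClassifier, ModelLoader, "default.model") are assumptions for illustration and may differ from the actual API in the download.

    // Usage sketch only: ClassifyText, ICategoryList and ICategory come from the
    // article; BabelClassifier, ModelLoader and "default.model" are hypothetical
    // names and may differ from the real assembly.
    IModel model = ModelLoader.Load("default.model");        // hypothetical model loading
    ITextClassifier classifier = new BabelClassifier(model); // hypothetical classifier type

    ICategoryList result = classifier.ClassifyText("Ceci n'est pas une pipe.");

    // The categories are sorted descending by score; the first entry is the best guess.
    foreach (ICategory category in result)
        Console.WriteLine("{0}: {1:F3}", category.Name, category.Score);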


About the Author

Carsten started programming in Basic and Assembler back in the '80s when he got his first C64. After switching to an x86-based system, he started programming in Pascal and C. He started Windows programming with the arrival of Windows 3.0. After working for various internet companies and developing linguistic text analysis and classification software for 25hours communications, he is now working as a contractor.

Comments and Discussions

Firstly, I would like to thank you for sharing your code with us. I am trying to write a new text detection utility along similar lines to yours, and would thus like to use the file all.model to build the models.

Can you please share more information about its format and how I can edit it or use it in my application?
I really appreciate your help.

I don't know C# and have never compiled C# code. Any hint on how to compile this program? Is there a readme file? Is there a way to call DialogueMaster from the command line? I need to identify the language for a large set of lines and can't work with a GUI. Thanks.

Hi,
I tried to test Program3 to add a new language, but in Babel.cs I get an array-out-of-bounds error at the line maxScore = charsetVoters[0].Score;
I think my problem occurs at (in Babel.cs):
double score = catTable.CharsetTable.CharsetComparisonScore(tblTest.CharsetTable, threshold);
The variable score is always NaN.
Am I doing something wrong in the txt learn data files?

The models are bundled in binary form with the source. They are generated from the corresponding Wikipedia corpora (which are a few GB too large to be uploaded to CodeProject).
You are free to add or replace any language as you like.

Thank you so much. Is there any way for you to look at that code and explain the relationship between the results of the n-gram and word tables? I saw your code, but I don't understand well what you have done. I found the classify text function, but I can't understand it well. And I am sorry again that my English is not good :(

Hi, I have one question about the way you calculate the result (in the n-gram tables and in the word tables). Do you add the result of the n-gram tables to the result of the word tables, or something like that? How do you compare the values of the text's n-gram table with the values of the word table? Thank you for your time.

It is something like that; I actually don't have the code right here at hand...
I remember that the algorithm was a combination of the three tables for charset, n-grams, and words. But I do not remember how the weighting was actually done. I'd need a look at the code for that...

I want to understand your code and use the training data for the Persian and Arabic languages. How can I find this data? I have some data, but it is not enough. I need help! At least explain more about the code. Thank you! Can you send me the data in Persian and Arabic?

Thanks a lot. I have another question: I just want to know how to use your program and your model for only 2 or 3 languages. How do I omit the other languages from the list and from the source? I'm new to C#.

Hi, how can I change your model for one of the languages? I want to change your model for Persian, and I need your help. Please help me! Actually, I want to add some words to your model; these words exist in Persian but not in your model. Sorry, I cannot write English well!

The model is actually not meant to be altered manually. Even though you can simply load a model, change, delete, or add entries to the tables, and store it back, you will most probably not be able to figure out reasonable values for the relative number of occurrences.

Please also be aware that the models only contain the top n words of each language. So most probably the words you are missing are not among the top occurrences within my learn data (Wikipedia). Normally you should not need to add words to the corpus, since the words are mostly used to disambiguate between very similar languages (Swedish, Norwegian, and Danish).

What you could do is create your own language corpus that is closer to the kind of texts you try to detect. Even though the Wikipedia corpus is huge, it is not necessarily the best for every use case. E.g., if you mostly detect colloquial text, you might get better results with your own model.

When I run your tool in .NET 2008, I have a problem with the library DialogueMaster.Classification.

Can you add this library?

The problem is:

"
Error 1 The type or namespace name 'Classification' does not exist in the namespace 'DialogueMaster' (are you missing an assembly reference?) MyPath\Babel.src\DialogueMaster.Babel\DialogueMaster.cs 14 25 DialogueMaster.Babel

I'm Ronen, and I'm using your CodeProject application to detect the language of written texts. I have a few questions; it would be great if you could answer them. Do you have an email address that I can contact you at?