N-Gram Language Guessing with NGramJ

NGramJ is a Java library for language recognition. It uses language profiles (counts of character sequences) to guess what language some arbitrary text is written in. In this post I’ll briefly show you how to use it from the command line and from the Java API. I’ll also show you how to generate a new language profile. I’m writing this down so I don’t have to figure it out again.
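To make “counts of character sequences” concrete, here is a minimal sketch in plain Java. It is not NGramJ’s code; the class name `NGramCounter` and the method `count` are made up for illustration, and it works on Java `char` values rather than full Unicode code points.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of NGramJ.
class NGramCounter {
    // Count every character sequence of length 1..maxLen found in the text.
    static Map<String, Integer> count(String text, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxLen; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                // merge() inserts 1 for a new n-gram or adds 1 to the existing count
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count("banana", 3);
        System.out.println(counts.get("an"));  // prints 2
        System.out.println(counts.get("ana")); // prints 2
    }
}
```

A language profile is, roughly, a table like this built from a large amount of text in one language. Guessing a language then comes down to comparing the table for the unknown text against each stored profile.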

Running

You can get a feel for how well NGramJ works by trying it on the command line. For example:

Using

Something I like about the API for this program: it’s simple. It is also thread-safe. You can instantiate a static reference for the library and call it from any thread later. Here is some code adapted from the Flaptor Utils library.

Now that you know how to use the library for language guessing, I’ll show you how to add a new language.

Adding a New Language

NGramJ comes with several language profiles, but you may need to generate one yourself. A great source of language data is Wikipedia. I’ve written about extracting plain text from Wikipedia here before. Today, I needed to generate a profile for Indonesian. The first step is to create a raw language profile. You can do this with the cngram.jar file:

This will create an id.ngp file. I noticed this file is huge: several hundred kilobytes, compared to the 30K of the other language profiles. The next step is to clean the language profile up. To do this, I created a short Sleep script that reads in the id.ngp file and cuts any 3-gram and 4-gram sequences that occur fewer than 20K times. I chose 20K because it leaves me with a file that is about 30K. If you have less data, you’ll want to adjust this number downwards. The other language profiles use 1000 as a cut-off. If the cut-off scales with the amount of training text, then 114MB × 1000 / 20,000 is about 6MB, which leads me to believe they were trained on roughly 6MB of text data versus my 114MB of Indonesian text.
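If you don’t have Sleep handy, the same clean-up can be sketched in Java. This is not the script I used. It assumes each n-gram line in the .ngp file is an n-gram followed by whitespace and a count, and that the file is UTF-8; check your own file before trusting either assumption. Lines that don’t match that shape are passed through untouched. The class name `ProfilePruner` and its arguments are made up for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of NGramJ.
class ProfilePruner {
    // Drop 3-gram and 4-gram lines whose count is below minCount; keep everything else.
    static List<String> prune(List<String> lines, long minCount) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            // Assumed line format: "<ngram> <count>". Anything else passes through.
            if (parts.length != 2 || !parts[1].matches("\\d{1,18}")) {
                kept.add(line);
                continue;
            }
            String gram = parts[0];
            int length = gram.codePointCount(0, gram.length());
            long count = Long.parseLong(parts[1]);
            boolean rare = (length == 3 || length == 4) && count < minCount;
            if (!rare) {
                kept.add(line);
            }
        }
        return kept;
    }

    // Usage: java ProfilePruner <input.ngp> <output.ngp> <minCount>
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        List<String> pruned = prune(lines, Long.parseLong(args[2]));
        Files.write(Paths.get(args[1]), pruned, StandardCharsets.UTF_8);
    }
}
```

Writing to a separate output file keeps the raw profile around, so you can try a different cut-off without regenerating it.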

The last step is to copy id.ngp into src/de/spieleck/app/cngram/ and edit src/de/spieleck/app/cngram/profiles.lst to contain the id resource. Type ant in the top-level directory of the NGramJ source code to rebuild cngram.jar and then you’re ready to test:

They’re probably pretty similar. I use NGramJ because it’s Java and AtD is written mostly in Sleep/Java. In my own tests, I’ve found that with only one or two words it’s a coin toss what language it will pick. Once you get beyond a full sentence that says something, it has always been correct. I have to add the “says something” caveat because I’ve found it will mischaracterize a list of names, addresses, and phone numbers.

Three very interesting articles (along with “Generating a Plain Text Corpus from Wikipedia” and “All about Language Model”). Did you add Arabic or Cyrillic languages, or Japanese or Chinese?

I created an Arabic ngp file and it works fine, but strangely Persian doesn’t work (the score is always 0). Likewise, Russian doesn’t work (with the ru.ngp file provided in NGramJ). I suppose I’m making a mistake with the cngram API.