Open Source proof-reading tool

Note

This page is no longer up to date. Since version 2.9, LanguageTool uses language-detector for language detection.

LanguageTool uses the Apache Tika library to detect the source language. Since LanguageTool supports more languages than are currently available in Tika, we've created additional language profiles and add them to Tika at runtime.

Tika doesn't support Belarusian, Catalan, Esperanto, Galician, Romanian, Slovak, Slovenian, Ukrainian, Malayalam, and Khmer. Language profiles have been added for all but the last two.

Adding a new language

To add a new language, you need to create an n-gram profile file. An n-gram profile is a collection of frequency counts for the letter trigrams that occur in natural-language text. Here are the steps to create a new language profile:

Get a corpus in the source language, preferably with as little formatting and as few foreign words as possible. I used Wikipedia article dumps and stripped out the punctuation and XML. The result is available at http://www.languagetool.org/download/language-training-data/. Although Wikipedia contains a lot of proper nouns and foreign words, I've found that language detection works fairly well in practice. An additional trick (admittedly a bit of a cheat): after you've created an initial .ngp profile for a language, comb through your corpus file and use the LanguageIdentifier class to remove obviously foreign lines. For example, non-English Wikipedias contain many sentences that are entirely in English.
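
To make the corpus cleanup and the idea of "frequency counts for letter trigrams" concrete, here is a minimal Python sketch. It is not the code used to build the published training data, and it does not write Tika's .ngp format. The function names, the regular expressions, and the choice to pad each line with a space on both sides are assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumption: anything between "<" and ">" is markup to be discarded.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: a "letter" is whatever Python's \w accepts, minus digits and
# underscores. Scripts that rely on combining marks would need another rule.
NON_LETTER_RE = re.compile(r"[\W\d_]+")


def clean_line(line: str) -> str:
    """Strip tags and punctuation, collapse whitespace, and lower-case."""
    without_tags = TAG_RE.sub(" ", line)
    letters_only = NON_LETTER_RE.sub(" ", without_tags)
    return letters_only.strip().lower()


def trigram_counts(lines) -> Counter:
    """Count letter trigrams over an iterable of raw corpus lines."""
    counts = Counter()
    for line in lines:
        text = clean_line(line)
        if not text:
            continue
        # Pad with spaces so trigrams at the line edges are counted too.
        padded = f" {text} "
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


if __name__ == "__main__":
    sample = ["<p>Hello, world!</p>", "Hello again."]
    for trigram, count in trigram_counts(sample).most_common(5):
        print(repr(trigram), count)
```

The resulting counts would still have to be written out in whatever profile format the detection library expects. That step is left out here because it depends on the library version.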