
Here are some ideas for possible LanguageTool features. If you have questions, comments, or would like to suggest another task, please send a message to the forum. The ideas are roughly sorted by the amount of work they require, and within each category (short, medium, long) the tasks we consider more important come first.

Write a test that makes sure our Morfologik-based speller works as well as Hunspell (both for error detection and for making correction suggestions)

Enable using multiple rule sets

Enable using multiple rule sets (different XML files) to implement custom sets for different style guides. For example, one could implement a Chicago Manual of Style checker that is run only on scientific papers: the user would activate the standard English rules along with the custom set.

A more advanced version would enable loading the rules from a web-based repository of custom rules.

Build a packager that takes a glossary in CSV or tab-separated format and outputs a bitext XML rule (and also allows using the CSV directly): read the contents, tokenize and analyze them with LT, and build the rule in memory (optionally writing it to disk or using it directly)

Add target words from the glossary to the list of words ignored by the spell-checking rule

A more advanced version could also support TBX (the XML terminology format)

Two interfaces: a UI for answering questions (drop a file, answer several questions, and get the file back), and the command line

This feature would be nicer if accompanied by the ability to load multiple rule sets
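The read-split-build pipeline above might look roughly like the minimal sketch below. Note that the class name and the XML element names (`srcpattern`, `trgpattern`) are placeholders, not the actual bitext rule format, and that a real packager would first tokenize and analyze the terms with LT:

```java
import java.util.*;

public class GlossaryPackager {

    // Build one placeholder rule from a source term and its required translation.
    static String buildRule(String source, String target) {
        return "<rule>\n"
             + "  <srcpattern><token>" + escape(source) + "</token></srcpattern>\n"
             + "  <trgpattern><token>" + escape(target) + "</token></trgpattern>\n"
             + "</rule>";
    }

    // Parse tab-separated glossary lines into rules, skipping malformed lines.
    static List<String> packageGlossary(List<String> lines) {
        List<String> rules = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length >= 2) {
                rules.add(buildRule(cols[0].trim(), cols[1].trim()));
            }
        }
        return rules;
    }

    // Escape characters that are special in XML content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        List<String> rules = packageGlossary(List.of("memory\tSpeicher", "broken line"));
        System.out.println(rules.get(0));
    }
}
```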

Requires: Java

Contact: Marcin Miłkowski

Add TMX and XLIFF readers for bitext checking

New classes for reading and writing TMX (possibly based on JAXB, using XSLT to convert TMX to the current format) are needed to add real-world support for bitext checking.

Difficulty: moderately easy, requires just a bit of tweaking. The only difficulty is supporting internal tags in XLIFF. Probably two kinds of XLIFF output would be needed: with corrections applied directly to the target text, and with corrections as comments.
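As a proof of concept for the XSLT route, the JDK's built-in `javax.xml.transform` API can already apply a stylesheet to a TMX document. The stylesheet below is a toy that merely extracts source/target segments as tab-separated text, not the real conversion to the bitext format:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TmxConvert {
    // Toy stylesheet: for each translation unit, print "source<TAB>target".
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='tu'>"
      + "<xsl:value-of select='tuv[1]/seg'/><xsl:text>\t</xsl:text>"
      + "<xsl:value-of select='tuv[2]/seg'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Apply the stylesheet to an XML string and return the text result.
    static String transform(String xml) throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String tmx = "<tmx><body><tu>"
                   + "<tuv><seg>Hello</seg></tuv><tuv><seg>Hallo</seg></tuv>"
                   + "</tu></body></tmx>";
        System.out.println(transform(tmx));
    }
}
```

A real converter would map the full TMX structure (headers, multiple `tuv` languages, inline tags) into the bitext input format instead of plain text.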

Requires: Java

Contact: Marcin Miłkowski

Port usable English rules from XML copy editor

Rules from XML Copy Editor for English could be interesting for LT (see its source in the xmlcopyeditor-1.0.9.5/src/rulesets directory).

Extend the neural network rules to whole sentences

We use neural networks to detect confusion between words. So far, this only considers 2 words of context in each direction. Extend this so that the complete sentence is considered, to better detect errors that depend on long-distance dependencies.

Consider a seq2seq approach

Improve spell checker suggestions

Suggestions for misspellings (or, in fact, any suggestions) should consider the context to the left and right of the word, as After the Deadline does (see its section "The Spelling Corrector"). A lot of data is needed for that, but the existing ngram data can be used. However, this ngram data needs to be combined with similarity data: words that are more similar to the original word should be preferred. Furthermore, we have data about which suggestions users actually select, and this should also be taken into account. Just like in After the Deadline, a neural net could learn how each factor should be weighted to get the best result.
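The weighting idea can be sketched as follows. The bigram counts, user-selection counts, and the fixed weights are all made up for illustration; in the real feature, the weights would be learned (e.g., by a neural net) and the counts would come from the ngram data and usage logs:

```java
import java.util.*;

public class SuggestionRanker {
    // Invented data: ngram counts for "candidate + following word",
    // and counts of how often users picked each candidate in the past.
    static final Map<String, Integer> BIGRAM_COUNT = Map.of(
        "their house", 900, "there house", 40);
    static final Map<String, Integer> USER_PICKS = Map.of("their", 70, "there", 30);

    // Levenshtein edit distance, used as the similarity factor.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // Linear combination of similarity, context fit, and user-selection prior.
    static double score(String misspelling, String candidate, String nextWord) {
        double similarity = 1.0 / (1 + distance(misspelling, candidate));
        double context = Math.log(1 + BIGRAM_COUNT.getOrDefault(candidate + " " + nextWord, 0));
        double picks = Math.log(1 + USER_PICKS.getOrDefault(candidate, 0));
        return 2.0 * similarity + 1.0 * context + 0.5 * picks;
    }

    public static void main(String[] args) {
        // For "thier house", context and user data should favor "their" over "there".
        System.out.println(score("thier", "their", "house") > score("thier", "there", "house"));
    }
}
```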

Improve Performance

LanguageTool already uses several threads, but still doesn't always use 100% of the CPU even when busy. This could be optimized for the desktop use case.

Enhance quality and speed of English chunking

The English chunker is the slowest part of the English processing pipeline. This may be because we need to run its POS tagger first. Check whether the statistical POS tagger could be replaced simply by adding more hand-crafted disambiguation rules (with, perhaps, simple frequency voting for tokens that remain ambiguous).
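The "simple frequency voting" fallback could start from a sketch like this one; the tag-frequency table is invented for illustration and would in practice be derived from a tagged corpus:

```java
import java.util.*;

public class FrequencyVote {
    // Invented tag frequencies per word form (would come from a corpus).
    static final Map<String, Map<String, Integer>> TAG_FREQ = Map.of(
        "run", Map.of("VB", 620, "NN", 210),
        "flies", Map.of("VBZ", 330, "NNS", 180));

    // If disambiguation rules left several readings, keep the most frequent tag.
    static String vote(String word, List<String> remainingTags) {
        Map<String, Integer> freq = TAG_FREQ.getOrDefault(word, Map.of());
        return remainingTags.stream()
                .max(Comparator.comparingInt((String t) -> freq.getOrDefault(t, 0)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        System.out.println(vote("flies", List.of("VBZ", "NNS")));
    }
}
```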

In some rules, we'd need to specify nouns that are hyponyms of "human being" to find incorrect uses of phrases. Create a lexicon extracted from the English WordNet (as a finite-state machine) and add appropriate syntactic sugar to XML rules so that it would be usable (e.g., an attribute is-a="person").

JEdit plugin: similar to spell-checking plugins available for it already

Scribus plugin

QuarkXpress, Adobe Pagemaker integration

Bitext check for placeables / numbers

In translated text, formatting elements or numbers should be left alone or converted to other units. Create a rule that (a) aligns the formatting elements / numbers on a token level and (b) marks up the elements that were not successfully aligned. Use Numbertext to align figures translated into text (e.g., 1 translated as "one").

There is similar Java code in the translation QA tool CheckMate. It is also available under the LGPL, so one could reuse the code (or call the Okapi library).
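Steps (a) and (b) can be approximated on the token level with a sketch like the following; it has no unit conversion and no Numbertext integration yet, and all names are illustrative:

```java
import java.util.*;
import java.util.regex.*;

public class NumberAlign {
    static final Pattern NUM = Pattern.compile("\\d+(?:[.,]\\d+)?");

    // Extract all number tokens from a sentence.
    static List<String> numbers(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = NUM.matcher(text);
        while (m.find()) result.add(m.group());
        return result;
    }

    // Return the numbers in the source that have no counterpart in the target;
    // a rule would mark these up as potential translation errors.
    static List<String> unaligned(String source, String target) {
        List<String> trg = new ArrayList<>(numbers(target));
        List<String> missing = new ArrayList<>();
        for (String n : numbers(source)) {
            if (!trg.remove(n)) missing.add(n);
        }
        return missing;
    }

    public static void main(String[] args) {
        // All numbers aligned: nothing to flag.
        System.out.println(unaligned("He paid 50 euros on May 3.", "Er zahlte 50 Euro am 3. Mai."));
        // "12" was translated as a word ("zwölf") and is flagged; Numbertext
        // integration would be needed to align such cases.
        System.out.println(unaligned("Chapter 7 has 12 pages.", "Kapitel 7 hat zwölf Seiten."));
    }
}
```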

Contact: Marcin Miłkowski

Create an automatic extractor of rules based on a transformation-based learning algorithm

Add an abstract interface that would allow using our finite-state dictionaries to classify words for purposes other than POS tagging (e.g., as a valency lexicon).

Add a new small dictionary with valency information for Russian (covering participles and adjectives).

Prepare 10 rules that use valency checks for Russian.

Contact: Yakov Reztsov

Take an orphaned language and make it state of the art

Several languages no longer have active maintainers; we are looking for new maintainers and co-maintainers.

The task consists of adding rules for a language, whether AI-based, statistics-based, XML-based, or written in Java.

Requires: a very good command of the given language, knowledge of AI, XML, or Java

Long-term ideas

Train a statistical tagger for English

The standard statistical taggers obscure mistakes in the text because source corpora are tagged with the intended tags, not the ones that were actually used. We might try to train an HMM tagger (such as Jitar), which for English should get us around 98% accuracy. But for this, we need to change the tagging of the Brown corpus: change the original "correct" tags to the ones found by the LT tagger dictionary (if there is a mismatch). For example, change places where "have" is tagged as "VBZ" to "VB".

This requires a smart way to automatically retag the source corpus (to retain the disambiguation) and possibly some level of manual disambiguation as well. For this reason, this (otherwise easy) task may be time-consuming.
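The retagging step could start from a sketch like the one below, where the dictionary lookup is faked with a small map and the conflict resolution is deliberately naive (it just picks the dictionary's first reading); the "smart way" mentioned above would replace that fallback with context-aware disambiguation:

```java
import java.util.*;

public class Retag {
    // Stand-in for the LT tagger dictionary: tags it knows for each form.
    static final Map<String, List<String>> DICT = Map.of(
        "have", List.of("VB", "VBP"),
        "walks", List.of("VBZ", "NNS"));

    // If the corpus tag conflicts with the dictionary, fall back to the
    // dictionary reading; "have"/VBZ becomes "have"/VB, as in the example above.
    static String retag(String word, String corpusTag) {
        List<String> allowed = DICT.getOrDefault(word, List.of());
        if (allowed.isEmpty() || allowed.contains(corpusTag)) {
            return corpusTag;  // keep: no conflict, or the word is unknown
        }
        return allowed.get(0);  // naive pick; real retagging must be smarter
    }

    public static void main(String[] args) {
        System.out.println(retag("have", "VBZ"));
    }
}
```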

If the method works, it may be applied to other languages as well to help with disambiguation. To test whether it does, we need to make sure that no rule in English (or any other language) is broken by the new tagger.

Contact: Marcin Miłkowski

Integrate a dependency parser

Some English rules would really benefit from deeper parsing and, in particular, from recognizing the subject of the sentence (this would be useful for agreement rules). MaltParser seems fast and nice, but its model is based on the Penn Treebank, which is not completely free. So a new model would need to be trained, for example on the Copenhagen Treebank.

Alternatively, maybe some shallow parsing would suffice, for example to identify NPs and VPs, as well as the heads of expressions (and their grammatical features).