Idea

There are at least two use cases for a large corpus with word/phrase occurrence counts:

If a rule has more than one suggestion, offer only the more common one (or at least sort the suggestions by occurrence count).

Find errors by looking for potential homophones (like there/their): assume an error if the variant in the text occurs less often than its alternative, according to the occurrence counts. The list of homophones still needs to be created manually; it is kept in the file confusion_sets.txt.
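The core of that check can be sketched in a few lines. This is not LanguageTool's actual implementation (the real rule works on the probability of the whole context, see ConfusionProbabilityRule below); all counts here are made up for illustration:

```java
import java.util.Map;

// Minimal sketch of the homophone check: given occurrence counts for the
// phrase as written and for the phrase with the homophone swapped in,
// assume an error if the written variant is the rarer one.
public class HomophoneCheck {

    // Returns the suggested word, or null if the text looks fine.
    static String check(String usedPhrase, String altPhrase, String altWord,
                        Map<String, Long> phraseCounts) {
        long used = phraseCounts.getOrDefault(usedPhrase, 0L);
        long alt = phraseCounts.getOrDefault(altPhrase, 0L);
        return alt > used ? altWord : null;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = Map.of(
                "is over their", 10L,    // hypothetical counts
                "is over there", 5000L);
        // "their" is much rarer in this context -> suggest "there"
        System.out.println(check("is over their", "is over there", "there", counts));
    }
}
```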

Note that the Google ngram data includes the year each count comes from. These per-year counts (maybe ignoring years earlier than 18xx) can be aggregated to make the data take much less space. The aggregated data for English is this large (as of October 2016):

size on disk: 15GB (as Lucene indexes)

1-grams: 10,254,020 (this is also the number of documents in the Lucene index, as each ngram is stored as one document)

tokens: 230,158,826,104
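The aggregation step itself is simple. A sketch, assuming raw lines of roughly the form "ngram TAB year TAB match_count TAB ..." (the exact columns differ between v1 and v2 of the data); the input lines in the example are made up:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sum the per-year match counts for each ngram, optionally skipping
// old years, so the year column can be dropped entirely.
public class NgramAggregator {

    static Map<String, Long> aggregate(Iterable<String> rawLines, int minYear) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String line : rawLines) {
            String[] cols = line.split("\t");
            int year = Integer.parseInt(cols[1]);
            if (year < minYear) {
                continue;  // e.g. ignore years earlier than 18xx
            }
            long matchCount = Long.parseLong(cols[2]);
            totals.merge(cols[0], matchCount, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        java.util.List<String> lines = java.util.List.of(
                "the arrow keys\t1750\t3\t2",    // skipped (too old)
                "the arrow keys\t1990\t120\t80",
                "the arrow keys\t2005\t340\t200");
        System.out.println(aggregate(lines, 1800));  // {the arrow keys=460}
    }
}
```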

Source Code and Evaluation

The rule that uses ngrams for error detection is org.languagetool.rules.en.EnglishConfusionProbabilityRule, which extends org.languagetool.rules.ConfusionProbabilityRule. It gets activated when you start LT with the --languagemodel option.

org.languagetool.dev.bigdata.ConfusionRuleEvaluator can be used to evaluate the rule for a pair of words that needs to be specified in the source code.
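The idea behind such an evaluation can be sketched as follows (this is not ConfusionRuleEvaluator's actual code): take sentences known to use one word of the pair correctly, create artificial errors by swapping in the other word, run the detector on both sets, and compute precision and recall. The toy detector and counts are assumptions for illustration:

```java
import java.util.Map;

// Simplified precision/recall evaluation for one confusion pair.
public class PairEvaluation {

    // Toy detector: flag "used" if "alt" is more common in the counts.
    static boolean flags(String used, String alt, Map<String, Long> counts) {
        return counts.getOrDefault(alt, 0L) > counts.getOrDefault(used, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> counts = Map.of("their car", 900L, "there car", 4L);
        int truePositives = 0, falsePositives = 0, falseNegatives = 0;
        // Correct usage: should NOT be flagged.
        if (flags("their car", "there car", counts)) falsePositives++;
        // Artificial error: SHOULD be flagged.
        if (flags("there car", "their car", counts)) truePositives++; else falseNegatives++;
        double precision = (double) truePositives / (truePositives + falsePositives);
        double recall = (double) truePositives / (truePositives + falseNegatives);
        System.out.printf("precision=%.2f recall=%.2f%n", precision, recall);
    }
}
```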

Use org.languagetool.dev.NGramStats for a simple way to look up occurrence counts.

Index Creation

org.languagetool.dev.FrequencyIndexCreator can be used to build a Lucene index with ngrams and their occurrence count. Building a Lucene index for all the 2grams took 10 hours on my machine (no optimization, only 1 thread). The good thing is that the index doesn't have to fit into RAM but will still be fast if it does (see below).

berkeleylm and Morfologik (which we use anyway) would be good alternatives, but both require all data to be loaded into RAM.

Amazon Elastic MapReduce (EMR)

The 3gram data set is so huge that FrequencyIndexCreator takes a few days to create all Lucene indexes. It's faster if it gets aggregated data, i.e. just the ngram and its occurrence count (no page count, no year). This data can be created with Hive on Amazon Elastic MapReduce (EMR), see the attached Hive script. Aggregating the 3grams (v1) took about one hour using 15 medium instances; the result was about 3 GB of data in 235 *.gz files, at a cost of about US$6. Indexing the files locally with Lucene 4.9 took one hour, merging the 235 indexes took another hour.

Note: as of June 2015, Amazon's public data set only offers version 1 of the ngram data from 2009, not version 2 from 2012. See below for the differences. To get v2 data into Amazon S3 without downloading it locally, you can use the script below on an EC2 instance. The problem is that the v2 data is larger than the v1 data, e.g. by a factor of about 5 for the 3grams. Together with its gzip compression, this increases the time needed for the aggregation, so that for the English v2 3grams alone the cost would be about US$50 (a very rough estimate). That doesn't yet include the EC2 capacity one would need to split the *.gz files into smaller chunks, as Hadoop cannot split gzipped files. If some files are large (e.g. >100 GB for the punctuation file in v2) and others are small, Hadoop will not work efficiently.

TODO: Uploads of files >5GB may fail when --expected-size is not included, see https://github.com/aws/aws-cli/issues/1184: client error (InvalidArgument) occurred when calling the UploadPart operation: Part number must be an integer between 1 and 10000, inclusive

With the observed download speed of 50MB/s (m4.large instance) from the Google ngram site, downloading alone will take ~7 hours, but this could be parallelized across several EC2 instances.
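As a back-of-the-envelope check: at 50 MB/s, ~7 hours corresponds to roughly 1.2–1.3 TB of data. The total size used below is only implied by the numbers above, not confirmed:

```java
// Quick sanity check of the download estimate: hours needed to fetch a
// given amount of data at a given speed (decimal units, 1 TB = 1e6 MB).
public class DownloadEstimate {

    static double hoursToDownload(double terabytes, double mbPerSecond) {
        double megabytes = terabytes * 1_000_000;
        return megabytes / mbPerSecond / 3600;
    }

    public static void main(String[] args) {
        System.out.printf("%.1f hours%n", hoursToDownload(1.26, 50));  // ~7 hours
    }
}
```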

Google BigQuery

Google offers a different pricing model than Amazon: you only pay for the data that your queries run on, not for loading data into BigQuery. They have a trigram data set (publicdata:samples.trigrams), but I'm not sure if it's v1 or v2 of the ngram data or something else. Also, it's only trigrams, nothing else. See here for some more information. A quick test with the ngram "suite of empty" suggests that it's different from both v1 and v2 (v1 has 117 occurrences, v2 has 260, BigQuery 120); this is confirmed here.

DigitalOcean

DigitalOcean offers much cheaper hourly rates than Amazon, but has less infrastructure (e.g. nothing like S3).

Index Lookup

org.languagetool.languagemodel.LuceneLanguageModelTest can be used for performance tests of ngram lookups. Some tests on my machine for average lookup time per ngram:

no data in OS cache, index on external USB 2.5" disk: 17,626µs ≈ 18ms

no data in OS cache, index on internal SSD: 739µs (< 1ms)

all data in OS cache (by running the test more than once - the old index was 3.7GB and thus fit into my RAM): 163µs (< 1ms)
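The measuring technique can be sketched without Lucene. LuceneLanguageModelTest queries a real index; here a HashMap stands in for it, so the numbers are not comparable, only the approach is shown:

```java
import java.util.HashMap;
import java.util.Map;

// Measure the average lookup time per ngram by timing many lookups
// and dividing by their number.
public class LookupBenchmark {

    static double avgLookupMicros(Map<String, Long> index, String[] keys, int rounds) {
        long start = System.nanoTime();
        long dummy = 0;
        for (int i = 0; i < rounds; i++) {
            for (String key : keys) {
                dummy += index.getOrDefault(key, 0L);  // the lookup being timed
            }
        }
        long elapsed = System.nanoTime() - start;
        if (dummy == -1) System.out.println();  // keep the loop from being optimized away
        return elapsed / 1000.0 / ((long) rounds * keys.length);
    }

    public static void main(String[] args) {
        Map<String, Long> index = new HashMap<>();
        index.put("the arrow keys", 460L);
        String[] keys = {"the arrow keys", "the aero keys"};
        System.out.printf("%.3f µs per lookup%n", avgLookupMicros(index, keys, 100_000));
    }
}
```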

Automatically Create Rules

It's not clear yet whether this is useful, but the data can be used to generate XML rules automatically. Use org.languagetool.dev.HomophoneOccurrenceDumper to write out data about the homophones and their context. Then run org.languagetool.dev.RuleCreator to create XML rules for the cases where it seems most useful, i.e. where there's a context that's common for a word but very uncommon for its homophone. For example, this might create a rule that matches "the aero keys" and suggests "the arrow keys".
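The kind of XML such a generator might emit can be sketched as follows. The format here is a simplified stand-in for LanguageTool's grammar.xml pattern rules, not RuleCreator's exact output:

```java
// Build a simple pattern rule that matches a sequence of tokens and
// offers a suggestion, e.g. for "the aero keys" -> "arrow".
public class RuleSketch {

    static String buildRule(String[] contextTokens, String suggestedWord) {
        StringBuilder sb = new StringBuilder();
        sb.append("<rule>\n  <pattern>\n");
        for (String token : contextTokens) {
            sb.append("    <token>").append(token).append("</token>\n");
        }
        sb.append("  </pattern>\n");
        sb.append("  <message>Did you mean <suggestion>").append(suggestedWord)
          .append("</suggestion>?</message>\n");
        sb.append("</rule>");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildRule(new String[] {"the", "aero", "keys"}, "arrow"));
    }
}
```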

Future Development

A language model should also work properly with ngrams that don't occur in the data, i.e. it should estimate a probability for them. Currently (November 2015) we just assume an occurrence of 1 for cases with occurrence = 0.
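The current workaround amounts to this; a proper language model would use real smoothing (e.g. Kneser-Ney) instead of a flat floor of 1:

```java
// Treat an unseen ngram as if it had been seen once, so probabilities
// never become zero.
public class CountFloor {

    static double probability(long ngramCount, long totalTokens) {
        long count = Math.max(ngramCount, 1);  // occurrence 0 -> assume 1
        return (double) count / totalTokens;
    }

    public static void main(String[] args) {
        System.out.println(probability(0, 1_000_000));   // small, but not 0.0
        System.out.println(probability(500, 1_000_000));
    }
}
```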

A test with BerkeleyLM and its German Google Books data shows inconsistent results: sometimes the Lucene-based ngram index is better (e.g. for mir/mit, vielen/fielen), sometimes BerkeleyLM is better (e.g. Mediation/Meditation). Using BerkeleyRawLanguageModel and debugging also shows that the BerkeleyLM data seems to have smaller occurrence counts, even though the Lucene-based data is filtered to contain only data from 1910 and later. The reasons are unclear - maybe the pre-built BerkeleyLM language model is based on v1 of the Google Books data? (December 2015, ConfusionRuleEvaluator.java)

The Google ngram data is based on books, so it should be of high quality, but it probably doesn't cover less formal language as well. We could try to extend it with data from commoncrawl.org. Another advantage would be that we would no longer be limited to pruned data (the Google ngram data is pruned at 40 occurrences).

Google offers up to 5gram data, but which distance one needs to consider depends on the confusion pair. There are also dependencies that span more than 5 tokens (e.g. "spiegelt …. wider" in German). Both might be handled better by training a neural network, e.g. on commoncrawl.org data (maybe relevant: Sequence to Sequence Learning with Neural Networks).