Raw results

The following matrix shows the LanguageIdentifierPlugin processing time in ms for several versions. Each patched version is configured to be comparable with the nutch-0.7-dev version, i.e. it uses 1-grams, 2-grams, 3-grams and 4-grams to perform the analysis. The Data Size row is the number of bytes of each file used to perform the identification. The other rows represent the following configurations:

Graphical representation

Graphical representation (log axis)

Discussion

The NUTCH-60-050607.patch improves performance by between 18.27% and 70.29%, with an average of 24.33%.

Profiling the code confirms what SamiSiren suggested in a previous message: "the most timeconsuming part of language identifier is splitting the text into ngrams and propably the biggest optimization could be done there". It shows that splitting the text takes around 25% of the whole process.
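
To make the splitting step concrete, here is a minimal sketch of extracting 1-grams to 4-grams from a text with a sliding window. It is not the plugin's actual code: the class name NGramSketch and the method countNGrams are hypothetical, and the sketch ignores any word-boundary handling or normalization the real implementation may perform.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not taken from the Nutch source.
public class NGramSketch {

    /**
     * Counts every character n-gram of length minLen..maxLen in the text,
     * using one sliding-window pass per n-gram length.
     */
    public static Map<String, Integer> countNGrams(String text, int minLen, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minLen; n <= maxLen; n++) {
            // The window [i, i + n) must fit entirely inside the text.
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same n-gram lengths as the benchmark configuration: 1 to 4.
        Map<String, Integer> counts = countNGrams("banana", 1, 4);
        counts.forEach((gram, count) -> System.out.println(gram + " " + count));
    }
}
```

In this naive form, a text of length L produces close to 4 × L substrings and map updates for the four n-gram lengths, which gives an idea of why the splitting step can account for a noticeable share of the total processing time.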

Precision

Data set

These precision benchmarks were produced by testing the LanguageIdentifierPlugin on the first Data Size bytes from a set of: