How should I interpret confidence scores returned from language detection?

The value returned by com.basistech.rosette.dm.LanguageDetection.DetectionResult.getConfidence() in the SDK, or the "confidence" JSON value in the Rosette API, is not a traditional confidence measure but a score that compares the strength of results on the same document. One way to quantify the certainty of the top result is to take the ratio of its score to that of the second result. On a strong result, this ratio will be roughly an order of magnitude (about 10).

For example, assume the results of a particular analysis are:

| name | iso | score |
| --- | --- | --- |
| English | eng | 0.04165 |
| French | fra | 0.00448 |
| Norwegian | nor | 0.00380 |
| Romanian | ron | 0.00365 |
| Dutch | nld | 0.00362 |

The ratio of the first score (English) to the second score (French) is 0.04165 / 0.00448 ≈ 9.30. At nearly an order of magnitude, this suggests high confidence in the result.

Contrast this with an analysis that returns the following results:

| name | iso | score |
| --- | --- | --- |
| Spanish | spa | 0.02812 |
| Catalan | cat | 0.01524 |
| Portuguese | por | 0.01389 |
| Romanian | ron | 0.01032 |
| French | fra | 0.00934 |

The ratio of the first score (Spanish) to the second score (Catalan) is 0.02812 / 0.01524 ≈ 1.85. This suggests a much less confident result.
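The ratio check used in the two examples above can be sketched in a few lines of Python. This is only an illustration: the function name `top_two_ratio` is hypothetical and not part of the SDK or the Rosette API, and the scores are copied by hand from the tables rather than fetched from a live response.

```python
def top_two_ratio(scores):
    """Return the ratio of the highest score to the second-highest score.

    Hypothetical helper for illustration; not part of any Rosette library.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to compare")
    ranked = sorted(scores, reverse=True)
    if ranked[1] <= 0:
        raise ValueError("second-highest score must be positive")
    return ranked[0] / ranked[1]


# Scores copied from the two example tables (rounded as displayed).
english_case = [0.04165, 0.00448, 0.00380, 0.00365, 0.00362]
spanish_case = [0.02812, 0.01524, 0.01389, 0.01032, 0.00934]

print(f"English vs. French:  {top_two_ratio(english_case):.2f}")  # about 9.30
print(f"Spanish vs. Catalan: {top_two_ratio(spanish_case):.2f}")  # about 1.85
```

A ratio near 10 corresponds to the strong result described above, while a ratio below 2 indicates that the top two candidates are close. Any threshold you choose between those values is an application-level decision.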

Normalized confidence scores

For some applications, you may want normalized confidence scores that can be compared across analyses. As a best practice for this case, we recommend summing the scores of the first 5 results and dividing each score by that sum, so that the five normalized values add up to 1.0. The resulting figures can be interpreted as confidence scores. Note that this is still not the same as a probability or likelihood: the likelihood that a result is correct is often much higher than its confidence score.

Take the first example above. If you normalize the results to add up to 1.0, you get:

| name | iso | score | confidence |
| --- | --- | --- | --- |
| English | eng | 0.04165 | 0.7281 |
| French | fra | 0.00448 | 0.0784 |
| Norwegian | nor | 0.00380 | 0.0664 |
| Romanian | ron | 0.00365 | 0.0638 |
| Dutch | nld | 0.00362 | 0.0633 |

While 72.8% doesn't sound all that confident, it's much higher than the 7.8% confidence of the second highest result. In practice, this indicates a near certainty that the answer is correct.

Contrast the second example:

| name | iso | score | confidence |
| --- | --- | --- | --- |
| Spanish | spa | 0.02812 | 0.3657 |
| Catalan | cat | 0.01524 | 0.1981 |
| Portuguese | por | 0.01389 | 0.1806 |
| Romanian | ron | 0.01032 | 0.1341 |
| French | fra | 0.00934 | 0.1215 |

Here the confidence of the first result, 36.6%, accurately conveys that the result is far from certain.
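The normalization described in this section can be sketched as follows. The helper name `normalize_top_n` is hypothetical and not part of the SDK or the Rosette API, and the inputs are the rounded scores shown in the tables, so the computed values may differ from the tabulated confidences in the last decimal place.

```python
def normalize_top_n(scores, n=5):
    """Scale the n highest scores so that they sum to 1.0.

    Hypothetical helper for illustration; not part of any Rosette library.
    Returns the normalized values in descending order.
    """
    top = sorted(scores, reverse=True)[:n]
    total = sum(top)
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    return [score / total for score in top]


# Scores copied from the two example tables (rounded as displayed).
english_case = [0.04165, 0.00448, 0.00380, 0.00365, 0.00362]
spanish_case = [0.02812, 0.01524, 0.01389, 0.01032, 0.00934]

for label, scores in (("English", english_case), ("Spanish", spanish_case)):
    confidences = normalize_top_n(scores)
    print(label, [f"{c:.3f}" for c in confidences])
```

Only the top five results are kept before dividing, which matches the recommendation above. Normalizing over a different number of results would produce different confidence values that are not comparable with these.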