SYNOPSIS

DESCRIPTION

Text::Language::Guess guesses a document's language. Its implementation is simple: using Text::ExtractWords and Lingua::StopWords from CPAN, it counts, for each language supported by Lingua::StopWords, how many of that language's known stopwords the document contains.

Each word in the document that is recognized as a stopword of a particular language scores one point for that language.
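
The scoring rule described above can be sketched in a few lines of plain Perl. This is only an illustration under stated assumptions, not the module's actual code: the %stopwords table is a hypothetical, heavily abbreviated stand-in for the lists that Lingua::StopWords provides, score_string() is a made-up helper name, and a simple regular expression stands in for Text::ExtractWords.

```perl
use strict;
use warnings;

# Hypothetical, heavily abbreviated stopword lists for illustration only;
# the module itself is described as taking its lists from Lingua::StopWords.
my %stopwords = (
    en => { map { $_ => 1 } qw(the is and of a to in) },
    de => { map { $_ => 1 } qw(der die das und ist ein nicht) },
);

# Hypothetical helper: one point per word that is a stopword of a language.
sub score_string {
    my ($text) = @_;
    my %score = map { $_ => 0 } keys %stopwords;

    # Lowercase the text and split it into words with a simple regex
    # (an assumption; the module uses Text::ExtractWords for this step).
    for my $word (lc($text) =~ /(\w+)/g) {
        for my $lang (keys %stopwords) {
            $score{$lang}++ if $stopwords{$lang}{$word};
        }
    }
    return \%score;
}

my $scores = score_string("The cat is in the garden and the dog is not.");
printf "%s: %d\n", $_, $scores->{$_} for sort keys %$scores;
```

With these toy lists, the sample sentence scores seven points for English ("the" three times, "is" twice, "in" and "and" once each) and none for German.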

The language_guess() function takes a document as a parameter and returns the abbreviation of the language that it is most likely written in.

Supported Languages:

English (en)

French (fr)

Spanish (es)

Portuguese (pt)

Italian (it)

German (de)

Dutch (nl)

Swedish (sv)

Norwegian (no)

Danish (da)

Methods

new()

Initializes the guesser with all stopwords available for all supported languages. If new() has been called before, subsequent calls will return the same precomputed stoplist map, avoiding collecting all stopwords again (as long as the number of languages stays the same, see next paragraph).

You can limit the number of searched languages by specifying the languages parameter and passing it an array reference of the wanted languages:

# Only guess between English and German
my $guesser = Text::Language::Guess->new(languages => ['en', 'de']);

language_guess($textfile)

Reads in a text file, extracts all words, scores them using the stopword maps and returns a single two-letter string indicating the language the document is most likely written in.

language_guess_string($string)

Just like language_guess, but takes a string instead of a file name.
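
A minimal usage sketch, assuming Text::Language::Guess and its dependencies are installed and that the functions described here are called as methods on the object returned by new(). The sample sentence is arbitrary, and no particular result is claimed for it.

```perl
use strict;
use warnings;
use Text::Language::Guess;

# Restrict the guesser to English and German, as in the new() example.
my $guesser = Text::Language::Guess->new(languages => ['en', 'de']);

# Guess the language of an in-memory string rather than a file.
my $lang = $guesser->language_guess_string(
    "This is the text that we would like to have classified."
);

print defined $lang ? "Guessed language: $lang\n" : "No guess available\n";
```

For a document stored on disk, language_guess() would be called in the same way with a file name instead of a string.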

scores($textfile)

Like language_guess($textfile), but returns a reference to a hash that maps language strings (e.g. 'en') to their scores. The entry with the highest score is the most likely language.
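
Picking the most likely language out of such a hash might look like the sketch below. The $scores value is a hypothetical hash written by hand in the shape described above, not real output from scores(), and the alphabetical tie-break is an assumption made only to keep the result deterministic.

```perl
use strict;
use warnings;

# Hypothetical score hash, shaped like the one scores() is described as returning.
my $scores = { en => 12, de => 3, fr => 1 };

# Sort languages by descending score; break ties alphabetically.
my ($best) = sort { $scores->{$b} <=> $scores->{$a} or $a cmp $b } keys %$scores;

print "Most likely language: $best\n";
```

Keeping the whole hash rather than only the winner is useful when the top two scores are close and the caller wants to treat the guess as uncertain.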