
Abstract:

The invention is directed to systems and methods for automatically
identifying the language(s) contained in text. The system comprises two
language classifiers: one that classifies the text based on the letters
present, and a second that classifies the text based on the words
present. Each classifier produces a list of languages and a weight for
each language. Each classifier also computes an overall confidence
applied to the classifier as a whole. The results of the classifiers are
combined, incorporating the classifier confidences and language
weights. The combined results produce a list of languages and weights and
an overall confidence.

Claims:

1. A system for identifying the language of text comprising: A
Combination Classifier comprising a plurality of Pattern Classifiers
containing at least one Word Classifier and at least one Letter
Classifier; Identifying input text for language classification;
Presenting the input text to the Combination Classifier; Where the
Combination Classifier presents the input text to each of the Pattern
Classifiers; Where each of the Pattern Classifiers produces: a vector of
weights where each component of the vector is the weight associated with
a particular language; and a vector of variances where each component of
the vector is the variance of the weight associated with a particular
language; Where each Pattern Classifier is associated with a weight
wherein at least one weight is different from at least one other weight;
Where the Combination Classifier computes a combination weight vector
based on the weight vectors produced from the plurality of Pattern
Classifier weight vectors; Where the Combination Classifier computes a
combination weight variance vector based on the weight variance vectors
produced by the plurality of Pattern Classifier weight variance vectors;
and Where the Combination Classifier computes a rank ordered list of
languages to associate with the input text based on the combination
weight vector and the combination weight variance vector;

2. A method for Data Preparation comprising: Identifying a set of
training documents wherein each training document is associated with at
least one language; Preprocessing each training document comprising:
Case-folding the text of the document; Removing punctuation symbols from
the document; and Parsing the document according to a pattern where the
pattern is chosen from the group: words, letters, word pairs, or letter
pairs; Counting the number of occurrences of each pattern in all
documents associated with a particular language; Computing the frequency
of occurrence of each pattern in each language by dividing the count of
the pattern in a language by the total number of patterns matched to the
language across all documents associated with the language; Identifying a
list of common patterns by applying a threshold to the list of patterns
associated with each language; Processing each document as a sequential
list of patterns encountered and associating each pattern with a previous
and next pattern; Counting the number of occurrences of pairings of each
common pattern for each language with the previous or next pattern;
Examining each pair of languages by: Computing the union set of
common patterns between the languages; Computing the intersection set of
common patterns between the languages; Identifying the patterns that are
unique to each language; Identifying the patterns that are common to each
language; Examining each of the patterns common to each language by:
Identifying the number of patterns paired to the pattern under
examination associated with the first language in the language pair;
Counting the number of pattern pairs to the pattern from the first
language that are exclusive to the first language; Counting the number of
pattern pairs to the pattern from the first language that are common to
both languages; Computing a set of first weights of pattern pairs for the
first language by dividing the counts by the total number of pattern
pairs from the first language; Counting the number of pattern pairs to
the pattern from the second language that are exclusive to the second
language; Counting the number of pattern pairs to the pattern from the
second language that are common to both languages; Computing a set of
second weights of pattern pairs for the second language by dividing the
counts by the total number of pattern pairs from the second language;
Computing the variance of each of the first weights; Computing the
variance of each of the second weights; and Associating the pattern with
the first language, second language, neither, or both by comparing the
first weights and second weights using a geometrical region; and
Outputting a list of patterns associated with each language;

3. A system for identifying the language of text comprising: A
Combination Classifier comprising a plurality of Pattern Classifiers;
Identifying input text for language classification; Presenting the input
text to the Combination Classifier; Where the Combination Classifier
presents the input text to each of the Pattern Classifiers; Where each of
the Pattern Classifiers produces: a vector of weights where each
component of the vector is the weight associated with a particular
language; Where the Combination Classifier computes a combination weight
vector based on the weight vectors produced from the plurality of Pattern
Classifier weight vectors; and Where the Combination Classifier computes
a rank ordered list of languages to associate with the input text based
on the combination weight vector;

Description:

BACKGROUND

[0001] Computers are becoming readily available to people around the
world. As such, a growing number of people using computers speak a
language other than English.

[0002] In addition, there are a number of software programs that desire to
present a customized user experience based on the native language of the
person using the software. To facilitate this customization, software
programs may need to automatically identify the native language of a
user.

SUMMARY

[0003] The instant invention is directed to automatically identifying the
language of a text document. The system is presented text and is asked to
determine the language (or languages) contained in the text. The text may
be short containing only a few characters, or it may be long comprising
several pages.

[0004] Moreover, the text may contain a plurality of languages. In this
case, the system is asked to identify each region of the text that
contains a specific language.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] FIG. 1 is an illustration of the process for Data Preparation for
the Word Classifier.

[0006] FIG. 2 is an illustration of the process for Data Preparation for
the Letter Classifier.

[0007] FIG. 3 is an illustration of the process for Data Preparation for
the Pattern Classifier.

[0008] FIG. 4 is an illustration of the process for classifying text with
the Word Classifier.

[0009] FIG. 5 is an illustration of the process for classifying text with
the Letter Classifier.

[0010] FIG. 6 is an illustration of the process for classifying text with
the Pattern Classifier.

[0011] FIG. 7 is an illustration of the process for classifying text with
the Combination Classifier.

[0012] FIG. 8 is an illustration detailing the computation of the
frequency of patterns based on counts. The figure also shows the patterns
exclusive to each language and the patterns common to both.

[0013] FIG. 9 is an illustration showing results of counting each common
pattern in relation to its neighboring patterns.

[0014] FIG. 10 is an illustration of a simple threshold for determining
the association of a common pattern with either one language, both, or
neither.

[0015] FIG. 11 is an illustration of a more general geometry for
determining the association of a common pattern with either one language,
both, or neither.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

[0016] Text in a language may be broken into individual words. Each word is
comprised of one or more letters. One approach to language classification
is to examine the words of the text and compare these to a list of words
associated with the language.

[0017] To this end, a first step in building a text classifier is to
create a list of words associated with each language under consideration.
Many languages have large amounts of text available online. Downloading
text from the web for each language provides an initial source of text
for a language.

[0018] However, this method has the drawback that many web text files have
more than one language embedded in the document. For example, text from a
Chinese website may have English text embedded in the document.

[0019] This leads to a circular problem. In order to build a language
classifier, we need to identify a pure source of language text. However,
in order to get pure language text, we need a language classifier to
separate the languages in the text. We present a method for separating
the languages in such mixed text files even though we do not know
precisely how to separate the text initially.

[0020] Language Identification on Words

[0021] Data Preparation

[0022] A language classifier is often enhanced by compiling a list of
words associated with each particular language. This section details the
preparation phase for such data. This section assumes the existence of
some set of machine readable documents where each document is associated
with a principal language. These documents may have other language text
embedded within. Alternatively, some documents may be associated with one
language while the text is predominately or even entirely in another
language. The process described in this section is capable of determining
which words are associated with each language even when some of the input
documents have other languages, or even when documents are incorrectly
associated with one language but written entirely in another language.
Based on this input, the process produces lists of common words for each
language. These lists may be used to enhance the language classifiers
described in the next sections.

[0023] The text used here is often called training text. This text is used
to create or train language classifiers and is distinguished from input
text that is presented to a classifier for the purpose of determining the
underlying language of the text.

[0024] First, identify training documents that are associated with each
language. Our initial investigations lead us to believe that 100-1000
such documents are sufficient when there are at least 10 words in each
document. Shorter documents may be included in this set, but longer
documents are preferred. If only short documents are available, we
recommend 500-5000 documents.

[0025] Second, for each language, parse each document into a set of words.
Normalize each word by case-folding. Simple case-folding may be
implemented as making all characters lower case. However, in some
languages this process is ambiguous. Another method is to first make all
letters upper case, then make the result lower case. This addresses many
problems encountered when using Unicode to represent the characters. The
use of Unicode is highly recommended as Unicode supports a wide-variety
of language scripts.

[0026] Also part of this step is the removal of punctuation. Symbols such
as `.`, `;`, `!`, `@`, `#`, `$`, `%`, ` `, `*`, `(`,`)`, `{`,`}`,
`[`,`]`, `\`, `:`, `?`, `<`, `>`, `/`, `"`, `|`, `˜`, `+`,
`-` and `'` are a few of the symbols that may be removed from the text.
It should be appreciated that removal of punctuation may include other
symbols than those presented here, combination of symbols may be used
(where two or more symbols appear together), or some of the above symbols
may be removed. In the simplest case, removing punctuation symbols may
use no symbols at all in which case this part of the step is ignored.

[0027] Third, count the number of appearances of each normalized word.
Normalize this by dividing each count by the total number of words in
all documents for the particular language. The normalized value is the
frequency of the word in that language. The frequencies of all words in
a given language should sum to one.

[0028] Fourth, rank order the word list for each language from highest
frequency to lowest frequency. Specify a cutoff value to truncate the
word list. The cutoff value may be expressed as a word frequency, or it
may be a total number of words. Alternatively, all words may be used.
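
A minimal sketch of steps two through four in Python, assuming documents
are supplied as plain strings and that only the ASCII punctuation set is
stripped; the function and variable names are illustrative and not part
of the invention:

    import string
    from collections import Counter

    def normalize(word):
        """Case-fold as described above: upper case first, then lower
        case, which resolves some Unicode ambiguities."""
        return word.upper().lower()

    def word_frequencies(documents, cutoff=None):
        """Rank ordered word frequencies for one language's documents.

        `cutoff` optionally truncates the ranked list to the most
        frequent words; None keeps every word.
        """
        counts = Counter()
        for text in documents:
            # Remove punctuation symbols before parsing into words.
            text = text.translate(str.maketrans('', '', string.punctuation))
            counts.update(normalize(w) for w in text.split())
        total = sum(counts.values())
        # The frequencies over all words for this language sum to one.
        return [(w, c / total) for w, c in counts.most_common(cutoff)]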

[0029] Fifth, for each language, record the pairing of each rank ordered
word (words surviving the cutoff) with the previous and next normalized
words in each document. If the next or previous normalized word is not a
rank ordered word, skip the occurrence. If the next normalized word is a
rank ordered word, count the number of times this word combination
appears. The pairing data for language A is represented as P_A(w)
while the pairing data for language B is represented as P_B(w). This
notation means that given a particular word w, P_A(w) is the list of
rank ordered words that are paired with w. This may also include the
frequency count of the pairing as well.

[0030] Sixth, for each pair of languages, create the union set of the rank
ordered word lists for both the languages. The union set is the set of
unique words that appear in either set. Thus, if one set has words A and
B, and the other set has words B and C, the union set is A, B, and C.
Note that B appears only once in the union set because the union set is a
set of unique words.

[0031] Let R_A and R_B be the rank ordered word lists of the two
languages. The union set is expressed as U_AB = R_A ∪ R_B.

[0032] Seventh, identify the intersection of words between the languages.
The intersection is the set of unique words that appear in both
languages. Thus, if one set has words A and B, and the other set has
words B and C, the intersection set is B.

[0033] Let R_A and R_B be the rank ordered word lists of the two
languages. The intersection set is expressed as I_AB = R_A ∩ R_B.

[0034] Eighth, identify the words that are exclusive to each language in
the language pair. These are the words that appear on the rank ordered
word list for one language but not the other. The exclusive word list for
each language may be computed from the previous results. The exclusive
words for language A are E_A = R_A - I_AB. The exclusive words
for language B are E_B = R_B - I_AB.
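
The sixth through eighth steps map directly onto set operations. A small
Python illustration with hypothetical rank ordered word lists:

    # Rank ordered word lists for a hypothetical language pair.
    R_A = {"the", "dog", "chat", "le"}
    R_B = {"le", "la", "chat", "chien"}

    U_AB = R_A | R_B   # union: unique words appearing on either list
    I_AB = R_A & R_B   # intersection: words common to both languages
    E_A = R_A - I_AB   # words exclusive to language A
    E_B = R_B - I_AB   # words exclusive to language B

    assert I_AB == {"le", "chat"}
    assert E_A == {"the", "dog"} and E_B == {"la", "chien"}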

[0035] Ninth, examine each of the rank ordered words that are common to
the two languages. This is the intersection I_AB. For each rank
ordered word w, examine the list of word pairings for each language
(P_A(w) and P_B(w)). For each paired word in P_A(w),
determine if the word is exclusive to A, exclusive to B, or is on both
lists. Mathematically, let P_Ai(w) be the ith rank ordered word
paired with w for language A. Since the sets E_A, E_B, and
I_AB are mutually exclusive (I_AB ∩ E_A = ∅,
I_AB ∩ E_B = ∅, and E_B ∩ E_A = ∅), then exactly
one of three choices must be true: P_Ai(w) ∈ E_A,
P_Ai(w) ∈ E_B, or P_Ai(w) ∈ I_AB.

[0036] For a given rank ordered word w, we count the number of paired
words that are exclusive to A (P_Ai(w) ∈ E_A), the
number of paired words that are exclusive to B
(P_Ai(w) ∈ E_B), and the number of paired words that
are on both lists A and B (P_Ai(w) ∈ I_AB). Let
the number of paired words for word w from language A that are exclusive
to A be represented as π_AA(w). Let the number of paired words
for word w from language A that are exclusive to B be represented as
π_BA(w). Finally, let the number of paired words for word w
from language A that are in both A and B be represented as
π_ABA(w). Optionally, these counts may be weighted by the
frequency of each rank ordered word pair, the frequency of the paired
word, or the frequency of w. Note, in this embodiment, the quantity
π_BA(w)=0, but alternative embodiments may have this nonzero.

[0037] This process is repeated using the paired words from list B.
Similar to above, for a given rank ordered word w, we count the number of
paired words that are exclusive to A (P_Bi(w) ∈ E_A),
the number of paired words that are exclusive to B
(P_Bi(w) ∈ E_B), and the number of paired words that
are on both lists A and B (P_Bi(w) ∈ I_AB). Let
the number of paired words for word w from language B that are exclusive
to A be represented as π_AB(w). Let the number of paired words
for word w from language B that are exclusive to B be represented as
π_BB(w). Finally, let the number of paired words for word w
from language B that are in both A and B be represented as
π_ABB(w). Optionally, these counts may be weighted by the
frequency of each rank ordered word pair, the frequency of the paired
word, or the frequency of w. Note, in this embodiment, the quantity
π_AB(w)=0, but alternative embodiments may have this nonzero.
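
A sketch of the counting in the ninth step, assuming the pairing data
P_A(w) or P_B(w) is held as a dictionary mapping each paired word to its
occurrence count; the names are illustrative:

    def pair_counts(paired_words, E_A, E_B, I_AB):
        """Split the words paired with w among E_A, E_B, and I_AB.

        `paired_words` is P_A(w) or P_B(w) as {paired word: count}.
        Returns the counts (pi_exclusive_A, pi_exclusive_B, pi_both).
        """
        pi_a = pi_b = pi_ab = 0
        for word, count in paired_words.items():
            if word in E_A:
                pi_a += count    # paired word exclusive to language A
            elif word in E_B:
                pi_b += count    # paired word exclusive to language B
            elif word in I_AB:
                pi_ab += count   # paired word common to both languages
        return pi_a, pi_b, pi_ab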

[0038] Tenth, compute a weight for allocating w to either language A,
language B, or both A and B as follows. The preference of allocating w to
language A based on the text assigned to language A is computed as

[0047] The uncertainty for each of the metrics is computed as the square
root of the variance.

[0048] Twelfth, in this embodiment,
ρ_AB(w) = ρ_BA(w) = 0. In this case, there are two
parameters that define the system. Since
ρ_AA(w) + ρ_ABA(w) = 1 and
ρ_BB(w) + ρ_ABB(w) = 1, there are only two
independent parameters. Use the parameters ρ_AA(w) and
ρ_BB(w) to define the system for the word w. These
parameters are on the range 0 ≤ ρ_AA(w) ≤ 1 and
0 ≤ ρ_BB(w) ≤ 1. The point (ρ_AA(w),
ρ_BB(w)) represents the state of the system for the word w.
This point is on the closed space of the unit square.

[0049] The closed space of the unit square is divided into four regions.
Region A is the set of points (ρ_AA(w), ρ_BB(w)) where the word w is
assigned to language A and is removed from language B. Region B is the
set of points (ρ_AA(w), ρ_BB(w)) where the word w is assigned to
language B and is removed from language A. Region AB is the set of
points (ρ_AA(w), ρ_BB(w)) where the word w is assigned to both
language A and language B. Region O is the set of points (ρ_AA(w),
ρ_BB(w)) where the word w is removed from both language A and
language B.

[0050] These regions may be created using just a simple threshold. In this
case, when ρ_AA(w) ≥ ρ_critical, the word w is
assigned to language A. Moreover, when
ρ_BB(w) ≥ ρ_critical, the word w is assigned to
language B.
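
A minimal sketch of the simple threshold variant, assuming a hypothetical
ρ_critical of 0.5; the invention leaves the threshold value and the
region geometry open:

    def assign_word(rho_aa, rho_bb, rho_critical=0.5):
        """Map the point (rho_AA(w), rho_BB(w)) on the unit square to
        region A, B, AB, or O using a simple threshold."""
        in_a = rho_aa >= rho_critical
        in_b = rho_bb >= rho_critical
        if in_a and in_b:
            return "AB"   # keep w on both languages' lists
        if in_a:
            return "A"    # keep w for language A, remove it from B
        if in_b:
            return "B"    # keep w for language B, remove it from A
        return "O"        # remove w from both languages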

[0051] Alternatively, the regions may be created with more complicated
geometries. In this case, the problem of assigning w to a language
results in a multiobjective optimization problem. When language A and B
are not preferred over each other, the geometry of the regions should be
symmetric about the line ρ_AA(w) = ρ_BB(w).
However, when the symmetry between languages A and B is broken, the
geometry of the regions may not be symmetric.

[0052] Based on the location of the point (ρ_AA(w),
ρ_BB(w)), the word w is removed from the list of rank
ordered words for language A and/or B. This step represents the evolution
of the system from an initial set of rank ordered words to a filtered
set.

[0053] Thirteenth, the process is repeated from the eighth step forward
for each word w in the intersection set I_AB.

[0054] Fourteenth, the process is repeated from the sixth step forward for
each pair of languages. If language A and B are treated symmetrically in
the process, then the result of examining language A with B is the same
as examining language B with A. In this case, we may reduce the total
number of language pairs for examination. If there are N languages,
examining every pair requires N² repetitions. If language A and B
are treated symmetrically, then only

N(N-1)/2

examinations are required. This count includes examining a language with
itself. If this is not desired, then an additional N examinations may be
removed, resulting in

N(N-3)/2

examinations.

[0055] Fifteenth, the process is repeated iteratively from the fourth step
forward. Each iteration removes words from each language. This alters the
rank ordered word list for each language. Repeating the process
iteratively converges each language to a fixed list of words assigned to
the language. The final lists for each language may be written out as
computer readable files.

[0056] The steps above are presented here for clarity purposes and are not
intended to limit the invention. Steps may be modified, combined, run in
parallel, or reordered in a variety of ways. This may be done in
particular for the purpose of creating efficient algorithms.

[0057] Word Classifier

[0058] Once a set of rank ordered common words is identified, a word
classifier may be created by checking input text against the rank ordered
common words. The steps for using a word classifier are detailed below.

[0059] First, each list of rank ordered common words is identified.
Preferably, these words are read into RAM in a computer program and
stored therein for fast access. In this case, each word appears uniquely
in a list, and each word is associated with a language and a frequency of
occurrence.

[0060] Second, input text for classification is provided to the
classifier. The text may be a single word or a large document. In fact,
the text may be contained across multiple documents that are intended to
be treated as a single document.

[0061] Third, the input text is processed with the methods used in steps
two and three from the Data Preparation component. By preparing the input
text with the same methods used to prepare the training data, we
assure consistency of treatment, which increases the likelihood that the
normalized inputs are similar to the training inputs. However, some
variances between the methods may be allowed to accommodate differences
between the input and training sets. For example, the input set may be in
a different machine readable format and may require conversion.
Alternatively, the input text may have document section markers that may
be exploited to use the best text for classification. There are many
reasons to treat the input text a little differently, but it is useful to
create normalized input text using a method similar to that used in
creating normalized training text.

[0062] Fourth, each word in the normalized input text is presented to the
list of unique words. The languages associated with the input word are
recorded along with the frequency of occurrence for the word in the
language. Here, each language is associated with a list of words
appearing in the input text associated with the language.

[0063] Fifth, step four is repeated for each word in the normalized input
text. If a word appears more than one time in the input text, the count
of the number of appearances of the word in the input text is recorded.

[0064] Sixth, a weight is computed for each language based on the list of
words in the text associated with the language. The weight may also
incorporate a component based on the number of words appearing in the
input text that are not associated with the language. In one
embodiment, the weight is computed by multiplying the frequencies of
occurrence of each word in the document associated with the language:

Φ_l = ∏_{w_i ∈ I ∩ N_l} f_l(w_i)^ρ_i

where Φ_l is the weight associated with language l, I is the set
of normalized words from the input text, N_l is the set of normalized
words associated with the language, f_l(w_i) is the frequency of
the word w_i in language l, and ρ_i is the number of
occurrences of w_i in the input text.

[0065] In many cases, there are many normalized words associated with each
language. In this case, the product in the above formula contains many
terms. Because 0 ≤ f_l(w_i) ≤ 1, the resulting weight
is often very small. In fact, the resulting weight may be too small to be
represented by a computer using traditional variables. Because of this,
it is preferred to compute the logarithm of the weight. Here, the weight
is computed as

Φ_l = Σ_{w_i ∈ I ∩ N_l} ρ_i ln(f_l(w_i))

This representation is easier to use because the summation typically
remains computable even though the product does not.
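
A sketch of the log domain weight, assuming the input text has already
been normalized and counted into a dictionary; words missing from the
language's list are deferred to the minimum factor correction described
below:

    import math

    def log_weight(input_counts, freqs):
        """Log domain weight for one language.

        `input_counts` maps each normalized input word w_i to its
        occurrence count rho_i; `freqs` maps the language's rank
        ordered words to their frequencies f_l(w_i).
        """
        return sum(rho * math.log(freqs[w])
                   for w, rho in input_counts.items() if w in freqs)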

[0066] In the preferred embodiment, the weight is corrected with a factor
for each word that does not appear in a language. Let f_l be the
minimum weight for any word in language l. Let f be the minimum weight
for any word in any language. A minimum factor for each language is
computed. There are many methods for computing such a factor. Let
μ_l be the minimum factor for language l. Different embodiments
may use different factors. Some typical factors are

μ_l = f_l

μ_l = f_l/K

μ_l = f

μ_l = f/K

[0067] where K is a scaling factor and typically K ≥ 1. Our
experimentation suggests the best mode for the invention is using the
last factor with K=10.
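
The four candidate factors in sketch form, returning the stated best
mode (the global minimum scaled by K=10); the function shape is an
assumption for illustration:

    def minimum_factor(freqs, global_min, K=10.0):
        """Minimum factor for words absent from a language's list.

        `freqs` is the language's word-frequency map; `global_min` is
        the smallest frequency seen across all languages.
        """
        f_l = min(freqs.values())   # minimum frequency in this language
        candidates = (f_l, f_l / K, global_min, global_min / K)
        return candidates[-1]       # reported best mode: global_min / K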

[0068] The minimum factor represents the probability that language l is
not the correct language given that a word is not associated with the
language. The weight based on words not associated with language l is
given by

where N is the total number of normalized words in the input text.
Eighth, the pairwise z-score is computed for each pair of languages as

Z_AB = (Ω_A - Ω_B) / √(σ_ΩA² + σ_ΩB²)

Ninth, sort the weights Ω_l by decreasing weight. The highest
weight is the presumptive language classification for the text. Normalize
the weights according to

Ω̂_i = Ω_i / Σ_{l ∈ L} Ω_l

where L is the set of distinct languages under consideration. The
normalized weights are on the range 0 ≤ Ω̂_i ≤ 1.

[0072] The uncertainties may be normalized as well according to

σ̂²_Ωl = σ²_Ωl / [Σ_{l ∈ L} Ω_l]²

[0073] In the preferred embodiment, the output of the classifier is the
vector of rank ordered weights Ω along with the associated variances
σ²_Ωl.

[0074] Some embodiments desire a single language choice as the output. In
this case, we may simply select the largest Ω_i. Alternatively,
the error analysis may be incorporated into the selection. In this case,
first identify the maximum weight. Let the language associated with the
maximum weight be M. Find all languages i such that

Z_Mi < z_c

where z_c is some threshold z-score. In this case we have identified
all languages whose weights are statistically the same as that of
language M. From these, select the language that has the minimum value
of σ²_Ωl. This represents the language that is statistically among the
best and has the least uncertainty in the value of its weight.
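
A sketch of this selection rule, assuming weights and variances are held
in dictionaries keyed by language and a hypothetical threshold z_c; the
small epsilon guarding the denominator is an implementation assumption:

    import math

    def select_language(weights, variances, z_c=1.96):
        """Single-language choice with z-score tie breaking."""
        M = max(weights, key=weights.get)   # presumptive classification
        # Languages statistically indistinguishable from language M.
        tied = [l for l in weights
                if abs(weights[M] - weights[l])
                   / math.sqrt(variances[M] + variances[l] + 1e-12) < z_c]
        # Among the tied languages, prefer the weight with the least
        # uncertainty.
        return min(tied, key=variances.get)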

[0075] The steps above are presented here for clarity purposes and are not
intended to limit the invention. Steps may be modified, combined, run in
parallel, or reordered in a variety of ways. This may be done in
particular for the purpose of creating efficient algorithms.

[0076] Language Identification on Letters

[0077] Another approach to identifying the language associated with some
input text is by examining the letters present in the input text. This
Letter Classifier may be constructed in a manner similar to the Word
Classifier described above.

[0078] Data Preparation

[0079] A language classifier may be enhanced by compiling a list of
letters associated with each particular language. This section details
the preparation phase for such data. This section assumes the existence
of some set of machine readable documents where each document is
associated with a principal language. These documents may have other
language text embedded within. Alternatively, some documents may be
associated with one language while the text is predominately or even
entirely in another language. The process described in this section is
capable of determining which letters are associated with each language
even when some of the input documents have other languages, or even when
documents are incorrectly associated with one language but written
entirely in another language. Based on this input, the process produces
lists of common letters for each language. These lists may be used to
enhance the language classifiers described in the next sections.

[0080] The text used here is often called training text. This text is used
to create or train language classifiers and is distinguished from input
text that is presented to a classifier for the purpose of determining the
underlying language of the text.

[0081] First, identify text documents that are associated with each
language. Our initial investigations lead us to believe that 100-1000
such documents are sufficient when there are at least 10 letters in each
document. Shorter documents may be included in this set, but longer
documents are preferred. If only short documents are available, we
recommend 500-5000 documents.

[0082] Second, for each language, parse each document into a set of
letters. Normalize each letter by case-folding. Simple case-folding may
be implemented as making all characters lower case. However, in some
languages this process is ambiguous. Another method is to first make all
letters upper case, then make the result lower case. This addresses many
problems encountered when using Unicode to represent the characters. The
use of Unicode is highly recommended as Unicode supports a wide-variety
of language scripts.

[0083] Also part of this step is the removal of punctuation. Symbols such
as `.`, `;`, `!`, `@`, `#`, `$`, `%`, ` `, `*`, `(`,`)`, `{`,`}`,
`[`,`]`, `\`, `:`, `?`, `<`, `>`, `/`, `"`, `|`, `˜`, `+`,
`-` and `'` are a few of the symbols that may be removed from the text.
It should be appreciated that removal of punctuation may include other
symbols than those presented here, combination of symbols may be used
(where two or more symbols appear together), or some of the above symbols
may be removed. In the simplest case, removing punctuation symbols may
use no symbols at all in which case this part of the step is ignored.

[0084] Third, count the number of appearances of each normalized letter.
Normalize this by dividing each count by the total number of letters in
all documents for the particular language. The normalized value is the
frequency of the letter in that language. The frequencies of all letters
in a given language should sum to one.

[0085] Fourth, rank order the letter list for each language from highest
frequency to lowest frequency. Specify a cutoff value to truncate the
letter list. The cutoff value may be expressed as a letter frequency, or
it may be a total number of letters. Alternatively, all letters may be
used.
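
The letter pipeline differs from the word pipeline only in how documents
are tokenized. A sketch of the counting step for letters, under the same
assumptions as the word sketch above:

    import string
    from collections import Counter

    def letter_frequencies(documents, cutoff=None):
        """Rank ordered letter frequencies for one language."""
        counts = Counter()
        for text in documents:
            text = text.translate(str.maketrans('', '', string.punctuation))
            # Case-fold letter by letter; whitespace is not a letter.
            counts.update(ch.upper().lower()
                          for ch in text if not ch.isspace())
        total = sum(counts.values())
        return [(ch, c / total) for ch, c in counts.most_common(cutoff)]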

[0086] Fifth, for each language, record the pairing of each rank ordered
letter (letters surviving the cutoff) with the previous and next
normalized letters in each document. If the next or previous normalized
letter is not a rank ordered letter, skip the occurrence. If the next
normalized letter is a rank ordered letter, count the number of times
this letter combination appears. The pairing data for language A is
represented as P_A(w) while the pairing data for language B is
represented as P_B(w). This notation means that given a particular
letter w, P_A(w) is the list of rank ordered letters that are paired
with w. This may also include the frequency count of the pairing as well.

[0087] Sixth, for each pair of languages, create the union set of the rank
ordered letter lists for both the languages. The union set is the set of
unique letters that appear in either set. Thus, if one set has letters A
and B, and the other set has letters B and C, the union set is A, B, and
C. Note that B appears only once in the union set because the union set
is a set of unique letters.

[0088] Let R_A and R_B be the rank ordered letter lists of the two
languages. The union set is expressed as U_AB = R_A ∪ R_B.

[0089] Seventh, identify the intersection of letters between the
languages. The intersection is the set of unique letters that appear in
both languages. Thus, if one set has letters A and B, and the other set
has letters B and C, the intersection set is B.

[0090] Let R_A and R_B be the rank ordered letter lists of the two
languages. The intersection set is expressed as I_AB = R_A ∩ R_B.

[0091] Eighth, identify the letters that are exclusive to each language in
the language pair. These are the letters that appear on the rank ordered
letter list for one language but not the other. The exclusive letter list
for each language may be computed from the previous results. The
exclusive letters for language A are E_A = R_A - I_AB. The
exclusive letters for language B are E_B = R_B - I_AB.

[0092] Ninth, examine each of the rank ordered letters that are common to
the two languages. This is the intersection I_AB. For each rank
ordered letter w, examine the list of letter pairings for each language
(P_A(w) and P_B(w)). For each paired letter in P_A(w),
determine if the letter is exclusive to A, exclusive to B, or is on both
lists. Mathematically, let P_Ai(w) be the ith rank ordered letter
paired with w for language A. Since the sets E_A, E_B, and
I_AB are mutually exclusive (I_AB ∩ E_A = ∅,
I_AB ∩ E_B = ∅, and E_B ∩ E_A = ∅), then exactly
one of three choices must be true: P_Ai(w) ∈ E_A,
P_Ai(w) ∈ E_B, or P_Ai(w) ∈ I_AB.

[0093] For a given rank ordered letter w, we count the number of paired
letters that are exclusive to A (P_Ai(w) ∈ E_A), the
number of paired letters that are exclusive to B
(P_Ai(w) ∈ E_B), and the number of paired letters that
are on both lists A and B (P_Ai(w) ∈ I_AB). Let
the number of paired letters for letter w from language A that are
exclusive to A be represented as π_AA(w). Let the number of
paired letters for letter w from language A that are exclusive to B be
represented as π_BA(w). Finally, let the number of paired
letters for letter w from language A that are in both A and B be
represented as π_ABA(w). Optionally, these counts may be
weighted by the frequency of each rank ordered letter pair, the frequency
of the paired letter, or the frequency of w. Note, in this embodiment,
the quantity π_BA(w)=0, but alternative embodiments may have
this nonzero.

[0094] This process is repeated using the paired letters from list B.
Similar to above, for a given rank ordered letter w, we count the number
of paired letters that are exclusive to A
(P_Bi(w) ∈ E_A), the number of paired letters that are
exclusive to B (P_Bi(w) ∈ E_B), and the number of
paired letters that are on both lists A and B
(P_Bi(w) ∈ I_AB). Let the number of paired
letters for letter w from language B that are exclusive to A be
represented as π_AB(w). Let the number of paired letters for
letter w from language B that are exclusive to B be represented as
π_BB(w). Finally, let the number of paired letters for letter
w from language B that are in both A and B be represented as
π_ABB(w). Optionally, these counts may be weighted by the
frequency of each rank ordered letter pair, the frequency of the paired
letter, or the frequency of w. Note, in this embodiment, the quantity
π_AB(w)=0, but alternative embodiments may have this nonzero.

[0095] Tenth, compute a weight for allocating w to either language A,
language B, or both A and B as follows. The preference of allocating w to
language A based on the text assigned to language A is computed as

[0104] The uncertainty for each of the metrics is computed as the square
root of the variance.

[0105] Twelfth, in this embodiment,
ρ_AB(w) = ρ_BA(w) = 0. In this case, there are two
parameters that define the system. Since
ρ_AA(w) + ρ_ABA(w) = 1 and
ρ_BB(w) + ρ_ABB(w) = 1, there are only two
independent parameters. Use the parameters ρ_AA(w) and
ρ_BB(w) to define the system for the letter w. These
parameters are on the range 0 ≤ ρ_AA(w) ≤ 1 and
0 ≤ ρ_BB(w) ≤ 1. The point (ρ_AA(w),
ρ_BB(w)) represents the state of the system for the letter
w. This point is on the closed space of the unit square.

[0106] The closed space of the unit square is divided into four regions.
Region A is the set of points (ρ_AA(w), ρ_BB(w)) where the letter w
is assigned to language A and is removed from language B. Region B is
the set of points (ρ_AA(w), ρ_BB(w)) where the letter w is assigned
to language B and is removed from language A. Region AB is the set of
points (ρ_AA(w), ρ_BB(w)) where the letter w is assigned to both
language A and language B. Region O is the set of points (ρ_AA(w),
ρ_BB(w)) where the letter w is removed from both language A and
language B.

[0107] These regions may be created using just a simple threshold. In this
case, when ρ_AA(w) ≥ ρ_critical, the letter w
is assigned to language A. Moreover, when
ρ_BB(w) ≥ ρ_critical, the letter w is assigned
to language B.

[0108] Alternatively, the regions may be created with more complicated
geometries. In this case, the problem of assigning w to a language
results in a multiobjective optimization problem. When language A and B
are not preferred over each other, the geometry of the regions should be
symmetric about the line ρ_AA(w) = ρ_BB(w).
However, when the symmetry between languages A and B is broken, the
geometry of the regions may not be symmetric.

[0109] Based on the location of the point (ρ_AA(w),
ρ_BB(w)), the letter w is removed from the list of rank
ordered letters for language A and/or B. This step represents the
evolution of the system from an initial set of rank ordered letters to a
filtered set.

[0110] Thirteenth, the process is repeated from the eighth step forward
for each letter w in the intersection set I_AB.

[0111] Fourteenth, the process is repeated from the sixth step forward for
each pair of languages. If language A and B are treated symmetrically in
the process, then the result of examining language A with B is the same
as examining language B with A. In this case, we may reduce the total
number of language pairs for examination. If there are N languages,
examining every pair requires N² repetitions. If language A and B
are treated symmetrically, then only

N(N-1)/2

examinations are required. This count includes examining a language with
itself. If this is not desired, then an additional N examinations may be
removed, resulting in

N(N-3)/2

examinations.

[0112] Fifteenth, the process is repeated iteratively from the fourth step
forward. Each iteration removes letters from each language. This alters
the rank ordered letter list for each language. Repeating the process
iteratively converges each language to a fixed list of letters assigned
to the language. The final lists for each language may be written out as
computer readable files.

[0113] The steps above are presented here for clarity purposes and are not
intended to limit the invention. Steps may be modified, combined, run in
parallel, or reordered in a variety of ways. This may be done in
particular for the purpose of creating efficient algorithms.

[0114] Letter Classifier

[0115] Once a set of rank ordered common letters is identified, a letter
classifier may be created by checking input text against the rank ordered
common letters. The steps for using a letter classifier are detailed
below.

[0116] First, each list of rank ordered common letters is identified.
Preferably, these letters are read into RAM in a computer program and
stored therein for fast access. In this case, each letter appears
uniquely in a list, and each letter is associated with a language and a
frequency of occurrence.

[0117] Second, input text for classification is provided to the
classifier. The text may be a single letter or a large document. In fact,
the text may be contained across multiple documents that are intended to
be treated as a single document.

[0118] Third, the input text is processed with the methods used in steps
two and three from the Data Preparation component. By preparing the input
text with the same methods used to prepare the training data, we
assure consistency of treatment, which increases the likelihood that the
normalized inputs are similar to the training inputs. However, some
variances between the methods may be allowed to accommodate differences
between the input and training sets. For example, the input set may be in
a different machine readable format and may require conversion.
Alternatively, the input text may have document section markers that may
be exploited to use the best text for classification. There are many
reasons to treat the input text a little differently, but it is useful to
create normalized input text using a method similar to that used in
creating normalized training text.

[0119] Fourth, each letter in the normalized input text is presented to
the list of unique letters. The languages associated with the input
letter are recorded along with the frequency of occurrence for the letter
in the language. Here, each language is associated with a list of letters
appearing in the input text associated with the language.

[0120] Fifth, step four is repeated for each letter in the normalized
input text. If a letter appears more than one time in the input text, the
count of the number of appearances of the letter in the input text is
recorded.

[0121] Sixth, a weight is computed for each language based on the list of
letters in the text associated with the language. The weight may also
incorporate a component based on the number of letters appearing in the
input text that are not associated with the language. In one
embodiment, the weight is computed by multiplying the frequencies of
occurrence of each letter in the document associated with the language:

Φ_l = ∏_{w_i ∈ I ∩ N_l} f_l(w_i)^ρ_i

where Φ_l is the weight associated with language l, I is the set
of normalized letters from the input text, N_l is the set of
normalized letters associated with the language, f_l(w_i) is the
frequency of the letter w_i in language l, and ρ_i is the
number of occurrences of w_i in the input text.

[0122] In many cases, there are many normalized letters associated with
each language. In this case, the product in the above formula contains
many terms. Because 0 ≤ f_l(w_i) ≤ 1, the resulting
weight is often very small. In fact, the resulting weight may be too
small to be represented by a computer using traditional variables.
Because of this, it is preferred to compute the logarithm of the weight.
Here, the weight is computed as

Φ_l = Σ_{w_i ∈ I ∩ N_l} ρ_i ln(f_l(w_i))

This representation is easier to use because the summation typically
remains computable even though the product does not.

[0123] In the preferred embodiment, the weight is corrected with a factor
for each letter that does not appear in a language. Let f_l be the
minimum weight for any letter in language l. Let f be the minimum weight
for any letter in any language. A minimum factor for each language is
computed. There are many methods for computing such a factor. Let
μl be the minimum factor for language l. Different embodiments
may use different factors. Some typical factors are

μ_l = f_l

μ_l = f_l/K

μ_l = f

μ_l = f/K

where K is a scaling factor and typically K ≥ 1. Our experimentation
suggests the best mode for the invention is using the last factor with
K=10.

[0124] The minimum factor represents the probability that language l is
not the correct language given that a letter is not associated with the
language. The weight based on letters not associated with language l is
given by

where N is the total number of normalized letters in the input text.
Eighth, the pairwise z-score is computed for each pair of languages as

Z_AB = (Ω_A - Ω_B) / √(σ_ΩA² + σ_ΩB²)

Ninth, sort the weights Ω_l by decreasing weight. The highest
weight is the presumptive language classification for the text. Normalize
the weights according to

Ω̂_i = Ω_i / Σ_{l ∈ L} Ω_l

where L is the set of distinct languages under consideration. The
normalized weights are on the range 0 ≤ Ω̂_i ≤ 1.

[0128] The uncertainties may be normalized as well according to

σ̂²_Ωl = σ²_Ωl / [Σ_{l ∈ L} Ω_l]²

[0129] In the preferred embodiment, the output of the classifier is the
vector of rank ordered weights Ω along with the associated variances
σ²_Ωl.

[0130] Some embodiments desire a single language choice as the output. In
this case, we may simply select the largest Ω_i. Alternatively,
the error analysis may be incorporated into the selection. In this case,
first identify the maximum weight. Let the language associated with the
maximum weight be M. Find all languages i such that

Z_Mi < z_c

where z_c is some threshold z-score. In this case we have identified
all languages whose weights are statistically the same as that of
language M. From these, select the language that has the minimum value
of σ²_Ωl. This represents the language that is statistically among the
best and has the least uncertainty in the value of its weight.

[0131] The steps above are presented here for clarity purposes and are not
intended to limit the invention. Steps may be modified, combined, run in
parallel, or reordered in a variety of ways. This may be done in
particular for the purpose of creating efficient algorithms.

[0132] In constructing the Letter Classifier, the process for Data
Preparation is modified. Rather than breaking the training data into
individual words, in this case we break the training data into
individual letters. The overall process for preparing the data proceeds
through the same steps. However, everywhere that the original Data
Preparation refers to words, substitute letters.

[0133] Language Identification on Patterns

[0134] Language identification on patterns generalizes the processes
described above for letters and words. Here, patterns may be individual
words, individual letters, or more complicated structures.

[0135] Data Preparation

[0136] A language classifier is often enhanced by compiling a list of
patterns associated with each particular language. This section details
the preparation phase for such data. This section assumes the existence
of some set of machine readable documents where each document is
associated with a principal language. These documents may have other
language text embedded within. Alternatively, some documents may be
associated with one language while the text is predominately or even
entirely in another language. The process described in this section is
capable of determining which patterns are associated with each language
even when some of the input documents have other languages, or even when
documents are incorrectly associated with one language but written
entirely in another language. Based on this input, the process produces
lists of common patterns for each language. These lists may be used to
enhance the language classifiers described in the next sections.

[0137] The text used here is often called training text. This text is used
to create or train language classifiers and is distinguished from input
text that is presented to a classifier for the purpose of determining the
underlying language of the text.

[0138] Zeroth, identify the patterns of interest. A pattern may be as
simple as individual words or letters. In this respect, a pattern
classifier generalizes the aforementioned classifiers because a pattern
classifier may reduce to either of these classifiers.

[0139] However, a pattern classifier allows additional flexibility. For
example, a pattern may be two words in a sequence. In this case, rather
than examining individual words, we examine word pairs. Alternatively, a
pattern may be two letters in sequence. Again, rather than examining each
letter in isolation, we examine pairs of letters.

[0140] Moreover, patterns are allowed to contain wildcard slots. For
example, a letter pattern such as `a*b` matches three letter sequences
that begin with the letter `a`, contain any other letter next, then have
the letter `b`. Similarly, the word sequence `my,*,dog` looks for three
words in sequence where the first word is `my`, followed by any word,
followed by the word `dog`.

[0141] Patterns may mix word and letter sequences. For example, the
pattern `my,*,dog*` contains a wildcard word for the second word, and a
wildcard letter at the end of the third word. This pattern matches both
`my happy dog` and `my large dogs`.

[0142] In this preliminary step, the patterns under examination are
identified. Patterns may be specified in a particular format such as
`my,*,dog*`, or in a general format such as `w,w` where w here is meant
to represent any word. The pattern `w,w` is interpreted as examining all
patterns of two words in sequence.
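
One plausible way to realize such patterns is to translate them into
regular expressions. The sketch below assumes the comma separated format
shown above, with a bare `*` slot matching any whole word and a `*`
inside a word matching any run of letters; these translation rules are
illustrative, not prescribed by the invention:

    import re

    def pattern_to_regex(pattern):
        """Compile a word pattern such as `my,*,dog*` to a regex."""
        parts = []
        for slot in pattern.split(','):
            if slot == '*':
                parts.append(r'\w+')   # wildcard slot: any whole word
            else:
                # A `*` inside a word matches any run of word characters.
                parts.append(re.escape(slot).replace(r'\*', r'\w*'))
        return re.compile(r'\b' + r'\s+'.join(parts) + r'\b')

    matcher = pattern_to_regex('my,*,dog*')
    assert matcher.search('my happy dog') and matcher.search('my large dogs')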

[0143] Alternatively, patterns may be identified in step three below based
on the contents of the training documents. Here, the system discovers
patterns based on examining the training documents. This may be
implemented with a variety of artificial intelligence techniques such as
neural networks, genetic algorithms, statistical learning, expert
systems, or other artificial intelligence technique.

[0144] Handling of overlapping patterns should be addressed as well. For
example, when examining word pairs, the sentence `my dog is happy` may be
interpreted as containing the two patterns `my dog` and `is happy`. Here,
the two word patterns are not allowed to overlap. Thus, once one pattern
is identified, the text associated with that pattern is not allowed to
participate in another pattern. Alternatively, the sentence `my dog is
happy` may be interpreted as the three patterns `my dog`, `dog is`, and
`is happy`. Here, the two word patterns are allowed to overlap.
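
A sketch of both interpretations for two-word patterns; the extraction
helper is an assumption about one possible implementation:

    def word_pairs(text, overlap=True):
        """Extract two-word patterns, with or without overlap."""
        words = text.split()
        # Non-overlapping extraction consumes both words of each pair.
        step = 1 if overlap else 2
        return [' '.join(words[i:i + 2])
                for i in range(0, len(words) - 1, step)]

    assert word_pairs('my dog is happy') == ['my dog', 'dog is', 'is happy']
    assert word_pairs('my dog is happy', overlap=False) == ['my dog', 'is happy']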

[0145] First, identify text documents that are associated with each
language. Our initial investigations lead us to believe that 100-1000
such documents are sufficient when there are at least 10 patterns in each
document. Shorter documents may be included in this set, but longer
documents are preferred. If only short documents are available, we
recommend 500-5000 documents.

[0146] Second, for each language, parse each document into a set of
patterns. Normalize each pattern by case-folding. Simple case-folding may
be implemented as making all characters lower case. However, in some
languages this process is ambiguous. Another method is to first make all
letters upper case, then make the result lower case. This addresses many
problems encountered when using Unicode to represent the characters. The
use of Unicode is highly recommended as Unicode supports a wide-variety
of language scripts.

[0147] Also part of this step is the removal of punctuation. Symbols such
as `.`, `;`, `!`, `@`, `#`, `$`, `%`, ` `, `*`, `(`,`)`, `{`,`}`,
`[`,`]`, `\`, `:`, `?`, `<`, `>`, `/`, `"`, `|`, `˜`, `+`,
`-` and `'` are a few of the symbols that may be removed from the text.
It should be appreciated that removal of punctuation may include other
symbols than those presented here, combination of symbols may be used
(where two or more symbols appear together), or some of the above symbols
may be removed. In the simplest case, removing punctuation symbols may
use no symbols at all in which case this part of the step is ignored.

[0148] Third, count the number of appearances of each normalized pattern.
Normalize this by dividing each count by the total number of patterns in
all documents for the particular language. The normalized value is the
frequency of the pattern in that language. The frequencies of all
patterns in a given language should sum to one.

[0149] Fourth, rank order the pattern list for each language from highest
frequency to lowest frequency. Specify a cutoff value to truncate the
pattern list. The cutoff value may be expressed as a pattern frequency,
or it may be a total number of patterns. Alternatively, all patterns may
be used.

[0150] Fifth, for each language, record the pairing of each rank ordered
pattern (patterns surviving the cutoff) with the previous and next
normalized patterns in each document. If the next or previous normalized
pattern is not a rank ordered pattern, skip the occurrence. If the next
normalized pattern is a rank ordered pattern, count the number of times
this pattern combination appears. The pairing data for language A is
represented as P_A(w) while the pairing data for language B is
represented as P_B(w). This notation means that given a particular
pattern w, P_A(w) is the list of rank ordered patterns that are
paired with w. This may also include the frequency count of the pairing
as well.

[0151] Sixth, for each pair of languages, create the union set of the rank
ordered pattern lists for both the languages. The union set is the set of
unique patterns that appear in either set. Thus, if one set has patterns
A and B, and the other set has patterns B and C, the union set is A, B,
and C. Note that B appears only once in the union set because the union
set is a set of unique patterns.

[0152] Let R_A and R_B be the rank ordered pattern lists of the
two languages. The union set is expressed as
U_AB = R_A ∪ R_B.

[0153] Seventh, identify the intersection of patterns between the
languages. The intersection is the set of unique patterns that appear in
both languages. Thus, if one set has patterns A and B, and the other set
has patterns B and C, the intersection set is B.

[0154] Let R_A and R_B be the rank ordered pattern lists of the
two languages. The intersection set is expressed as
I_AB = R_A ∩ R_B.

[0155] Eighth, identify the patterns that are exclusive to each language
in the language pair. These are the patterns that appear on the rank
ordered pattern list for one language but not the other. The exclusive
pattern list for each language may be computed from the previous results.
The exclusive patterns for language A are E_A = R_A - I_AB. The
exclusive patterns for language B are E_B = R_B - I_AB.

[0156] Ninth, examine each of the rank ordered patterns that are common to
the two languages. This is the intersection I_AB. For each rank
ordered pattern w, examine the list of pattern pairings for each language
(P_A(w) and P_B(w)). For each paired pattern in P_A(w),
determine if the pattern is exclusive to A, exclusive to B, or is on both
lists. Mathematically, let P_Ai(w) be the ith rank ordered pattern
paired with w for language A. Since the sets E_A, E_B, and
I_AB are mutually exclusive (I_AB ∩ E_A = ∅,
I_AB ∩ E_B = ∅, and E_B ∩ E_A = ∅), then exactly
one of three choices must be true: P_Ai(w) ∈ E_A,
P_Ai(w) ∈ E_B, or P_Ai(w) ∈ I_AB.

[0157] For a given rank ordered pattern w, we count the number of paired
patterns that are exclusive to A (P_Ai(w) ∈ E_A), the
number of paired patterns that are exclusive to B
(P_Ai(w) ∈ E_B), and the number of paired patterns
that are on both lists A and B (P_Ai(w) ∈ I_AB).
Let the number of paired patterns for pattern w from language A
that are exclusive to A be represented as π_AA(w). Let the number
of paired patterns for pattern w from language A that are exclusive to B
be represented as π_BA(w). Finally, let the number of paired
patterns for pattern w from language A that are in both A and B be
represented as π_ABA(w). Optionally, these counts may be
weighted by the frequency of each rank ordered pattern pair, the
frequency of the paired pattern, or the frequency of w. Note, in this
embodiment, the quantity π_BA(w)=0, but alternative
embodiments may have this nonzero.

[0158] This process is repeated using the paired patterns from list B.
Similar to above, for a given rank ordered pattern w, we count the number
of paired patterns that are exclusive to A
(P_Bi(w) ∈ E_A), the number of paired patterns that
are exclusive to B (P_Bi(w) ∈ E_B), and the number
of paired patterns that are on both lists A and B
(P_Bi(w) ∈ I_AB). Let the number of paired
patterns for pattern w from language B that are exclusive to A be
represented as π_AB(w). Let the number of paired patterns for
pattern w from language B that are exclusive to B be represented as
π_BB(w). Finally, let the number of paired patterns for
pattern w from language B that are in both A and B be represented as
π_ABB(w). Optionally, these counts may be weighted by the
frequency of each rank ordered pattern pair, the frequency of the paired
pattern, or the frequency of w. Note, in this embodiment, the quantity
π_AB(w)=0, but alternative embodiments may have this nonzero.

[0159] Tenth, compute a weight for allocating w to either language A,
language B, or both A and B as follows. The preference of allocating w to
language A based on the text assigned to language A is computed as

[0168] The uncertainty for each of the metrics is computed as the square
root of the variance.

[0169] Twelfth, in this embodiment,
ρ_AB(w) = ρ_BA(w) = 0. In this case, there are two
parameters that define the system. Since
ρ_AA(w) + ρ_ABA(w) = 1 and
ρ_BB(w) + ρ_ABB(w) = 1, there are only two
independent parameters. Use the parameters ρ_AA(w) and
ρ_BB(w) to define the system for the pattern w. These
parameters are on the range 0 ≤ ρ_AA(w) ≤ 1 and
0 ≤ ρ_BB(w) ≤ 1. The point (ρ_AA(w),
ρ_BB(w)) represents the state of the system for the pattern
w. This point is on the closed space of the unit square.

[0170] The closed space of the unit square is divided into four regions.
Region A is the set of points (ρ_AA(w), ρ_BB(w)) where the pattern w is
assigned to language A and is removed from language B. Region B is the
set of points (ρ_AA(w), ρ_BB(w)) where the pattern w is assigned to
language B and is removed from language A. Region AB is the set of points
(ρ_AA(w), ρ_BB(w)) where the pattern w is assigned to both language A and
language B. Region O is the set of points (ρ_AA(w), ρ_BB(w)) where the
pattern w is removed from both language A and language B.

[0171] These regions may be created using just a simple threshold. In this
case, when ρ_AA(w) ≥ P_critical, the pattern w is assigned to language A,
and when ρ_BB(w) ≥ P_critical, the pattern w is assigned to language B.
When both conditions hold the point falls in region AB, and when neither
holds it falls in region O.
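
Under the simple threshold geometry, the region test reduces to two
comparisons. A minimal sketch follows; the threshold value 0.5 is an
illustrative assumption, since the specification leaves P_critical
tunable:

    P_CRITICAL = 0.5   # assumed value; the specification leaves this tunable

    def assign_region(rho_AA, rho_BB):
        """Map the state point (rho_AA(w), rho_BB(w)) to one of the regions."""
        in_A = rho_AA >= P_CRITICAL
        in_B = rho_BB >= P_CRITICAL
        if in_A and in_B:
            return "AB"       # keep w on both language lists
        if in_A:
            return "A"        # keep w for A, remove from B
        if in_B:
            return "B"        # keep w for B, remove from A
        return "O"            # remove w from both languages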

[0172] Alternatively, the regions may be created with more complicated
geometries. In this case, the problem of assigning w to a language
results in a multiobjective optimization problem. When languages A and B
are not preferred over each other, the geometry of the regions should be
symmetric about the line ρ_AA(w) = ρ_BB(w). However, when the symmetry
between languages A and B is broken, the geometry of the regions may not
be symmetric.

[0173] Based on the location of the point (ρ_AA(w), ρ_BB(w)), the pattern
w is removed from the list of rank ordered patterns for language A and/or
B. This step represents the evolution of the system from an initial set
of rank ordered patterns to a filtered set.

[0174] Thirteenth, the process is repeated from the eighth step forward
for each pattern w in the intersection set I_AB.

[0175] Fourteenth, the process is repeated from the sixth step forward for
each pair of languages. If languages A and B are treated symmetrically in
the process, then the result of examining language A with B is the same
as examining language B with A. In this case, we may reduce the total
number of language pairs for examination. If there are N languages,
examining every ordered pair requires N^2 repetitions. If languages A and
B are treated symmetrically, then only

N(N + 1)/2

examinations are required. This count includes examining a language with
itself. If this is not desired, then an additional N examinations may be
removed, resulting in

N(N - 1)/2

examinations.
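
For example, with N = 10 languages, exhaustive examination of every
ordered pair takes 10^2 = 100 repetitions, symmetric examination
including self-pairs takes 10·11/2 = 55 examinations, and symmetric
examination excluding self-pairs takes 10·9/2 = 45 examinations.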

[0176] Fifteenth, the process is repeated iteratively from the fourth step
forward. Each iteration removes patterns from each language. This alters
the rank ordered pattern list for each language. Repeating the process
iteratively causes each language to converge to a fixed list of patterns
assigned to the language. The final lists for each language may be
written out as computer readable files.

[0177] The steps above are presented here for clarity purposes and are not
intended to limit the invention. Steps may be modified, combined, run in
parallel, or reordered in a variety of ways. This may be done in
particular for the purpose of creating efficient algorithms.

[0178] Pattern Classifier

[0179] Once a set of rank ordered common patterns is identified, a pattern
classifier may be created by checking input text against the rank ordered
common patterns. The steps for using a pattern classifier are detailed
below.

[0180] First, each list of rank ordered common patterns is identified.
Preferably, these patterns are read into RAM in a computer program and
stored therein for fast access. In this case, each pattern appears
uniquely in a list, and each pattern is associated with a language and a
frequency of occurrence.

[0181] Second, input text for classification is provided to the
classifier. The text may be a single pattern or a large document. In
fact, the text may be contained across multiple documents that are
intended to be treated as a single document.

[0182] Third, the input text is processed with the methods used in steps
two and three of the Data Preparation component. By preparing the input
text with the same methods used to prepare the training data, we assure
consistency of treatment, which increases the likelihood that the
normalized inputs are similar to the training inputs. However, some
variances between the methods may be allowed to accommodate differences
between the input and training sets. For example, the input set may be in
a different machine readable format and may require conversion.
Alternatively, the input text may have document section markers that may
be exploited to use the best text for classification. There are many
reasons to treat the input text a little differently, but it is useful to
create normalized input text using a method similar to that used in
creating normalized training text.

[0183] Fourth, each pattern in the normalized input text is checked
against the list of unique patterns. The languages associated with the
input pattern are recorded along with the frequency of occurrence of the
pattern in each language. As a result, each language is associated with
the list of patterns appearing in the input text that belong to that
language.

[0184] Fifth, step four is repeated for each pattern in the normalized
input text. If a pattern appears more than once in the input text, the
count of the number of appearances of the pattern in the input text is
recorded.

[0185] Sixth, a weight is computed for each language based on the list of
patterns in the text associated with the language. The weight may also
incorporate a component based on the number of patterns appearing in the
input text that are not associated with the language. In one embodiment,
the weight is computed by multiplying the frequencies of occurrence of
each pattern in the document associated with the language:

Φ_l = ∏_{w_i ∈ I ∩ N_l} f_l(w_i)^ρ_i

where Φ_l is the weight associated with language l, I is the set of
normalized patterns from the input text, N_l is the set of normalized
patterns associated with the language, f_l(w_i) is the frequency
of the pattern w_i in language l, and ρ_i is the number of
occurrences of w_i in the input text.

[0186] In many cases, there are many normalized patterns associated with
each language. In this case, the product in the above formula contains
many terms. Because 0 ≤ f_l(w_i) ≤ 1, the resulting weight is often very
small. In fact, the resulting weight may be too small to be represented
by a computer using standard floating-point variables. Because of this,
it is preferred to compute the logarithm of the weight. Here, the weight
is computed as

Φ_l = Σ_{w_i ∈ I ∩ N_l} ρ_i ln(f_l(w_i))

This representation is easier to use because the summation typically
remains computable even though the product does not.
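
A minimal sketch of the log-domain computation follows, assuming the
language model is a dict freq mapping each of the language's patterns to
f_l(w) and the input is a dict counts mapping each normalized input
pattern to its count ρ; both names are illustrative:

    import math

    def log_weight(counts, freq):
        """Compute Phi_l = sum of rho_i * ln(f_l(w_i)) over matched patterns."""
        phi = 0.0
        for w, rho in counts.items():
            if w in freq:                        # w is associated with language l
                phi += rho * math.log(freq[w])   # rho_i * ln f_l(w_i)
        return phi

Working in the log domain turns the underflow-prone product into a sum
of moderately sized negative terms.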

[0187] In the preferred embodiment, the weight is corrected with a factor
for each pattern that does not appear in a language. Let f_l be the
minimum frequency of any pattern in language l. Let f be the minimum
frequency of any pattern in any language. A minimum factor for each
language is computed. There are many methods for computing such a factor.
Let μ_l be the minimum factor for language l. Different embodiments
may use different factors. Some typical factors are

μ_l = f_l

μ_l = f_l / K

μ_l = f

μ_l = f / K

where K is a scaling factor and typically K ≥ 1. Our experimentation
suggests the best mode for the invention is using the last factor with
K = 10.

[0188] The minimum factor represents the probability that language l is
not the correct language given that a pattern is not associated with the
language. The weight based on patterns not associated with language l is
given by

Φ'_l = (N - Σ_{w_i ∈ I ∩ N_l} ρ_i) ln(μ_l)

where N is the total number of normalized patterns in the input text, so
that each input pattern not associated with language l contributes
ln(μ_l), and the total weight for language l is Ω_l = Φ_l + Φ'_l.
Eighth, the pairwise z-score is computed for each pair of languages as

Z_AB = (Ω_A - Ω_B) / √(σ_ΩA² + σ_ΩB²)

Ninth, sort the weights Ω_l by decreasing weight. The highest
weight is the presumptive language classification for the text. Normalize
the weights according to

Ω̂_i = Ω_i / Σ_{l ∈ L} Ω_l

where L is the set of distinct languages under consideration. The
normalized weights are on the range 0 ≤ Ω̂_i ≤ 1.

[0192] The uncertainties may be normalized as well according to

σ̂_Ωl² = σ_Ωl² / [Σ_{l ∈ L} Ω_l]²
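
Both normalizations divide by the same total, so they may be computed in
one pass. A short sketch, under the assumption that the raw weights and
variances are held in parallel lists:

    def normalize(weights, variances):
        """Normalize weights by their sum and variances by the squared sum."""
        total = sum(weights)
        norm_w = [w / total for w in weights]          # Omega-hat values
        norm_v = [v / total ** 2 for v in variances]   # sigma-hat squared values
        return norm_w, norm_v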

[0193] In the preferred embodiment, the output of the classifier is the
rank ordered normalized weights Ω̂_l along with the vector of associated
variances σ̂_Ωl².

[0194] Some embodiments desire a single language choice as the output. In
this case, we may simply select the language with the largest Ω̂_l.
Alternatively, the error analysis may be incorporated into the selection.
In this case, first identify the maximum weight. Let the language
associated with the maximum weight be M. Find all languages i such that

Z_Mi < z_c

where z_c is some threshold z-score. In this case we have identified all
languages whose weights are statistically the same as that of language M.
From these, select the language that has the minimum value of σ̂_Ωl².
This represents the language that is considered statistically the best
while having the least uncertainty in the value of the weight.
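
The error-aware selection can be summarized in a few lines. The sketch
below assumes results is a list of (language, normalized weight,
normalized variance) tuples and z_c is the chosen threshold; these names
are illustrative, not from the specification:

    import math

    def select_language(results, z_c=1.0):
        """Pick the least uncertain language among those tied with the maximum."""
        results = sorted(results, key=lambda r: r[1], reverse=True)
        _, w_max, v_max = results[0]
        # Languages statistically indistinguishable from the maximum.
        tied = [r for r in results
                if (w_max - r[1]) / math.sqrt(v_max + r[2]) < z_c]
        return min(tied, key=lambda r: r[2])[0]   # minimum variance wins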

[0195] The steps above are presented here for clarity purposes and are not
intended to limit the invention. Steps may be modified, combined, run in
parallel, or reordered in a variety of ways. This may be done in
particular for the purpose of creating efficient algorithms.

Language Identification on Classifier Combinations

[0196] The performance of language identification on text may be enhanced
by using multiple classifiers to classify the text, then combining the
results into a single set of outputs. In the previous section we showed
that the Pattern Classifier generalizes both the Word and Letter
Classifiers in the sense that a Pattern Classifier may reduce to a Word
Classifier or Letter Classifier when the patterns take particular forms.

[0197] In this section we assume that a set of n Pattern Classifiers is
used, and the output of the ith Pattern Classifier has normalized
weights Ω̂_il and normalized variances σ̂_il², where l is associated with
a particular language. Both Ω̂_il and σ̂_il² are matrices where one index
runs over the n Pattern Classifiers and the other index runs over the
available languages.

[0198] Combination Classifier

[0199] First, input text is identified for language classification. The
input text is presented to each of the Pattern Classifiers and the
results for each are obtained. This provides the raw data Ω̂_il and
σ̂_il² required for the Combination Classifier.

[0200] Second, a weight may be associated with each classifier pertaining
to the confidence the classifier has in its results. Let p_i be the
weight associated with the ith Pattern Classifier.

[0201] Preferably, this weight is based on the content of the input text
under consideration in light of testing performed on each Pattern
Classifier. For example, experience may lead us to believe that a Letter
Classifier is always about 95% accurate. Alternatively, we may find that
a Word Classifier is 50% accurate when the input text has fewer than 10
words, 75% accurate when the input text has between 10 and 50 words, and
99% accurate when the input text has 100 words or more. These general
accuracy measurements may be used as weights for the respective
classifiers.
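
As a concrete rendering of that example, a Word Classifier's weight
might be looked up from the input length. The break points below are the
ones given in the text; the behavior between 50 and 100 words is an
assumption, since the text does not specify it:

    def word_classifier_weight(num_words):
        """Map input length to an accuracy-based classifier weight."""
        if num_words < 10:
            return 0.50     # fewer than 10 words
        if num_words <= 50:
            return 0.75     # between 10 and 50 words
        return 0.99         # 100+ words per the text; 50-100 assumed similar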

[0202] Incorporating experience-based weighting for the Pattern
Classifiers helps to improve the overall performance of the Combination
Classifier. In this respect, the results of a Pattern Classifier that is
known to perform well in a certain situation may be weighted higher than
those of a Pattern Classifier that is known to perform worse under the
circumstances. Moreover, the weights may be adjusted over time based on
feedback to the system. This allows the Combination Classifier to learn
from experience and improve its performance over time without needing to
add additional Pattern Classifiers or modify the existing Pattern
Classifiers.

[0203] Alternatively, we may choose p_i = p_j for every i and j.
This choice effectively ignores the weight in the following steps.

[0204] Third, compute a combination weight for each language as follows:

Ω̄_l = (1/n) Σ_{i=1}^{n} p_i Ω̂_il

[0205] Fourth, compute a combination variance for each language as
follows:

σ̄_l² = (1/n²) Σ_{i=1}^{n} p_i² σ̂_il²

[0206] Fifth, identify the language with the maximum combination weight
Ω̄_l; denote this maximum weight Ω̄_Max. This is the presumptive language
choice for the input text.

[0207] Sixth, identify all languages B where

Z_MaxB = (Ω̄_Max - Ω̄_B) / √(σ̄_Max² + σ̄_B²) < Z_c

where Z_c is a critical z-score threshold value that determines when
two combination weights are considered statistically different.

[0208] Seventh, from the list of languages considered statistically
similar to the maximum, select the language where σ̄_l² has the minimum
value.
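
Steps three through seven can be combined into one routine. The sketch
below assumes omega[i][l] and var[i][l] hold the normalized weight and
variance of language l from Pattern Classifier i, and p[i] holds the
classifier weights; all names are illustrative assumptions:

    import math

    def combine(omega, var, p, z_c=1.0):
        """Combine n Pattern Classifier outputs and select a language."""
        n = len(omega)
        langs = omega[0].keys()
        # Combination weight and variance for each language.
        comb_w = {l: sum(p[i] * omega[i][l] for i in range(n)) / n
                  for l in langs}
        comb_v = {l: sum(p[i] ** 2 * var[i][l] for i in range(n)) / n ** 2
                  for l in langs}
        best = max(comb_w, key=comb_w.get)
        # Languages statistically similar to the maximum combination weight.
        tied = [l for l in langs
                if (comb_w[best] - comb_w[l])
                / math.sqrt(comb_v[best] + comb_v[l]) < z_c]
        return min(tied, key=lambda l: comb_v[l])   # least variance among tied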

[0209] Extensions

[0210] The above embodiments are presented using statistical analysis
often referred to as frequentist statistics. It should be appreciated
that these results may be extended to incorporate Bayesian statistics as
well.

[0211] It should be apparent from the foregoing that an invention having
significant advantages has been provided. While the invention is shown in
only a few of its forms, it is not limited to the embodiments shown, but
is susceptible to various changes and modifications without departing
from the spirit thereof.

Examples and Drawings

[0212] The aforementioned Word, Letter, and Pattern Classifiers may best
be understood through examples of preferred embodiments.

[0213] FIG. 1 shows a flowchart for the process of Data Preparation for
the Word Classifier. The process begins by identifying the training
documents to use with Data Preparation. Each document is preprocessed:
undesired characters are removed, the text is case folded, and the text
is parsed into words. The number of occurrences of each word is counted.
The total number of words is computed, and each count is divided by the
total number of words to compute the frequency of occurrence of each
word. The list of words is arranged according to frequency, and
optionally, a cutoff is applied. This results in a list of the most
common words for the language. Then each document is examined to identify
the location of each word on the common word list, and the immediate
predecessor or successor word is identified. If the predecessor/successor
is also on the list of common words, a count is incremented for the word
pair. This process is repeated for each language, resulting in a common
word and common pair list for each language.
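
A condensed sketch of this flow for a single language follows; the
preprocessing shown and the cutoff value are illustrative assumptions
rather than the specification's exact choices:

    import string
    from collections import Counter

    def common_words(documents, cutoff=1000):
        """Compute the most common words and their frequencies for one language."""
        strip = str.maketrans("", "", string.punctuation)
        counts = Counter()
        for doc in documents:
            # Case fold, remove punctuation, and parse into words.
            counts.update(doc.lower().translate(strip).split())
        total = sum(counts.values())
        freqs = {w: c / total for w, c in counts.items()}
        # Rank by frequency and apply the optional cutoff.
        ranked = sorted(freqs, key=freqs.get, reverse=True)[:cutoff]
        return {w: freqs[w] for w in ranked}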

[0214] Once this is completed, each pair of languages is processed by
identifying the common words in both languages. Based on this, the words
that are unique to each language are identified, as well as the words
that are common to both languages. For each word that is common to both
languages, the language allocation weights are computed. The pairings of
the word are examined in each language respectively. All words that are
paired with this word are identified. For the words paired to this word,
a count is made of the number of paired words that are exclusive to the
language versus the number of paired words that are in common to both
languages. Once the language weight allocations are computed, the
variances of the language weight allocations are computed. A
determination to assign the word to each language is made using geometry
in the allocation space. Based on this, the word may be assigned to one
of the languages, both, or neither.

[0215] This is repeated for each word common to both languages. Then the
process is repeated for each pair of languages. Finally, the entire
process may be repeated iteratively to achieve convergence of the common
word lists for each language. The Data Preparation process results in
creating common words files for each language under consideration.

[0216] FIG. 2 shows a flowchart for the process of Data Preparation for
the Letter Classifier. The process begins by identifying the training
documents to use with Data Preparation. Each document is preprocessed:
undesired characters are removed, the text is case folded, and the text
is parsed into letters. The number of occurrences of each letter is
counted. The total number of letters is computed, and each count is
divided by the total number of letters to compute the frequency of
occurrence of each letter. The list of letters is arranged according to
frequency, and optionally, a cutoff is applied. This results in a list of
the most common letters for the language. Then each document is examined
to identify the location of each letter on the common letter list, and
the immediate predecessor or successor letter is identified. If the
predecessor/successor is also on the list of common letters, a count is
incremented for the letter pair. This process is repeated for each
language, resulting in a common letter and common pair list for each
language.

[0217] Once this is completed, each pair of languages is processed by
identifying the common letters in both languages. Based on this, the
letters that are unique to each language are identified, as well as the
letters that are common to both languages. For each letter that is common
to both languages, the language allocation weights are computed. The
pairings of the letter are examined in each language respectively. All
letters that are paired with this letter are identified. For the letters
paired to this letter, a count is made of the number of paired letters
that are exclusive to the language versus the number of paired letters
that are in common to both languages. Once the language weight
allocations are computed, the variances of the language weight
allocations are computed. A determination to assign the letter to each
language is made using geometry in the allocation space. Based on this,
the letter may be assigned to one of the languages, both, or neither.

[0218] This is repeated for each letter common to both languages. Then the
process is repeated for each pair of languages. Finally, the entire
process may be repeated iteratively to achieve convergence of the common
letter lists for each language. The Data Preparation process results in
creating common letters files for each language under consideration.

[0219] FIG. 3 shows a flowchart for the process of Data Preparation for
the Pattern Classifier. The process begins by identifying the training
documents to use with Data Preparation. Each document is preprocessed:
undesired characters are removed, the text is case folded, and the text
is parsed into patterns. The number of occurrences of each pattern is
counted. The total number of patterns is computed, and each count is
divided by the total number of patterns to compute the frequency of
occurrence of each pattern. The list of patterns is arranged according to
frequency, and optionally, a cutoff is applied. This results in a list of
the most common patterns for the language. Then each document is examined
to identify the location of each pattern on the common pattern list, and
the immediate predecessor or successor pattern is identified. If the
predecessor/successor is also on the list of common patterns, a count is
incremented for the pattern pair. This process is repeated for each
language, resulting in a common pattern and common pair list for each
language.

[0220] Once this is completed, each pair of languages is processed by
identifying the common patterns in both languages. Based on this, the
patterns that are unique to each language are identified, as well as the
patterns that are common to both languages. For each pattern that is
common to both languages, the language allocation weights are computed.
The pairings of the pattern are examined in each language respectively.
All patterns that are paired with this pattern are identified. For the
patterns paired to this pattern, a count is made of the number of paired
patterns that are exclusive to the language versus the number of paired
patterns that are in common to both languages. Once the language weight
allocations are computed, the variances of the language weight
allocations are computed. A determination to assign the pattern to each
language is made using geometry in the allocation space. Based on this,
the pattern may be assigned to one of the languages, both, or neither.

[0221] This is repeated for each pattern common to both languages. Then
the process is repeated for each pair of languages. Finally, the entire
process may be repeated iteratively to achieve convergence of the common
pattern lists for each language. The Data Preparation process results in
creating common patterns files for each language under consideration.

[0222] FIG. 4 shows the process of applying the Word Classifier to input
text. First, the list of common words from the Word Classifier Data
Preparation phase is rank ordered according to frequency. Then a target
input text is identified for analysis. The input text is processed
similarly to the processing of the training documents for the Word
Classifier Data Preparation phase. Each normalized word in the input text
is compared to the list of common words for the Word Classifier. From
this, a weight is computed for each language under consideration. In
addition, the variances of the weights are also computed. The maximum
language weight is identified. Next, the z-score is computed for each
pair between the maximum language and each other language under
consideration. All languages that are statistically similar to the
maximum are identified. Among this set of languages, the language with
the smallest weight variance is selected.

[0223] FIG. 5 shows the process of applying the Letter Classifier to input
text. First, the list of common letters from the Letter Classifier Data
Preparation phase is rank ordered according to frequency. Then a target
input text is identified for analysis. The input text is processed
similarly to the processing of the training documents for the Letter
Classifier Data Preparation phase. Each normalized letter in the input
text is compared to the list of common letters for the Letter Classifier.
From this, a weight is computed for each language under consideration. In
addition, the variances of the weights are also computed. The maximum
language weight is identified. Next, the z-score is computed for each
pair between the maximum language and each other language under
consideration. All languages that are statistically similar to the
maximum are identified. Among this set of languages, the language with
the smallest weight variance is selected.

[0224] FIG. 6 shows the process of applying the Pattern Classifier to
input text. First, the list of common patterns from the Pattern
Classifier Data Preparation phase is rank ordered according to frequency.
Then a target input text is identified for analysis. The input text is
processed similarly to the processing of the training documents for the
Pattern Classifier Data Preparation phase. Each normalized pattern in the
input text is compared to the list of common patterns for the Pattern
Classifier. From this, a weight is computed for each language under
consideration. In addition, the variances of the weights are also
computed. The maximum language weight is identified. Next, the z-score is
computed for each pair between the maximum language and each other
language under consideration. All languages that are statistically
similar to the maximum are identified. Among this set of languages, the
language with the smallest weight variance is selected.

[0225] FIG. 7 shows the process of applying the Combination Classifier to
a plurality of Pattern Classifiers. Input text is identified for
classification. This text is presented to each of the Pattern
Classifiers. A Pattern Classifier weight is computed based on the input
text under consideration. With this and the output of each classifier, a
combination weight is computed for each language. The variance of each of
these combination weights is also computed. The maximum combination
weight is identified, along with all combination weights that are
statistically similar to the maximum. From this set of languages, the
language with the smallest combination weight variance is selected.

[0226] FIG. 8 illustrates a simple example of processing two languages.
Here, the languages have patterns such as words, letters, and word pairs.
The count of occurrence of each pattern is tallied for each language.
From this, a frequency for each pattern is computed by dividing the
respective count by the total number of counts. Furthermore, the patterns
that are exclusive to each language are determined, along with the
patterns that are common to both languages.

[0227] FIG. 9 shows tables that may result from examining the patterns
common to both languages from FIG. 8. Here, when examining training
documents that are presumptively English, the term `jacob` appears paired
with 1500 different patterns that are exclusively English, and 3000
different patterns that are common to both English and Spanish.
Similarly, when examining training documents that are presumptively
Spanish, the term `jacob` appears paired with 500 different terms that
are exclusively Spanish, and 100 terms that are common to both English
and Spanish. Similar results are shown for the term `a`. From this, the
relative frequency for the English and Spanish terms is computed by
dividing the results for each language by the total number of paired
words.

[0228] FIG. 10 shows a diagram of a simple threshold geometry for the
allocation of a term to a language. For each word, the relative frequency
in each language is computed and plotted as a point in this figure. If
the point lies in the `Spanish Only` region, the term is left on the list
for common words in Spanish, but removed from the list of common words in
English. Alternatively, if the point lies in the `English Only` region,
the term is left on the list for common words in English, but removed
from the list of common words in Spanish. If the point lies in the `Both`
region, the term is left on the list for common words for both English
and Spanish. Finally, if the point lies in the `Neither` region, the term
is removed from the list of common words for both English and Spanish.

[0229] FIG. 11 shows a diagram of a more complicated geometry for the
allocation of a term to a language. For each word, the relative frequency
in each language is computed and plotted as a point in this figure. If
the point lies in the `Spanish Only` region, the term is left on the list
for common words in Spanish, but removed from the list of common words in
English. Alternatively, if the point lies in the `English Only` region,
the term is left on the list for common words in English, but removed
from the list of common words in Spanish. If the point lies in the `Both`
region, the term is left on the list for common words for both English
and Spanish. Finally, if the point lies in the `Neither` region, the term
is removed from the list of common words for both English and Spanish.