Articles

(No.13) Language and Words

The word is a component of a sentence. Every language has words. That being the case, how many words are in the Japanese language? And what about English? Answering this simple question is not easy. Should we count words that appear only once? How should we treat a word borrowed from a foreign language, or one used temporarily or as a pun? How about dialects? If words of a dialect are used only once but carry an important message, they should not be readily discarded. English and Japanese are said to add 2,000 new words yearly. On the other hand, the French have tried to protect their language by attempting to authorize each word. Is it possible to continue this policy? New concepts require new expressions. Under the French restrictions, people would need to wait for authorization before using them.

Let us look at words another way. How can we define the number of words? Figure 1 shows the relationship between the frequency at which words appear and their rank. The curve never reaches zero, even as the size of the corpus increases without bound.

One of the most famous rules of word frequency is Zipf's law. (Zipf, "Selected Studies of the Principle of Relative Frequency in Language," Cambridge, MA; Harvard University Press, 1932) The law is expressed as

$$p(r) \cdot r^{\alpha} = \mathrm{Const} \tag{1}$$

where
$r$ is the rank of the word,
$\alpha$ is a constant that depends on the nature of the text, and
$p(r)$ is the frequency of the word with rank $r$.

Equation (1) can be rewritten as

$$p(r) = \mathrm{Const} \cdot r^{-\alpha}.$$

If we represent the total number of words by N, the above expression can be transformed as

$$N \cdot p(r) = \mathrm{Const} \cdot N / r^{\alpha}$$

which is the estimated frequency of the word of rank $r$ in an $N$-word corpus. This equation tells us that a word of any rank can be expected to appear when $N$ is large enough. However, Zipf's law is not accurate when the corpus is small. Zipf's law can be improved as follows:
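To make the role of $N$ concrete, the following is a minimal sketch of the expected count $N \cdot p(r)$ under Zipf's law. The function name, the default $\alpha = 1$, and the choice of Const (picked so that the probabilities sum to 1 over a finite vocabulary) are assumptions for illustration; the article does not specify them.

```python
def zipf_expected_counts(corpus_size, vocab_size, alpha=1.0):
    """Expected count N * p(r) for ranks 1..vocab_size under Zipf's law.

    Assumption: Const is chosen so that p(r) sums to 1 over a finite
    vocabulary of vocab_size ranks. This is an illustrative normalization.
    """
    weights = [r ** -alpha for r in range(1, vocab_size + 1)]
    const = 1.0 / sum(weights)
    return [corpus_size * const * w for w in weights]


# The expected count of the lowest-ranked word grows with the corpus size N.
for n in (10_000, 1_000_000, 100_000_000):
    counts = zipf_expected_counts(n, vocab_size=100_000)
    print(n, round(counts[0], 1), round(counts[-1], 4))
```

For a fixed rank, the expected count scales linearly with the corpus size, so a word that is effectively absent from a small corpus is expected to appear once the corpus is large enough.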

$$G = \frac{\log(N / L)}{\log(N) - C_1} \tag{2}$$

$$C_1 \approx 1.$$

Here,
$N$ is the total number of words in the corpus,
$L$ is the number of different words (vocabulary),
$G$ is a constant that depends on the characteristics of the corpus, and
$C_1$ is a constant. (Ejiri, et al.; "Proposal of a new constraint measure for text," Contributions to Quantitative Linguistics, (ed. R. Koehler and B. Rieger), 195-211, Kluwer Academic Publishers, 1993)
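As an illustration, equation (2) can be evaluated directly from a word count and a vocabulary size. The sketch below assumes base-10 logarithms and $C_1 = 1$; the function names and the crude lowercase tokenizer are hypothetical helpers, not part of the cited paper.

```python
import math
import re


def constraint_measure(total_words, vocabulary, c1=1.0):
    """Equation (2): G = log(N / L) / (log(N) - C1), with base-10 logs."""
    denominator = math.log10(total_words) - c1
    if denominator <= 0:
        raise ValueError("corpus too small: log10(N) must exceed C1")
    return math.log10(total_words / vocabulary) / denominator


def constraint_measure_of_text(text, c1=1.0):
    """Count N and L with a crude tokenizer (an assumption), then apply (2)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return constraint_measure(len(tokens), len(set(tokens)), c1)


# A corpus of 1,000,000 words containing 50,000 different words.
print(round(constraint_measure(1_000_000, 50_000), 3))
```

For this hypothetical corpus the value comes out at about 0.26. How words are tokenized changes both $N$ and $L$, so any real measurement depends on that choice.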

The constant $G$ depends on the content of the text, and its value decreases when the text content becomes richer. Figure 2 shows the value for various texts including English and Japanese. In most cases, $G$ falls between 0.1 and 0.5:

$$0.1 \le G \le 0.5 \tag{3}$$

Fig. 2 G values for various types of natural language text. The column to the left of G shows the name of the text.

Combining equations (2) and (3) gives

$$0.5 \log N + 0.5\,C_1 \le \log L \le 0.9 \log N + 0.1\,C_1$$

or

$$1.1 \log L - 0.11\,C_1 \le \log N \le 2 \log L - C_1.$$

The above two expressions are modified to

$$10^{0.5(\log N + C_1)} \le L \le 10^{(0.9 \log N + 0.1\,C_1)} \tag{4}$$

or

$$10^{(1.1 \log L - 0.11\,C_1)} \le N \le 10^{(2 \log L - C_1)} \tag{5}$$

by taking the logarithm to base 10. Equations (4) and (5) show that "vocabulary" and "word count" are interdependent, and that one can be estimated from the other. According to the original paper, the value of $G$ fluctuates depending on the nature of the target text. However, the value is stable within the same text; only a small part of the text is enough to calculate $G$.
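The bounds in equations (4) and (5) can be sketched numerically as follows. The function names are hypothetical, $C_1 = 1$ is assumed, and the code uses the exact coefficients $1/0.9$ and $0.1/0.9$ where the article rounds them to 1.1 and 0.11.

```python
import math


def vocabulary_bounds(total_words, c1=1.0, g_min=0.1, g_max=0.5):
    """Equation (4): range of vocabulary L for a corpus of N words.

    The largest G gives the smallest vocabulary (valid when log10(N) > C1).
    """
    log_n = math.log10(total_words)
    lower = 10 ** ((1 - g_max) * log_n + g_max * c1)
    upper = 10 ** ((1 - g_min) * log_n + g_min * c1)
    return lower, upper


def word_count_bounds(vocabulary, c1=1.0, g_min=0.1, g_max=0.5):
    """Equation (5): range of corpus size N that yields a vocabulary L."""
    log_l = math.log10(vocabulary)
    lower = 10 ** ((log_l - g_min * c1) / (1 - g_min))
    upper = 10 ** ((log_l - g_max * c1) / (1 - g_max))
    return lower, upper


low, high = vocabulary_bounds(1_000_000)
print(f"N = 1,000,000 -> L between {low:,.0f} and {high:,.0f}")
low, high = word_count_bounds(10_000)
print(f"L = 10,000 -> N between {low:,.0f} and {high:,.0f}")
```

Under these assumptions, a corpus of one million words would hold roughly 3,000 to 300,000 different words, which shows how wide the range in (3) is once it is taken out of the logarithm.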

By rearranging equation (2) with $C_1 \approx 1$, we have

$$\log L = (1 - G) \log N + G,$$

that is

$$L = 10^{\{(1 - G) \log N + G\}} \tag{6}$$

With a similar modification, the parameter $N$ is expressed as

$$N = 10^{(\log L - G) / (1 - G)} \tag{7}$$
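Equations (6) and (7) are inverses of each other for a fixed $G$, which a short sketch can confirm. The function names and the sample value $G = 0.3$ are assumptions chosen for illustration; $C_1 = 1$ and base-10 logarithms are taken from the text.

```python
import math


def vocab_from_count(total_words, g):
    """Equation (6): vocabulary L predicted from word count N (C1 = 1)."""
    return 10 ** ((1 - g) * math.log10(total_words) + g)


def count_from_vocab(vocabulary, g):
    """Equation (7): word count N predicted from vocabulary L (C1 = 1)."""
    return 10 ** ((math.log10(vocabulary) - g) / (1 - g))


# With G fixed, the predicted vocabulary keeps growing as the text grows.
for n in (10**4, 10**6, 10**8):
    print(n, round(vocab_from_count(n, g=0.3)))

# Round trip: applying (7) to the result of (6) recovers N.
print(round(count_from_vocab(vocab_from_count(10**6, g=0.3), g=0.3)))
```

Because the exponent $(1 - G)$ is positive for any $G$ below 1, the predicted vocabulary never levels off as $N$ increases.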

This result also tells us that "a larger text results in a larger vocabulary." Does it mean that a word dictionary is impossible?

For a word processor, a word dictionary is indispensable. To maintain a word dictionary, do we need to add 2,000 words annually? This is too much effort for every language dictionary. To avoid this burden, the idea of a probability-based word dictionary was proposed. In this dictionary, the word ABC is approximately expressed as p(A|B)p(C)+p(A)p(B|C), a combination of bi-gram probabilities. If the probability exceeds 0.5, then the word ABC may exist. With this method, only the bi-grams need to be maintained, regardless of word length. Another benefit of this probability dictionary is protection from illegal copying. An explicit dictionary is easily copied, because any word that passes the spelling check must be in the dictionary. Using this process, a spelling-check dictionary with exactly the same function has been reproduced (copied). This type of spelling-check dictionary was introduced to the market in the early nineties for Scandinavian languages.

Language and mathematics are thought to be far apart. But thanks to high-speed computers, highly intelligent processing has become practical. Language processing, which has long been dominated by the human brain, has opened its door to digital processing.