Abstract

The Japanese language, which is written in a mixture of four scripts, is said to have the most complex writing system in the world. Such factors as the lack of a standard orthography, the presence of numerous orthographic variants, and the morphological complexity of the language pose formidable challenges to the building of an intelligent Japanese search engine. This paper describes the linguistic issues that need to be addressed by advanced information retrieval technologies, focusing on cross-language and cross-synonym searching. Such important areas as cross-script and cross-orthographic searching and homophone expansion are dealt with in separate papers.

1. Cross-Synonym Searching

The words of a language form a closely-linked network of interdependent units. The meaning of a word or expression cannot really be understood unless its relationships with other closely related words are taken into account. For example, such words as kill, murder, and execute share the meaning of 'put to death', but they differ in usage and connotation.

The Japanese language has an extraordinarily rich stock of synonyms and synonymous expressions. This section presents a brief overview of Japanese synonymy, and demonstrates that the user can greatly benefit from an intelligent search engine capable of retrieving synonyms of the search term; that is, of performing cross-synonym searching or synonym expansion.

1.1 Overview of Japanese Synonymy

From the point of view of building an intelligent search engine, the abundance of Japanese synonyms poses some interesting challenges. Below is a brief introduction to this complex subject, with focus on the different types of sense relations between synonyms and other kinds of semantically related words.

Synonymy
A relation between a set of words that are similar (near-synonyms) or identical (absolute-synonyms) in meaning.

| Relation | English | Reading | Japanese |
| --- | --- | --- | --- |
| Shared concept | money | kane | 金 |
| Synonyms | currency | tsuuka | 通貨 |
| | cash | genkin | 現金 |
| | bank note | shihei | 紙幣 |

Hyponymy and Hyperonymy
A relation between a set of specific (subordinate) terms, called hyponyms, and a generic (superordinate) term, called the hyperonym. The hyperonym is more general and includes the senses of the hyponyms.

| Relation | English | Reading | Japanese |
| --- | --- | --- | --- |
| Hyperonym | sound | oto | 音 |
| Hyponyms | voice | koe | 声 |
| | echo | hankyoo | 反響 |
| | noise | sooon | 騒音 |

Meronymy
A relation between a set of subordinate words, called meronyms, whose meanings are in a partitive (part-of) relation to a more comprehensive concept, called a holonym.

| Relation | English | Reading | Japanese |
| --- | --- | --- | --- |
| Holonym | city | shi | 市 |
| Meronyms | ward | ku | 区 |
| | town section | choo | 町 |
| | town subsection | choome | 丁目 |

Complementarity
A relation between a set of words that contrast with each other and are mutually exclusive.

| Relation | English | Reading | Japanese |
| --- | --- | --- | --- |
| Shared concept | siblings | kyoodai | 兄弟 |
| Complementary terms | older brother | ani | 兄 |
| | younger brother | otooto | 弟 |
| | older sister | ane | 姉 |
| | younger sister | imooto | 妹 |

Antonymy
A relation between words, called antonyms, of opposite meanings, such as 清潔な seiketsu na 'clean' and 汚い kitanai 'dirty'. Antonyms are probably not of interest in information retrieval.
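
The sense relations above could be stored as typed links between words, so that a search engine can choose which relation types to use for expansion. The following is a minimal sketch, not the author's implementation: the `LexicalEntry` class, the relation labels, and the choice of expandable relations (synonyms, hyponyms, and meronyms, with antonyms left out) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical policy: relation types considered useful for query expansion.
# Antonyms are excluded, following the remark that they are probably not of
# interest in information retrieval.
EXPANDABLE = frozenset({"synonym", "hyponym", "meronym"})


@dataclass
class LexicalEntry:
    headword: str
    # Maps a relation label (e.g. "synonym") to the related words.
    relations: Dict[str, List[str]] = field(default_factory=dict)

    def expansion_terms(self, allowed=EXPANDABLE) -> List[str]:
        """Return the headword plus related words whose relation type is allowed."""
        terms = [self.headword]
        for relation, words in self.relations.items():
            if relation in allowed:
                terms.extend(words)
        return terms


# Sample data taken from the tables in this section.
kane = LexicalEntry("金", {"synonym": ["通貨", "現金", "紙幣"]})
oto = LexicalEntry("音", {"hyponym": ["声", "反響", "騒音"]})
seiketsu = LexicalEntry("清潔な", {"antonym": ["汚い"]})
```

Under this policy a query for 金 would also cover 通貨, 現金, and 紙幣, while a query for 清潔な would not be expanded to its antonym.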

1.2 The Semasiological Approach

Normally, the user of a dictionary starts out with a word or phrase and expects to find lexical information, such as a definition or a target language equivalent. Similarly, the user of a search engine starts out with a search term (keyword, phrase or Boolean expression) and expects to find cyberinformation, such as webpages, online databases and newsgroups relevant to the search term.

It is important to note that such a search operation has a well-defined direction: word-to-concept (lexeme-to-sense) or, in a search engine environment, keyword-to-cyberinformation. In lexicography, this way of searching is referred to as the semasiological approach. Clearly, this approach is based on the assumption that all the user wants is information including the specific search term provided in the search box.

As any search engine user knows, this is often not the case. Let us assume that a user wants to search for information on Kennedy's assassination. In Alta Vista, she might enter the string "+Kennedy +assassination." But surely this query will not retrieve such phrases as:

"Kennedy was killed on ..."

"The murder of Kennedy was ..."

"JFK had to be eliminated because ..."

To locate such phrases with conventional search engines, the user must resort to the laborious task of building advanced Boolean queries, then spend much time wading through often irrelevant results.
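
The limitation can be made concrete with a small sketch of literal keyword matching. It assumes a simplified engine that requires every query term to appear in the text as a case-insensitive substring; the function name `literal_match` is hypothetical, and real engines of course index and rank documents in far more elaborate ways.

```python
def literal_match(query_terms, text):
    """Return True only if every query term occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in query_terms)


# The query "+Kennedy +assassination" as a list of required terms.
query = ["Kennedy", "assassination"]

# The three phrases quoted above.
phrases = [
    "Kennedy was killed on ...",
    "The murder of Kennedy was ...",
    "JFK had to be eliminated because ...",
]

# Phrases that a purely literal search fails to retrieve.
missed = [p for p in phrases if not literal_match(query, p)]
```

In this sketch all three phrases end up in `missed`, because none of them contains the literal string "assassination".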

1.3 The Onomasiological Approach

From the point of view of the user interested in the semantic content of the search results, rather than in their orthographic representation, the semasiological approach is clearly inadequate. When such a user searches for the keyword "Kennedy," surely she is interested in the referent represented by "John Kennedy", "JFK," or "President Kennedy", not just in the lexicalized manifestation of any particular synonym. Similarly, when searching for "assassination," surely the user is interested in finding information on the concept [cause to die], not just in finding any particular phrase such as "the murder of", "was killed by" and "The killing of."

The opposite of the semasiological approach is the onomasiological approach, which reverses the normal semantic paradigm (also known as the onomantic perspective). There is a long tradition of lexicographic works based on this approach, the best-known examples of which are thesauri and synonym dictionaries. These works make it possible to reverse the normal search direction; that is, instead of from word-to-concept, the user can search from concept-to-word.

1.4 Intelligent Synonym Searching

Though the usefulness of the onomasiological approach to dictionary consultation is indisputable, it has not yet become established in search engine information retrieval. The search strategy proposed here, based on the onomasiological approach, is called synonym expansion or cross-synonym searching. In a sense, the thematic search and topic search technologies currently implemented in web subject directories are also based on the onomasiological approach. But, as any search engine user knows, wading through multilevel hierarchies of subject directories is a time-consuming strategy that is too inefficient to be practical.

How does cross-synonym searching work? Obviously, the user still has to enter a search term, consisting of keywords, but with an important difference. That is, the user need not be overly concerned with the specific wording of the query. A query consisting of any expression like "+kill +Kennedy", "JFK's assassination", or "The murder of John Kennedy" is expanded into the full set of synonyms and leads to the same or very similar search results.
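
A minimal sketch of this idea follows. It is an illustration under stated assumptions rather than a description of any existing engine: the concept labels, the contents of `SYNONYM_GROUPS`, and the function names are hypothetical, and matching is done by crude case-insensitive substring tests instead of proper morphological analysis.

```python
# Hypothetical synonym groups keyed by a concept label.
SYNONYM_GROUPS = {
    "[cause to die]": ["kill", "murder", "assassinate", "assassination", "eliminate"],
    "[John F. Kennedy]": ["Kennedy", "John Kennedy", "JFK", "President Kennedy"],
}


def build_index(groups):
    """Map each lowercased term to the concept label of its group."""
    return {term.lower(): concept
            for concept, terms in groups.items()
            for term in terms}


def expand_query(keywords, groups=SYNONYM_GROUPS):
    """Replace each keyword with the full synonym set of its concept."""
    index = build_index(groups)
    expanded = []
    for keyword in keywords:
        concept = index.get(keyword.lower())
        # Keywords that belong to no group are kept unchanged.
        expanded.append(set(groups[concept]) if concept else {keyword})
    return expanded


def matches(expanded, text):
    """True if the text contains at least one term from every expanded group."""
    lowered = text.lower()
    return all(any(term.lower() in lowered for term in group)
               for group in expanded)


# Differently worded queries expand to the same sets of terms.
same = expand_query(["kill", "Kennedy"]) == expand_query(["assassination", "JFK"])
```

Because each keyword is mapped to its concept before matching, the wording of the query no longer determines which documents are found. The substring test is deliberately simple ("kill" also matches "killed"); a production system would rely on a morphological analyzer instead.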

To implement such technology, a comprehensive database of synonyms is required. A typical (partial) entry in such a database might look like this:

Concept: [to cause to die]

| English | Reading | Japanese |
| --- | --- | --- |
| to kill | korosu | 殺す |
| to commit murder | satsujin o okasu | 殺人を犯す |
| to execute | shokei suru | 処刑する |
| to murder | satsugai suru | 殺害する |
| to shoot to death | shasatsu suru | 射殺する |
| to assassinate | ansatsu suru | 暗殺する |
| to bump off | yaru | やる, 殺る |
| to butcher | barasu | ばらす |

Semantically-classified databases like the above are useful not only for cross-synonym searching, but also in such increasingly important web technologies as the automated categorization of web resources and automatic query expansion (AQE). For cross-synonym searching to be truly effective, it should be combined with cross-orthographic searching and some of the other retrieval technologies described in this paper, as well as such technologies as query expansion with relevance feedback.
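
The sample entry for the concept [to cause to die] could be represented in code roughly as follows. This is a sketch under assumptions: the `Synonym` record, the `CONCEPTS` dictionary, and both lookup functions are hypothetical names, and the actual structure of the author's database is not described in this paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Synonym:
    english: str
    reading: str
    forms: Tuple[str, ...]  # Japanese written forms (one or more)


# Hypothetical in-memory database: concept label -> synonym records.
CONCEPTS: Dict[str, List[Synonym]] = {
    "[to cause to die]": [
        Synonym("to kill", "korosu", ("殺す",)),
        Synonym("to commit murder", "satsujin o okasu", ("殺人を犯す",)),
        Synonym("to execute", "shokei suru", ("処刑する",)),
        Synonym("to murder", "satsugai suru", ("殺害する",)),
        Synonym("to shoot to death", "shasatsu suru", ("射殺する",)),
        Synonym("to assassinate", "ansatsu suru", ("暗殺する",)),
        Synonym("to bump off", "yaru", ("やる", "殺る")),
        Synonym("to butcher", "barasu", ("ばらす",)),
    ],
}


def japanese_forms(concept: str) -> List[str]:
    """Concept-to-word lookup: all written forms listed under a concept."""
    return [form for syn in CONCEPTS.get(concept, []) for form in syn.forms]


def concept_of(form: str) -> Optional[str]:
    """Word-to-concept lookup: the concept under which a written form is listed."""
    for concept, synonyms in CONCEPTS.items():
        if any(form in syn.forms for syn in synonyms):
            return concept
    return None
```

A search for 暗殺する could first call `concept_of` to find the concept and then `japanese_forms` to obtain every form to search for. Keeping the reading and the English gloss in the same record means the same data could also serve cross-language lookup.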

2. Cross-Language Searching

Non-Japanese users, such as learners, and even native speakers, can greatly benefit from English-Japanese cross-language searching; that is, inputting an English query to retrieve webpages that include the equivalent word(s) in Japanese, as shown in the table below:

Cross-Language Searching

| Search Term | Search Results | Reading |
| --- | --- | --- |
| Japanese economy | 日本(の)経済 | Nihon (no) keizai |
| Tokyo | 東京 | Tookyoo |
| happy | 幸福 幸せ | koofuku shiawase |
| NEC | 日本電気 ＮＥＣ | Nihon Denki en-ii-shii |

Cross-language searching has the additional benefit of enabling users without a Japanese input method editor (IME) to retrieve Japanese webpages. This is especially useful when searching for katakana words from the corresponding English words. Since Japanese has countless katakana loanwords derived from English, many of which are of variable orthography, even users with a Japanese IME and native speakers may find it more convenient to input English keywords and have the search engine retrieve all katakana and Latin alphabet variants, as shown in the table below:

English to Katakana Conversion

| Search keyword | Search results |
| --- | --- |
| computer | コンピュータ コンピューター |
| WWW | ワールドワイドウェブ ウェブ ＷＷＷ |
| Diesel | ディーゼル ジーゼル |

Cross-language searching, also known as cross-language information retrieval (CLIR), is a new research area that is becoming increasingly important as the World Wide Web undergoes rapid internationalization. The technical details of this are discussed in an article by Douglas W. Oard. Here, we will only mention that such technology requires access to a comprehensive English-Japanese lexicon designed to meet the needs of the search engine environment.
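
As a rough illustration of how such a lexicon might be used, the sketch below maps an English query to its Japanese equivalents and then searches for any of them. Everything here is an assumption made for the example: the `LEXICON` dictionary holds only the sample items from the two tables above, the optional の in 日本(の)経済 is written out as two separate forms, and `translate_query` and `search` are hypothetical helpers that use plain substring matching.

```python
# Hypothetical English-Japanese lexicon, keyed by lowercased English term.
LEXICON = {
    "japanese economy": ["日本経済", "日本の経済"],
    "tokyo": ["東京"],
    "happy": ["幸福", "幸せ"],
    "nec": ["日本電気", "ＮＥＣ"],
    "computer": ["コンピュータ", "コンピューター"],
    "www": ["ワールドワイドウェブ", "ウェブ", "ＷＷＷ"],
    "diesel": ["ディーゼル", "ジーゼル"],
}


def translate_query(query):
    """Return the Japanese equivalents of an English query.

    Queries not found in the lexicon are returned unchanged.
    """
    return LEXICON.get(query.strip().lower(), [query])


def search(query, documents):
    """Return the documents that contain any Japanese equivalent of the query."""
    targets = translate_query(query)
    return [doc for doc in documents if any(t in doc for t in targets)]


# Tiny sample collection for demonstration purposes.
documents = ["コンピューターの歴史", "東京の天気", "ジーゼルエンジンの仕組み"]
hits = search("Diesel", documents)
```

Here the English keyword "Diesel" retrieves a page that uses the variant ジーゼル, even though the user typed no Japanese at all. A real system would also need word segmentation and ranking, which this sketch leaves out.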

3. Lexical Databases

Because of the morphological complexity and highly irregular orthography of Japanese, developing the advanced retrieval technologies required for intelligent Japanese searching cannot be based on algorithmic and statistical methods alone. To be effective, such methods must be supplemented by large-scale, up-to-date lexical databases designed to meet the specific needs of search engine applications.

The CJK Dictionary Institute (CJKI), which specializes in CJK computational lexicography, is engaged in the continuous expansion of a comprehensive CJK lexical database called DESK (more information below). Currently, DESK has over two million Japanese and one million Chinese entries, and includes a rich set of grammatical and semantic attributes required for developing information retrieval applications, input method editors, and electronic dictionaries.

Below is a brief description of the principal database components useful for developing intelligent Japanese search engines:

General Vocabulary. A comprehensive database of about 450,000 entries covering general vocabulary. The rich set of grammatical attributes is fine-tuned to support search engine applications, especially morphological analyzers and word segmenters.

Katakana Loanwords. About 50,000 loanwords and other Japanese words written in katakana, with special focus on computer and Internet terminology.

Japanese Names. About 600,000 Japanese (and Chinese) personal and place names semantically classified and ranked by frequency.

Western Names. An English-Japanese database of about 60,000 non-Japanese personal and place names, semantically classified and accompanied by English equivalents.

Japanese Companies. About 600,000 Japanese company and organization names ranked by frequency with English equivalents when appropriate.

Orthographic Variants. A database of about 60,000 orthographic variants, with full coverage of okurigana, kanji, and kana variants, designed to support cross-script and cross-orthographic searching.

Homograph Groups. A database of about 34,000 homographs designed to support homograph disambiguation.

Synonym Groups. A database of semantically classified synonym groups consisting of kanji synonyms, homonyms and meronyms serving as a basis for a Japanese thesaurus designed to support cross-synonym searching.

English-Japanese Dictionary. An English-Japanese lexical database of over 100,000 entries covering general vocabulary and important proper names. This can be expanded to cover Western names and technical terms.

Born in Germany in 1946, Jack Halpern has lived in six countries and knows twelve languages. Fascinated by kanji while living in an Israeli kibbutz, he came to Japan in 1973, where he spent sixteen years compiling the New Japanese-English Character Dictionary. He is a professional lexicographer and writer who lectures widely on Japanese culture, a first-prize winner in the International Speech Contest in Japanese, and the founder of the International Unicycling Federation.

Jack Halpern is currently the editor-in-chief of the Kanji Dictionary Publishing Society (KDPS), a non-profit organization that specializes in compiling kanji dictionaries, and the head of The CJK Dictionary Institute (CJKI), which specializes in CJK lexicography and the development of a comprehensive CJK database (DESK). He has also compiled the world’s first Unicode dictionary of CJK characters.

The principal activity of the CJKI is the development and continuous expansion of a comprehensive database that covers every aspect of how Chinese characters are used in CJK languages, including Cantonese. Advanced computational lexicography methodology has been used to compile and maintain a Unicode-based database that is serving as a source of data for:

DESK currently has over two million Japanese and about 2.5 million Chinese items, including detailed grammatical, phonological and semantic attributes for general vocabulary, technical terms, and hundreds of thousands of proper nouns. The single-character database covers every aspect of CJK characters, including frequency, phonology, radicals, character codes, and other attributes. See http://www.cjk.org/cjk/samples/ for a list of data resources.

The CJKI has become one of the world’s prime resources for CJK dictionary data, and is contributing to CJK information processing technology by providing software developers with high-quality lexical resources, as well as through its ongoing research activities and consulting services.