Abstract

The Japanese language, which is written in a mixture of four scripts, is said to have the most complex writing system in the world. Such factors as the lack of a standard orthography, the presence of numerous orthographic variants, and the morphological complexity of the language pose formidable challenges to the building of an intelligent Japanese search engine. This paper describes the linguistic issues that need to be addressed by advanced information retrieval technologies such as cross-language, cross-script, cross-orthographic, and cross-synonym searching, and demonstrates that lexical databases must play a central role in their implementation.

1. Introduction

1.1 Brief Outline of Japanese Writing System

It is often said that Japanese has the most complex writing system in the world. As we shall see below, this claim is fully justified. Contemporary Japanese is written in a mixture of four scripts, each of which has a distinct function.

Thousands of logographic characters, called 漢字 kanji, derived from Chinese.

A native syllabic script called 平仮名 hiragana.

Another native syllabic script called 片仮名 katakana.

Recently the Latin alphabet, called ローマ字 roomaji, has become increasingly common.

Kanji are used to write the core of the Japanese vocabulary. This includes words of Chinese origin, words coined in Japan on the Chinese model, such as 山脈 sanmyaku 'mountain range', as well as native Japanese words, such as 山 yama 'mountain'. Kanji have three basic properties: form, sound, and meaning. Each character may be pronounced according to several distinct pronunciations, called readings. A character may have one or several Chinese derived on readings, or one or several (sometimes dozens) native Japanese kun readings, and each reading may have numerous meanings associated with it.

Hiragana is used mostly to write grammatical elements, such as inflectional verb endings, and sometimes for writing native Japanese words. For example, in 見た mita the kanji 見 represents the stem of the verb 見る miru 'see' and た ta is a verb ending for forming the past tense. The hiragana endings attached to a kanji stem are called 送り仮名 okurigana.

Katakana is used mostly to write Western loanwords, such as プリンター purintaa 'printer', and onomatopoeic words, such as カチッと kachitto 'with a click'. The Latin alphabet is used for writing acronyms, for some loanwords instead of katakana, and for stylistic effects, especially in the names of shops and magazines.

A running Japanese text normally consists of a mixture of kanji and kana, as shown below:

In the above sentence, case particles such as を o (object marker), as well as verb endings (-わせる -waseru in 組み合わせる kumiawaseru 'combine'), are written in hiragana, whereas nouns, such as 熟語 jukugo 'compound word', are written in kanji.

1.2 Intelligent Japanese Searching

Several factors contribute to the difficulties of Japanese information retrieval and query processing. To build a truly sophisticated, "intelligent" Japanese search engine, various challenges must be overcome. Here are some of the major issues:

The lack of a standard, universally accepted, orthography; that is, the presence of a large number of orthographic variants and easily confused homophones.

Miscellaneous technical requirements such as transcoding between multiple character sets and encodings, support for Unicode, and input method editors.

Each of the above is a major issue that deserves a paper in its own right. In this paper, we will focus on one of the central linguistic issues; that is,

The lack of a standard, universally accepted, orthography.

To our knowledge, few if any search engines have addressed orthographic variation, nor any of the other linguistic issues described in this paper. Let us take a very quick look at the current state of search engine technology.

The earliest search engines, such as AltaVista, Yahoo!, and Excite, are often referred to as first-generation search engines. Such an engine searches its index for the search term entered by the user, then generates a page with a list of relevant, and often irrelevant, links ranked by frequency of occurrence of the search term.

Second-generation search engines, such as Direct Hit and Northern Light, take a more intelligent approach to ensure relevancy by ordering the search results by various criteria such as popularity, semantic categories, link frequency, and page ranks. An excellent example of this is Google, which ranks results by the number of links from pages which themselves have a high rank.

None of these engines, including the very few that claim to be third-generation, supports more than a bare minimum of computational linguistic features. Here, we will define the direction of third-generation search engines by focusing on the future of Japanese search and retrieval technology, especially such advanced linguistic technologies as cross-language, cross-script, cross-orthographic, and cross-synonym searching (described below).

1.3 The Big Picture

Superficially, it would seem as if search engines need only search for the actual keywords provided by the user. In fact, from personal discussions with the executives of several leading search engine companies, it is clear that they deliberately follow a policy of "not to cast a wide net", so that searching for "travelled to Britain" will not match "traveled to Britain", not to speak of "travel to the U.K."

Ostensibly, the justification for such a policy is to prevent flooding the user with irrelevant results. The real reason, no doubt, is that they do not possess the technology for linguistically sophisticated searching. What such a policy often does achieve is the proverbial "throwing the baby out with the bathwater", since many relevant results are indiscriminately ignored along with the irrelevant ones.

Now let us step back and have a closer look at the big picture, strictly from the user's point of view. That is, let us pose the most relevant question of all:

What, exactly, is it that a search engine user really wants?

In this paper we will demonstrate that, especially in the case of Japanese, the user is far more interested in the inherent meaning (semantic content) represented by the search term, rather than in the accidental form (written representation) of any of its orthographic variants.

In many of the major languages of the world, orthographic variation is not a major issue, since their orthographies tend to be stable. Though English is notorious for its spelling irregularities, spelling variants (such as 'judgement' vs. 'judgment', 'wordprocessor' vs. 'word processor') are more of an annoyance than a major obstacle. For the most part, users can expect the orthographic representation of the search term to have little or no variation.

Not so with Japanese. Japanese orthography is so highly irregular that it can be considered, without the slightest fear of being accused of hyperbole, to be a couple of orders of magnitude more complex and more irregular than any other major language, Chinese included (Simplified Chinese has a remarkably stable orthography).

Should the user of a Japanese search engine be required to be intimately familiar with these complexities? Obviously not. Ideally, users should only be concerned with quickly finding the information they are seeking, not with the intricacies of the Japanese writing system.

That, in a nutshell, is where the real power of an "intelligent" Japanese search engine comes in. It relieves the user of the burden of dealing with the details of how the search term should be written, and lets her focus on the real issue at hand: defining the content of the information to be retrieved.

2. Cross-Orthographic Searching

This section presents a brief overview of Japanese orthographic variation, focusing on those issues most relevant to information retrieval. The highly irregular Japanese orthography is a major obstacle to efficient searching. Our aim is to describe the complexities of the Japanese orthography, and to demonstrate that the intelligent search engine should be capable of retrieving all the orthographic variants of the search term; in other words, of performing cross-orthographic searching.

2.1 Orthographic Chaos?

The Japanese orthography is highly unstable, bordering on the chaotic.
A major factor that contributes to this state of affairs is the complex interaction of the four scripts used to write Japanese, resulting in countless words that can be written in a variety of often unpredictable ways.

Study the table below, which shows the orthographic variants of the words 取り扱い toriatsukai 'handling' and 当たり外れ atarihazure 'hit or miss'.

Some Orthographical Variants

| toriatsukai | atarihazure | Type of variant |
|---|---|---|
| 取り扱い | 当たり外れ | "standard" form |
| 取扱い | 当り外れ | okurigana variant |
| | 当外れ | okurigana variant |
| 取扱 | 当外 | all kanji |
| とり扱い | 当たりはずれ | replace kanji with hiragana |
| 取りあつかい | あたり外れ | replace kanji with hiragana |
| とりあつかい | あたりはずれ | all hiragana |

It is important to note that these variants are not contrived examples for the sake of illustration. All the above forms do occur in contemporary Japanese, though some are less frequent than others. Even in carefully edited publications, not to speak of sloppily written webpages, there is no reliable way to predict which particular variant will occur, as this often depends on the whim of the author or editorial policy.

How does such orthographic variation affect the search engine user? Let us examine this issue a bit more in depth, looking at it entirely from the user's point of view. Let us say that a user is looking for a novel called Hi no sasanai yashiki (A Mansion with no Sunshine). Here are twelve legitimate ways (some more likely than others) of writing this phrase:

日の差さない屋敷

日の射さない屋敷

日のささない屋敷

日の射さない邸

日の差さない邸

日のささない邸

陽の射さない屋敷

陽の差さない屋敷

陽のささない屋敷

陽の射さない邸

陽の差さない邸

陽のささない邸

We conducted a quick survey of six native Japanese speakers, some of whom are professional translators and writers, asking them how they would write the above phrase. Surprisingly (or not surprisingly), we received six different answers, none of which matched the "standard" form found in dictionaries (the first form listed above). Clearly, users of Japanese search engines cannot possibly be expected to know which specific variant is used in the official title of the book.

Now let us say that the user is searching for the Japanese equivalent of the proverbial "A hen that lays golden eggs." Theoretically, the "standard" way of writing this is:

金の卵を産む鶏 Kin no tamago wo umu niwatori

In reality, three of the keywords above have the following common variants:

| English | Reading | "Standard" Form | Variant 1 | Variant 2 | Variant 3 |
|---|---|---|---|---|---|
| egg | tamago | 卵 | 玉子 | たまご | タマゴ |
| hen | niwatori | 鶏 | にわとり | ニワトリ | |
| to lay | umu | 産む | 生む | | |

Combining the permutations for the above three words yields 24 possible ways of writing the original phrase, as shown in the table below:

"Kin no tamago wo umu niwatori" 'A hen that lays golden eggs'

| 鶏 | にわとり | ニワトリ |
|---|---|---|
| 1. 金の卵を産む鶏 | 9. 金の卵を産むにわとり | 17. 金の卵を産むニワトリ |
| 2. 金の卵を生む鶏 | 10. 金の卵を生むにわとり | 18. 金の卵を生むニワトリ |
| 3. 金の玉子を産む鶏 | 11. 金の玉子を産むにわとり | 19. 金の玉子を産むニワトリ |
| 4. 金の玉子を生む鶏 | 12. 金の玉子を生むにわとり | 20. 金の玉子を生むニワトリ |
| 5. 金のたまごを産む鶏 | 13. 金のたまごを産むにわとり | 21. 金のたまごを産むニワトリ |
| 6. 金のたまごを生む鶏 | 14. 金のたまごを生むにわとり | 22. 金のたまごを生むニワトリ |
| 7. 金のタマゴを産む鶏 | 15. 金のタマゴを産むにわとり | 23. 金のタマゴを産むニワトリ |
| 8. 金のタマゴを生む鶏 | 16. 金のタマゴを生むにわとり | 24. 金のタマゴを生むニワトリ |

Again, rest assured that this is not a contrived example designed to make a point. Not only does each of the variants for tamago, niwatori and umu occur frequently in written Japanese; all of them actually occur within the phrase in question, as can be easily verified by searching with your favorite search engine. Clearly, the user has no hope of finding all these variants unless the search engine can perform sophisticated cross-orthographic searching, as discussed below.
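The figure of 24 is simply the product of the number of spellings for each keyword (4 x 2 x 3). As a minimal sketch, assuming only the variant lists given in the tables above, the full set can be enumerated with a Cartesian product. The names TAMAGO, UMU, NIWATORI and phrase_variants are illustrative and do not belong to any existing system.

```python
from itertools import product

# Variant spellings copied from the table of common variants above.
TAMAGO = ["卵", "玉子", "たまご", "タマゴ"]    # 'egg'
UMU = ["産む", "生む"]                          # 'to lay'
NIWATORI = ["鶏", "にわとり", "ニワトリ"]        # 'hen'


def phrase_variants():
    """Return every spelling of 'kin no tamago wo umu niwatori'."""
    # Iterate hen -> egg -> verb so the order follows the numbering in the table.
    return [
        f"金の{egg}を{lay}{hen}"
        for hen, egg, lay in product(NIWATORI, TAMAGO, UMU)
    ]


variants = phrase_variants()
print(len(variants))  # 4 * 2 * 3 = 24
```

A search engine could, in principle, issue all 24 strings as alternatives, but the variant lists themselves still have to come from a lexical resource.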

2.2 Okurigana Variants

A major factor that exacerbates the difficulties of Japanese information retrieval is the orthographic variation that occurs in kana endings, called 送り仮名 okurigana, that are attached to a kanji base or stem, as shown in the table below:

Okurigana Variants

| English | Reading | "Standard" Form | Variant 1 | Variant 2 | Variant 3 |
|---|---|---|---|---|---|
| publish | kakiarawasu | 書き表す | 書き表わす | 書表わす | 書表す |
| perform | okonau | 行う | 行なう | | |
| Tokyo-bound | Tookyoo-yuki | 東京行き | 東京行 | | |

Okurigana variants are very common. Though the Japanese government publishes guidelines, actual usage is unpredictable and depends on editorial policy or personal preference. The intelligent Japanese search engine should be able to retrieve all such variants, regardless of how the search term is input.
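Because okurigana usage cannot be predicted by rule, one plausible approach is a lookup table of variant groups. The sketch below assumes such a table exists and seeds it with the three examples above; VARIANT_GROUPS, VARIANT_INDEX and expand_okurigana are hypothetical names, and a real system would load the groups from a lexical database.

```python
# Hypothetical variant groups, seeded with the okurigana examples above.
# The first entry in each group is the "standard" form.
VARIANT_GROUPS = [
    ["書き表す", "書き表わす", "書表わす", "書表す"],
    ["行う", "行なう"],
    ["東京行き", "東京行"],
]

# Index every spelling to the group that contains it.
VARIANT_INDEX = {form: group for group in VARIANT_GROUPS for form in group}


def expand_okurigana(term):
    """Return all known okurigana variants of term, or [term] if unknown."""
    return list(VARIANT_INDEX.get(term, [term]))
```

With this index, a query entered in any one spelling expands to the whole group, so the result does not depend on which variant the user happened to type.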

2.3 Cross-script Searching

As explained in Section 1, Japanese is written in a mixture of four different scripts. Orthographic variation across scripts is a common occurrence, so that the same word can be written in hiragana, katakana, kanji, or even in the Latin alphabet. To compound these difficulties, the same word is sometimes written in two scripts, such as hiragana and kanji.

Study the table below carefully; it shows all the major cross-script variation patterns that occur in Japanese:

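Cross-Script Orthographic Variation

| No. | English | Reading | Kanji | Hiragana | Katakana | Latin | Hybrid |
|---|---|---|---|---|---|---|---|
| 1 | many people | oozei | 大勢 | おおぜい | | | |
| 2 | say | iu (yuu) | 言う | いう | | | |
| 3 | sulfur | ioo | 硫黄 | | イオウ | | |
| 4 | cat | neko | 猫 | ねこ | ネコ | | |
| 5 | kilo | kiroguramu | | | キログラム | kg | |
| 6 | shirt | waishatsu | | | ワイシャツ | | Ｙシャツ |
| 7 | skin | hifu | 皮膚 | | ヒフ | | 皮フ |
| 8 | comet | suisei | 彗星 | | | | すい星 |
| 9 | glittering | pikapika | | ぴかぴか | ピカピカ | | |
| 10 | open | oopun | | | オープン | open, OPEN | |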

The above table shows that almost any combination of scripts can occur: kanji vs. hiragana, hiragana vs. katakana, Latin vs. katakana, etc. Cross-script variation is as common as it is unpredictable. From an information retrieval point of view, what is particularly irksome is the recent tendency to write many common kun words (like #2 above), and even on words (like #1 above), in hiragana instead of kanji, based on the widespread misconception that hiragana is "easier" to read.

When inputting a keyword, the user cannot be expected to be aware of these script differences. It goes without saying that the intelligent search engine should be capable of performing cross-script searching; that is, should be able to retrieve all such variants, regardless of the script the keyword is provided in.
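The hiragana-katakana part of cross-script searching can be handled mechanically, because the two syllabaries are laid out in parallel in Unicode (hiragana U+3041 to U+3096, katakana U+30A1 to U+30F6, 0x60 code points apart). The following is a minimal sketch under that assumption; the function names are illustrative. Variation between kanji and kana, or between katakana and Latin, cannot be derived this way and still requires a lexicon.

```python
# Hiragana U+3041..U+3096 maps onto katakana U+30A1..U+30F6 by adding 0x60.
HIRA_TO_KATA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}
KATA_TO_HIRA = {cp + 0x60: cp for cp in range(0x3041, 0x3097)}


def to_katakana(text):
    """Convert hiragana to katakana; other characters pass through unchanged."""
    return text.translate(HIRA_TO_KATA)


def to_hiragana(text):
    """Convert katakana to hiragana; other characters pass through unchanged."""
    return text.translate(KATA_TO_HIRA)


def kana_variants(term):
    """Return term plus its hiragana and katakana renderings, de-duplicated."""
    seen = []
    for form in (term, to_hiragana(term), to_katakana(term)):
        if form not in seen:
            seen.append(form)
    return seen
```

For example, kana_variants("ねこ") yields both ねこ and ネコ, while a kanji-only term such as 猫 is returned unchanged.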

2.4 Kanji Variants

Though written Japanese underwent major reforms in the postwar period, resulting in the simplification and standardization of character forms, there is nevertheless a significant number of character form variants in common use, especially in proper names. Classical Japanese literature and religious texts such as the Buddhist scriptures are written almost exclusively in the traditional old forms.

Kanji Variants and Traditional Forms

| English | Reading | Standard | Variant | Comment |
|---|---|---|---|---|
| largely | oohaba ni | 大幅に | 大巾に | abbreviated form |
| 10 years old | jussai | 十才 | 十歳 | variant form |
| proper name | Nakajima | 中島 | 中嶋 | variant form |
| development | hattatsu | 発達 | 發達 | traditional form |

Since the use of variant forms is not uncommon, the intelligent search engine should be able to retrieve all such forms. For searching classical texts and religious scriptures, retrieving the traditional forms based on the simplified forms is especially important.
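Where a variant character always corresponds to one standard character, both the index and the query can be folded to the standard form before matching. The sketch below assumes a small character map seeded from two rows of the table above; KANJI_FOLD and fold_kanji are hypothetical names. Context-dependent cases such as 巾 for 幅 are deliberately left out, since they would need word-level entries rather than a character map.

```python
# Hypothetical variant-to-standard character map, seeded from the table above.
# A fuller map would come from a lexical database of character variants.
KANJI_FOLD = str.maketrans({"發": "発", "嶋": "島"})


def fold_kanji(text):
    """Replace known variant or traditional characters with standard forms."""
    return text.translate(KANJI_FOLD)
```

Folding in this direction means a query for 発達 also matches text written 發達, provided the indexed text was folded the same way.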

2.5 Phonetic Substitutes

Japanese has numerous orthographic variants based on the principle of phonetic substitution. For example, 盲 is interchangeable with 妄 in such compounds as 妄想 (=盲想) moosoo 'wild idea', but not in 盲従 moojuu 'blind obedience'. One such variant, in this case 妄, is a phonetically replaced character, and the other, in this case 盲, is a phonetic replacement character. Such characters have the same reading and are often similar in meaning.

Phonetic Substitutes

| English | Reading | Phonetic Replacement | Phonetically Replaced |
|---|---|---|---|
| fermentation | hakkoo | 発酵 | 醗酵 |
| satire | fuushi | 風刺 | 諷刺 |
| linking | renkei | 連係 | 連繋 |
| wild idea | moosoo | 盲想 | 妄想 |
| abuse | ranyoo | 乱用 | 濫用 |

Though the older, phonetically replaced characters are gradually going out of use, their frequency of occurrence is sufficiently high to warrant support by search engines.

2.6 Katakana Variants

The katakana syllabary is used mostly to write Western loanwords, onomatopoeic words, names of plants and animals, non-Japanese personal and place names, for emphasis, and for slang. Recent years have seen an enormous increase in katakana use, especially in technical terminology.

Unfortunately, katakana orthography is often irregular, so that the same word may be written in multiple ways. Basically, the katakana transliteration of a loanword is an attempt to approximate the pronunciation of its etymon (the foreign word from which it is derived). Although there are general guidelines for loanword orthography, in practice there is considerable variation.

Katakana variation can be classified into the following types:

The presence or absence of a macron (a dash-like symbol that indicates long vowels).

The presence or absence of nakaguro (a middle dot between katakana words).

Replacing macrons with actual vowels to indicate long vowels.

A single foreign sound may be transcribed by multiple kana characters.

Miscellaneous katakana variants.

Typology of Katakana Variation

| Variation Type | English | Reading | Standard Form | Variants |
|---|---|---|---|---|
| 1. Macron | computer | konpyuuta, konpyuutaa | コンピュータ | コンピューター |
| | user | yuuza, yuuzaa | ユーザー | ユーザ |
| 2. Nakaguro | online | onrain | オンライン | オン・ライン |
| | ice cube | aisukyuubu | アイスキューブ | アイス・キューブ |
| 3. Long vowels | eye shadow | aishadoo | アイシャドー | アイシャドウ |
| | maid | meedo | メード | メイド |
| 4. Multiple kana | Diesel | diizeru, jiizeru | ディーゼル | ジーゼル, ヂーゼル |
| | team | chiimu, tiimu | チーム | ティーム |
| | violin | baiorin, vaiorin | バイオリン | ヴァイオリン |
| 5. Others | quota | kuootaa, kwootaa | クオーター | クォーター |
| | Jerusalem | ierusaremu | エルサレム | イェルサレム |

The above is only a brief introduction to the complexities of katakana variation, which is as common as it is unpredictable. To relieve the user of the burden of guessing the correct variant, the intelligent Japanese search engine should be capable of retrieving all katakana variants of the search term.
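The first two variation types are regular enough to be reduced by rule. As a rough sketch, a comparison key can be built by deleting the nakaguro and any word-final macron; katakana_key is a hypothetical helper, and stripping final macrons is an assumption that may merge words a real system would want to keep apart. Types 3 to 5 are not handled here and would need a variant table.

```python
NAKAGURO = "・"  # U+30FB, middle dot between katakana words
MACRON = "ー"    # U+30FC, long-vowel mark


def katakana_key(word):
    """Reduce a katakana word to a comparison key (a sketch, not a full normalizer)."""
    key = word.replace(NAKAGURO, "")  # type 2: オン・ライン -> オンライン
    key = key.rstrip(MACRON)          # type 1: コンピューター -> コンピュータ
    return key
```

Two spellings are treated as variants of each other when their keys are equal, so コンピュータ and コンピューター fall together while メード and メイド remain distinct.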

2.7 Hiragana Variants

As explained in Section 1, the hiragana syllabary is used mostly to write grammatical elements and some native Japanese words, such as adverbs and particles. In recent years there has been a considerable increase in the use of hiragana, both for stylistic effects and because of the popular belief that hiragana is easier to read than kanji.

Unfortunately, both for humans and computers, hiragana strings are considerably more difficult to segment than the equivalent kanji-kana texts. Since there are no delimiters between words, identifying the lexemes in a hiragana string is often a futile exercise in disambiguation.

If we discount the okurigana irregularities and cross-script variations explained in Section 2.2 and Section 2.3, the hiragana orthography is, in itself, quite regular. Nevertheless, there are a certain number of hiragana irregularities, as explained below:

The use of traditional kana orthography for the case particles は, へ and を, instead of わ, え and お.

The use of traditional お instead of う to indicate long o in certain words.

The use of voiced ぢ and づ in place of じ and ず when the former are (1) preceded by ち and つ or (2) appear in voiced compounds derived from ち and つ. But for some words, as shown below, じ and ず are preferred. Actual usage is unpredictable.

The use of historical kana orthography in prewar texts, classical literature and Buddhist scriptures.

Miscellaneous hiragana variants, such as the use of the kana repetition symbol ゝ.

Typology of Hiragana Variation

| Variation Type | English | Reading | Standard Form | Variants |
|---|---|---|---|---|
| 1. Particles | Hello | konnichi wa | こんにちは | こんにちわ |
| 2. Traditional | way | toori | とおり | とうり |
| | big | ookii | おおきい | おうきい |
| 3. ぢ and づ | to shrink | chijimu | ちぢむ | ちじむ |
| | to continue | tsuzuku | つづく | つずく |
| | nosebleed | hanaji | はなぢ | はなじ |
| | to nod | unazuku | うなずく | うなづく |
| 4. Historical | to use | mochiiru | 用いる | 用ゐる |
| | long oo | koo | こう | かう, かふ |
| | smell | nioi | におい(匂い) | にほひ |
| 5. Others | here | koko | ここ | こゝ |

Although hiragana variation is a relatively minor issue, the intelligent Japanese search engine should be capable of retrieving hiragana variants, regardless of how the keyword is input.
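Of the hiragana variants listed above, the kana repetition symbol is the one that lends itself most readily to a rule. The sketch below handles only the unvoiced mark ゝ, copying the preceding character as-is; the voiced mark ゞ and the other variation types are outside its scope, and expand_iteration_mark is a hypothetical name.

```python
ITERATION_MARK = "ゝ"  # U+309D, hiragana repetition symbol


def expand_iteration_mark(text):
    """Replace each ゝ with a copy of the character that precedes it."""
    out = []
    for ch in text:
        if ch == ITERATION_MARK and out:
            out.append(out[-1])  # こゝ -> ここ
        else:
            out.append(ch)
    return "".join(out)
```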

3. Cross-Homophone Searching

This section presents a brief overview of Japanese homophony. Our aim is to demonstrate that even professional writers and sophisticated users are confused by the subtle distinctions between the numerous homophones in Japanese, and to assert that the intelligent search engine should be capable of optionally retrieving the various homophonic variants of the search term; in other words, of performing cross-homophone searching.

3.1 Some Definitions

A plethora of abstruse terms are used to describe the orthographic relations between words, including homograph, heteronym, homologue, heterograph and homonym, to name a few. Much confusion prevails, since these terms are often used inconsistently, even by professional linguists. This topic deserves a full paper in its own right. Here, we will keep it simple and define the most important terms.

Homophone: One of two or more words that are pronounced the same but differ in writing and usually in meaning (e.g. principal and principle).

Homograph: One of two or more words that are written the same but differ in pronunciation and (usually) in meaning (misleadingly also called heteronyms) (e.g. minute "60 seconds" and minute "very small").

Homonym: One of two or more words that are identical in writing and/or pronunciation but differ in meaning (sometimes called homologues) (e.g. light "not heavy" and light "not dark").

Orthographic Variant: One of two or more words that are written differently but are identical in pronunciation and meaning (sometimes called heterographs) (e.g. judgement and judgment).

3.2 Overview of Japanese Homophony

An important factor that contributes to the complexity of the Japanese writing system is the existence of a large number of homophones. Kooki and kikoo, for instance, each represent about a dozen words in common use, and the only way to distinguish between such compounds as 機構 kikoo 'mechanism' and 帰港 kikoo 'returning to the harbor' is through the characters. Although on (Chinese derived) homophones like the above may occasionally cause confusion in the spoken language, they are easily distinguished in the written language.

On the other hand, the abundance of kun (native Japanese) homophones is a source of confusion even to professional writers and editors. Not only can each kanji have many kun readings, but many kun words can be written in a bewildering variety of ways. In extreme cases, such as the word sasu, a kun word can be written in dozens of ways, though only several of these are in common use. Unlike on homophones, the majority of kun homophones are often close or even identical in meaning and thus easily confused, as shown in the table below:

Kun Homophones

| Easily Distinguished: hashi | Easily Confused: noboru |
|---|---|
| 橋 bridge | 上る go up (steps, a hill) |
| 端 end, edge | 登る climb, scale |
| 箸 chopsticks | 昇る ascend, rise (up to the sky) |

Another problem with kun homophones is their variable orthography. Two or more characters are often partially or completely interchangeable in some senses but not in others. For example, 解ける tokeru and 溶ける tokeru are interchangeable in the sense of 'melt, thaw' but not in the sense of 'come loose', which is written 解ける. On the other hand, the meanings of some homophones are identical or nearly identical. For example, yawarakai 'soft, subdued; gentle' is written 柔らかい or 軟らかい with exactly the same meaning.

To make matters worse, the distinctions between some homophones are so subtle that many authors don't even try to select the most appropriate kanji and resort to the "easy solution" of using hiragana instead, making the meaning fuzzy and searching more difficult (see 2.3 Cross-script Searching for details).

3.3 Intelligent Homophone Searching

Cross-homophone searching requires a semantically classified database of homophones and a homophone expansion algorithm. The process of searching for Japanese homophones is not, in itself, any more difficult than searching for such English homophones as right and write. In English, however, cross-homophone searching is clearly undesirable. A user searching for right is most certainly not interested in finding write.

Not so in Japanese. From an information retrieval point of view, the major issue is that for many kun homophones, a universally-accepted orthography does not exist. Theoretically, the choice of character should be based on meaning, but in fact it is often unpredictable and governed by personal preferences.

This means that, when the user enters a query that involves homophones, she can never be sure which particular one to select, since often there is no one right answer. We have already seen in Section 2.1 above how hopelessly difficult it is for the user to select the appropriate homophones when searching for the book title Hi no sasanai yashiki. The table below demonstrates why this is so by showing the complex semantic interrelations between the homophones for sasu.

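Kun Homophones for sasu

| No. | English | "Standard" Form | Sometimes also | Often also |
|---|---|---|---|---|
| 1 | to offer | 差す | | さす |
| 2 | to hold up | 差す | | さす |
| 3 | to pour into | 差す | 注す | さす |
| 4 | to color | 差す | 注す | さす |
| 5 | to shine on | 差す | 射す | さす |
| 6 | to aim at | 指す | 差す | |
| 6 | to point to | 指す | | さす |
| 7 | to stab | 刺す | | さす |
| 8 | to leave unfinished | さす | 止す | |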

To sum up, Japanese homophones have certain characteristics that exacerbate the difficulties of retrieving them:

Since many kun homophones are nearly synonymous or even identical in meaning, they are easily confused. As a result, there is no way to predict which particular homophone will appear in a text.

The distinction between some homophones is so subtle that many authors sidestep the irksome task of selecting the appropriate kanji and resort to hiragana.

Since Japanese has only a small stock of phonemes, the number of homophones is very large.

As things stand now, the entire burden of homophone searching falls upon the user. The intelligent search engine, by performing cross-homophone searching at the user's request, will relieve the user of this burden by retrieving all the homophones in the relevant homophone group.

Implementing such technology requires a comprehensive database of semantically and etymologically classified homophones. Merely retrieving all homophones will do far more harm than good since it will match numerous irrelevant homophones, such as 変える kaeru 'to change' for 帰る kaeru 'to return'.
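The difference between blind and semantically classified homophone expansion can be sketched as follows. The HOMOPHONES structure is a hypothetical stand-in for the database described above, keyed by reading and then by sense, with entries taken from the sasu table and the kaeru example; expand_homophone is likewise an illustrative name.

```python
# Hypothetical homophone database: reading -> sense label -> spellings.
HOMOPHONES = {
    "さす": {
        "to shine on": ["差す", "射す", "さす"],
        "to stab": ["刺す", "さす"],
    },
    "かえる": {
        "to change": ["変える"],
        "to return": ["帰る"],
    },
}


def expand_homophone(reading, term):
    """Return spellings that share a sense group with term, not every homophone."""
    result = []
    for spellings in HOMOPHONES.get(reading, {}).values():
        if term in spellings:
            for spelling in spellings:
                if spelling not in result:
                    result.append(spelling)
    return result or [term]
```

A query for 射す expands to 差す, 射す and さす, whereas a query for 帰る does not pull in 変える, because the two belong to different sense groups. An all-hiragana query such as さす remains ambiguous and matches every group that lists it.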

4. Homograph Disambiguation

4.1 Overview of Japanese Homography

Below is a brief overview of Japanese homography. A homograph is one of two or more words that are written the same but differ in pronunciation and (usually) in meaning (e.g. minute "60 seconds" and minute "very small").

Japanese has numerous kanji that have multiple on and kun readings, which gives rise to a large number of homographs. The table below lists some typical examples:

Japanese Homographs

| Num. | Homograph | Reading | English |
|---|---|---|---|
| 1 | 一時 | ichiji | one o'clock; temporarily |
| | 一時 | hitotoki | a while |
| | 一時 | ittoki | a moment; 12th part of day |
| 2 | 一章 | isshoo | one chapter |
| | 一章 | kazuaki | a first name |
| 3 | 仮名 | kana | kana syllabary |
| | 仮名 | kamei | fictitious name, pseudonym |
| | 仮名 | karina | alias, assumed name |
| | 仮名 | kemyoo | fictitious name, pseudonym |
| 4 | 化学 | kagaku | chemistry |
| | 化学 | bakegaku | chemistry |

Unlike English homographs, which differ in meaning, the meanings of Japanese homographs could be identical (化学 above), totally different (一章 above), or partially synonymous (一時 above).

4.2 Intelligent Homograph Searching

Since the number of homographs in Japanese is very large (we found over 20,000 in our databases), it follows that a failure to identify specific homographs will often lead to irrelevant results. However, it is self-evident that, since homographs are written in exactly the same way, retrieving the one semantically relevant to the search term is a very difficult task. This is called homograph disambiguation, and is an important issue in text-to-speech synthesis. This is similar to searching for a polysemous word used in a specific sense, such as for 'table' in the sense of "article of furniture," as opposed to 'table' in the sense of "rectangular array of data."

The truly intelligent search engine should be capable of performing homograph disambiguation.

5. Cross-Synonym Searching

The words of a language form a closely-linked network of interdependent units. The meaning of a word or expression cannot really be understood unless its relationships with other closely related words are taken into account. For example, such words as kill, murder, and execute share the meaning of 'put to death', but they differ in usage and connotation.

The Japanese language has an extraordinarily rich stock of synonyms and synonymous expressions. This section presents a brief overview of Japanese synonymy, and demonstrates that the user can greatly benefit from an intelligent search engine capable of retrieving synonyms of the search term; that is, of performing cross-synonym searching or synonym expansion.

5.1 Overview of Japanese Synonymy

From the point of view of building an intelligent search engine, the abundance of Japanese synonyms poses some interesting challenges. Below is a brief introduction to this complex subject, with focus on the different types of sense relations between synonyms and other kinds of semantically related words.

Synonymy
A relation between a set of words that are similar (near-synonyms) or identical (absolute-synonyms) in meaning.

| Relation | English | Reading | Japanese |
|---|---|---|---|
| Shared concept | money | kane | 金 |
| Synonyms | currency | tsuuka | 通貨 |
| | cash | genkin | 現金 |
| | bank note | shihei | 紙幣 |

Hyponymy and Hyperonymy
A relation between a set of specific (subordinate) terms, called hyponyms, and a generic (superordinate) term, called the hyperonym. The hyperonym is more general and includes the senses of the hyponyms.

| Relation | English | Reading | Japanese |
|---|---|---|---|
| Hyperonym | sound | oto | 音 |
| Hyponyms | voice | koe | 声 |
| | echo | hankyoo | 反響 |
| | noise | sooon | 騒音 |

Meronymy
A relation between a set of subordinate words, called meronyms, whose meanings are in a partitive (part-of) relation to a more comprehensive concept, called a holonym.

| Relation | English | Reading | Japanese |
|---|---|---|---|
| Holonym | city | shi | 市 |
| Meronyms | ward | ku | 区 |
| | town section | choo | 町 |
| | town subsection | choome | 丁目 |

Complementarity
A relation between a set of words that contrast with each other and are mutually exclusive:

| Relation | English | Reading | Japanese |
|---|---|---|---|
| Shared concept | siblings | kyoodai | 兄弟 |
| Complementary terms | older brother | ani | 兄 |
| | younger brother | otooto | 弟 |
| | older sister | ane | 姉 |
| | younger sister | imooto | 妹 |

Antonymy
A relation between words, called antonyms, of opposite meanings, such as 清潔な seiketsu na 'clean' and 汚い kitanai 'dirty'. Antonyms are probably not of interest in information retrieval.

5.2 The Semasiological Approach

Normally, the user of a dictionary starts out with a word or phrase and expects to find lexical information, such as a definition or a target language equivalent. Similarly, the user of a search engine starts out with a search term (keyword, phrase or Boolean expression) and expects to find cyberinformation, such as webpages, online databases and newsgroups relevant to the search term.

It is important to note that such a search operation has a well-defined direction: word-to-concept (lexeme-to-sense) or, in a search engine environment, keyword-to-cyberinformation. In lexicography, this way of searching is referred to as the semasiological approach. Clearly, this approach is based on the assumption that all the user wants is information including the specific search term provided in the search box.

As any search engine user knows, this is often not the case. Let us assume that a user wants to search for information on Kennedy's assassination. In AltaVista, she might enter the string "+Kennedy +assassination." But surely this query will not retrieve such phrases as:

"Kennedy was killed on ..."

"The murder of Kennedy was ..."

"JFK had to be eliminated because ..."

To locate such phrases with conventional search engines, the user must resort to the laborious task of building advanced Boolean queries, and then spend much time wading through often irrelevant results.

5.3 The Onomasiological Approach

From the point of view of the user interested in the semantic content of the search results, rather than in their orthographic representation, the semasiological approach is clearly inadequate. When such a user searches for the keyword "Kennedy," surely she is interested in the referent represented by "John Kennedy", "JFK," or "President Kennedy", not just in the lexicalized manifestation of any particular synonym. Similarly, when searching for "assassination," surely the user is interested in finding information on the concept [cause to die], not just in finding any particular phrase such as "the murder of", "was killed by" or "the killing of."

The opposite of the semasiological approach is the onomasiological approach, which reverses the normal semantic paradigm (also known as the onomantic perspective). There is a long tradition of lexicographic works based on this approach, the best-known examples of which are thesauri and synonym dictionaries. These works make it possible to reverse the normal search direction; that is, instead of searching from word to concept, the user can search from concept to word.

5.4 Intelligent Synonym Searching

Though the usefulness of the onomasiological approach to dictionary consultation is indisputable, it has not yet become established in search engine information retrieval. The search strategy proposed here, based on the onomasiological approach, is called synonym expansion or cross-synonym searching. In a sense, the thematic search and topic search technologies currently implemented in web subject directories are also based on the onomasiological approach. But, as any search engine user knows, wading through multilevel hierarchies of subject directories is a time-consuming strategy that is too inefficient to be practical.

How does cross-synonym searching work? The user still has to enter a search term consisting of keywords, but with an important difference: the user need not be overly concerned with the specific wording of the query. A query consisting of any expression such as "+kill +Kennedy", "JFK's assassination", or "The murder of John Kennedy" is expanded into the full set of synonyms and leads to the same or very similar search results.

To implement such technology, a comprehensive database of synonyms is required. A typical (partial) entry in such a database might look like this:

Concept: [to cause to die]

| English | Reading | Japanese |
| --- | --- | --- |
| to kill | korosu | 殺す |
| to commit murder | satsujin o okasu | 殺人を犯す |
| to execute | shokei suru | 処刑する |
| to murder | satsugai suru | 殺害する |
| to shoot to death | shasatsu suru | 射殺する |
| to assassinate | ansatsu suru | 暗殺する |
| to bump off | yaru | やる, 殺る |
| to butcher | barasu | ばらす |

Semantically-classified databases like the above are useful not only for cross-synonym searching, but also in such increasingly important web technologies as the automated categorization of web resources and automatic query expansion (AQE). For cross-synonym searching to be truly effective, it should be combined with cross-orthographic searching and some of the other retrieval technologies described in this paper, as well as such technologies as query expansion with relevance feedback.
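The expansion step can be illustrated with a minimal sketch. The dictionary layout, the names `SYNONYM_DB`, `TERM_TO_CONCEPT` and `expand_query`, and the choice of Python are all assumptions made for illustration; only the entries themselves come from the sample table above.

```python
# Hypothetical concept-keyed synonym database. The entries are taken from the
# sample table above; the data structure itself is an assumption.
SYNONYM_DB = {
    "to cause to die": [
        "殺す", "殺人を犯す", "処刑する", "殺害する",
        "射殺する", "暗殺する", "やる", "殺る", "ばらす",
    ],
}

# Reverse index: surface form -> concept (the word-to-concept direction).
TERM_TO_CONCEPT = {
    term: concept
    for concept, terms in SYNONYM_DB.items()
    for term in terms
}


def expand_query(term):
    """Return every synonym that shares a concept with the search term."""
    concept = TERM_TO_CONCEPT.get(term)
    if concept is None:
        return [term]  # unknown term: search for it unchanged
    return list(SYNONYM_DB[concept])
```

Calling `expand_query("暗殺する")` returns all nine forms listed under the concept, so a query on any one of them reaches documents containing the others. A production system would also have to disambiguate highly polysemous forms such as やる, which more commonly means 'to do'.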

6. Cross-Language Searching

Non-Japanese users, such as learners, and even native speakers, can greatly benefit from English-Japanese cross-language searching; that is, inputting an English query to retrieve webpages that include the equivalent word(s) in Japanese, as shown in the table below:

Cross-Language Searching

| Search Term | Search Results | Reading |
| --- | --- | --- |
| Japanese economy | 日本(の)経済 | Nihon (no) keizai |
| Tokyo | 東京 | Tookyoo |
| happy | 幸福, 幸せ | koofuku, shiawase |
| NEC | 日本電気, ＮＥＣ | Nihon Denki, en-ii-shii |

Cross-language searching has the additional benefit of enabling users without a Japanese input method editor (IME) to retrieve Japanese webpages. This is especially useful when searching for katakana words from the corresponding English words. Since Japanese has countless katakana loanwords derived from English, many of which are of variable orthography, even users with a Japanese IME and native speakers may find it more convenient to input English keywords and have the search engine retrieve all katakana and Latin alphabet variants, as shown in the table below:

English to Katakana Conversion

| Search keyword | Search results |
| --- | --- |
| computer | コンピュータ, コンピューター |
| WWW | ワールドワイドウェブ, ウェブ, ＷＷＷ |
| Diesel | ディーゼル, ジーゼル |

Cross-language searching, also known as cross-language information retrieval (CLIR), is a new research area that is becoming increasingly important as the World Wide Web undergoes rapid internationalization. The technical details are discussed in an article by Douglas W. Oard. Here, we will only mention that such technology requires access to a comprehensive English-Japanese lexicon designed to meet the needs of the search engine environment.
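As a rough illustration of the lexicon lookup involved, the sketch below maps an English keyword to its Japanese search targets and checks a document against them. The names `EN_JA_LEXICON`, `translate_query` and `matches` are hypothetical, and the tiny dictionary, populated from the two tables above, stands in for the comprehensive lexicon a real system would need.

```python
# Hypothetical English-to-Japanese lexicon. Entries come from the two tables
# above; a real lexicon would be far larger and carry more attributes.
EN_JA_LEXICON = {
    "tokyo": ["東京"],
    "happy": ["幸福", "幸せ"],
    "nec": ["日本電気", "ＮＥＣ"],
    "computer": ["コンピュータ", "コンピューター"],
    "www": ["ワールドワイドウェブ", "ウェブ", "ＷＷＷ"],
    "diesel": ["ディーゼル", "ジーゼル"],
}


def translate_query(english_term):
    """Return all Japanese search targets for an English keyword."""
    return EN_JA_LEXICON.get(english_term.strip().lower(), [])


def matches(document, english_term):
    """True if the document contains any Japanese equivalent of the term."""
    return any(target in document for target in translate_query(english_term))
```

For example, `matches("新しいコンピューターを買った", "computer")` is true, so a user without a Japanese input method can still reach the page.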

7. Morphological Analysis

This section briefly describes morphological analysis, an essential component of any Japanese search engine, and miscellaneous search technologies not covered in other sections.

7.1. Morphological and Lexemic Analysis

To perform query processing and search and retrieval operations, a Japanese search engine must be capable of processing a Japanese text on two levels: morphological and lexemic. Morphological analysis refers to computational procedures such as stemming and conflation that operate on the morphemic level (described below). The more difficult lexemic analysis refers to identifying word boundaries by segmenting a text stream into meaningful semantic units (such as lexemes) for dictionary lookup and indexing purposes.

Segmentation and morphological analysis are central to Japanese search engine technology, and each deserves a paper in its own right. Below, we will briefly describe some of the issues.
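To make the segmentation problem concrete, here is a minimal sketch of greedy longest-match segmentation, a common dictionary-based baseline and not necessarily the method the author has in mind. The toy `LEXICON` and the function name `segment` are assumptions for illustration.

```python
# Toy lexicon (an assumption); a real segmenter relies on a large lexical
# database and on grammatical information that this sketch ignores.
LEXICON = {"日本", "経済", "日本経済", "新聞", "の", "東京"}
MAX_LEN = max(len(word) for word in LEXICON)


def segment(text):
    """Greedy longest-match segmentation.

    At each position, take the longest lexicon entry that matches; a
    character that starts no entry becomes a one-character unit.
    """
    units = []
    i = 0
    while i < len(text):
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in LEXICON:
                units.append(candidate)
                i += length
                break
    return units
```

Greedy matching is simple but can choose the wrong boundary when a longer entry overlaps the intended words, which is one reason segmentation benefits from richer lexical and grammatical data.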

7.2. Conflation and Stemming

The Japanese language is agglutinative; that is, it forms words by putting together basic elements, called morphemes, that retain their original forms and meanings with little change during the combination process. Inflection in Japanese typically consists of adding conjugational endings to a stem to indicate various grammatical functions, such as tense. The resulting word is another word-form of the underlying lexeme, not a new word in itself, as shown in the table below (only basic forms are given):

Conjugation Paradigm for 書く kaku 'to write'

| Category | Affirmative | Reading | Negative | Reading |
| --- | --- | --- | --- | --- |
| Non-past | 書く | kaku | 書かない | kakanai |
| Non-past polite | 書きます | kakimasu | 書きません | kakimasen |
| Past | 書いた | kaita | 書かなかった | kakanakatta |
| Past polite | 書きました | kakimashita | 書きませんでした | kakimasen deshita |
| Gerund | 書いて | kaite | 書かないで | kakanaide |
| Continuative | 書き | kaki | | |
| Conditional | 書けば | kakeba | 書かなければ | kakanakereba |
| Imperative | 書け | kake | 書くな | kaku na |
| Tentative | 書こう | kakoo | | |
| Tentative polite | 書きましょう | kakimashoo | | |

A major issue in the indexing and retrieval of Japanese texts is the extensive morphological variation in word-forms. A Japanese search engine must not only be capable of segmenting the search term into meaningful semantic units (such as lexemes and multi-word terms), but must also be capable of ignoring morphological variants like those shown in the above table.

A computational procedure designed to match morphological variants by reducing them to a single form for retrieval purposes is called a conflation algorithm. A procedure for processing a word, often by removing the inflectional endings to find its stem, is called stemming. Conflation and stemming make it possible to retrieve any inflectional form from any of the others, ensuring that potentially relevant documents are not lost.

Because of the morphological complexity of Japanese, it goes without saying that the intelligent Japanese search engine must be capable of both conflation and stemming, since they are a prerequisite for implementing the various search technologies described in this paper, such as cross-orthographic and cross-language searching.
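Conflation can be sketched as a lookup from every inflected word-form to its dictionary form. The example below generates the forms of 書く from the endings in the conjugation table and indexes them. The names `KU_VERB_ENDINGS`, `build_conflation_index` and `conflate` are hypothetical, and treating the paradigm as a fixed list of endings attached to the kanji stem is a deliberate simplification.

```python
# Endings for a verb of the 書く type, copied from the conjugation table above.
# Treating them as a fixed list attached to the kanji stem is a simplification.
KU_VERB_ENDINGS = [
    "く", "かない", "きます", "きません", "いた", "かなかった",
    "きました", "きませんでした", "いて", "かないで", "き",
    "けば", "かなければ", "け", "くな", "こう", "きましょう",
]


def build_conflation_index(stem, lemma):
    """Map every generated word-form to its dictionary form (lemma)."""
    return {stem + ending: lemma for ending in KU_VERB_ENDINGS}


INDEX = build_conflation_index("書", "書く")


def conflate(word_form):
    """Return the lemma for a known word-form, else the form unchanged."""
    return INDEX.get(word_form, word_form)
```

With this index, `conflate("書いた")` and `conflate("書きませんでした")` both yield 書く, so a query in any one form can retrieve documents containing the others. Covering other verb classes and irregular verbs would require additional ending sets or, more realistically, a lexical database.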

7.3 Regular Expressions

Adding double-byte-enabled regular expression functionality to Japanese search engines will provide users with a tool for highly flexible searching far more powerful than Boolean expressions. A detailed treatment of regular expressions is outside the scope of this paper. A full analysis can be found in an excellent book on the subject, Mastering Regular Expressions by Jeffrey Friedl.

Below are a few examples of metacharacters used in regular expressions (based on the POSIX standard):

Some Metacharacters in Regular Expressions

| Metacharacter | Description |
| --- | --- |
| . | any single character |
| [ ] | any one of the characters in the brackets |
| [^ ] | any one character except those listed after the caret |
| ^ | beginning of line |
| $ | end of line |
| \< | beginning of word |
| \> | end of word |
| \t | tab character |
| \n | newline character |
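The short sketch below applies several of these metacharacters to Japanese text using Python's standard `re` module, chosen here only for illustration. Python 3 patterns operate on Unicode strings, so kanji and kana count as single characters without special double-byte handling. Note that Python's `re` does not support the `\<` and `\>` word anchors listed in the table, so they are omitted; the sample sentence is invented.

```python
import re

# Invented sample sentence: "My older brother wrote a letter in Tokyo."
text = "兄は東京で手紙を書いた"

# "." matches any single character, including kanji and kana
any_char = re.findall("書.た", text)

# "[ ]" matches one character from the set (here, the sibling kanji)
siblings = re.findall("[兄弟姉妹]", text)

# "[^ ]" matches one character NOT in the set: 書 followed by anything but く
not_ku = re.findall("書[^く]", text)

# "^" and "$" anchor the match at the start and end of the line
starts_with_ani = re.search("^兄", text) is not None
ends_with_tokyo = re.search("東京$", text) is not None
```

Here `any_char` is `["書いた"]`, `siblings` is `["兄"]`, and `not_ku` is `["書い"]`; the text starts with 兄 but does not end with 東京.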

7.4 Miscellaneous Search Techniques

This paper covers the major issues in building an intelligent Japanese search engine, but is by no means exhaustive. There are various other possibilities, such as:

Loanword conversion. Katakana loanwords to native words conversion; for example, the search term ドリンク dorinku 'drink' can be used to retrieve the corresponding kun and on equivalents 飲み物 nomimono and 飲料 inryoo. This is a limited implementation of cross-synonym searching.

Lexeme-based retrieval. Perform lexeme-based, rather than word-based, retrieval. For example, in searching for the keyword "high school", exclude webpages in which "high" is separated from "school" since these are unrelated to the lexeme 'high school'.

Character normalization. The relatively trivial normalization of character types, such as ignoring half-width/full-width differences (e.g. ＯＰＥＮ/OPEN) and various symbols and punctuation marks.
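Character normalization of the kind described in the last item can be approximated with Unicode compatibility normalization. The sketch below uses Python's standard `unicodedata` module with form NFKC; choosing NFKC is an assumption made for illustration, not something the paper prescribes, and the function name `normalize_width` is hypothetical.

```python
import unicodedata


def normalize_width(text):
    """Fold full-width Latin and half-width katakana to their usual forms.

    Uses Unicode NFKC normalization, which is an assumption here: it also
    folds other compatibility characters, which may or may not be wanted.
    """
    return unicodedata.normalize("NFKC", text)
```

With this function, `normalize_width("ＯＰＥＮ")` gives `"OPEN"` and half-width `"ｺﾝﾋﾟｭｰﾀ"` becomes `"コンピュータ"`. Applying the same function to both the index and the query makes the two forms match.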

8. Lexical Databases

Because of the morphological complexity and highly irregular orthography of Japanese, developing the advanced retrieval technologies required for intelligent Japanese searching cannot be based on algorithmic and statistical methods alone. To be effective, such methods must be supplemented by large-scale, up-to-date lexical databases designed to meet the specific needs of search engine applications.

The CJK Dictionary Institute (CJKI), which specializes in CJK computational lexicography, is engaged in the continuous expansion of a comprehensive CJK lexical database called DESK. Currently, DESK has over two million Japanese and one million Chinese entries, and includes a rich set of grammatical and semantic attributes required for developing information retrieval applications, input method editors, and electronic dictionaries.

Below is a brief description of the principal database components useful for developing intelligent Japanese search engines:

General Vocabulary. A comprehensive database of about 450,000 entries covering general vocabulary. The rich set of grammatical attributes is fine-tuned to support search engine applications, especially morphological analyzers and word segmenters.

Katakana Loanwords. About 50,000 loanwords and other Japanese words written in katakana, with special focus on computer and Internet terminology.

Japanese Names. About 600,000 Japanese (and Chinese) personal and place names semantically classified and ranked by frequency.

Western Names. An English-Japanese database of about 60,000 non-Japanese personal and place names, semantically classified and accompanied by English equivalents.

Japanese Companies. About 600,000 Japanese company and organization names ranked by frequency with English equivalents when appropriate.

Orthographic Variants. A database of about 60,000 orthographic variants, with full coverage of okurigana, kanji, and kana variants, designed to support cross-script and cross-orthographic searching.

Homograph Groups. A database of about 34,000 homographs designed to support homograph disambiguation.

Synonym Groups. A database of semantically classified synonym groups consisting of kanji synonyms, homonyms and meronyms serving as a basis for a Japanese thesaurus designed to support cross-synonym searching.

English-Japanese Dictionary. An English-Japanese lexical database of over 100,000 entries covering general vocabulary and important proper names. This can be expanded to cover Western names and technical terms.

Kanji Database. A single-character database that covers every aspect of CJK characters, including frequency, phonology, radicals, character codes, and other attributes.

Orthographic Variation Rules. A comprehensive collection of rules of katakana, hiragana, and kanji orthographic variation that can be used to generate variants not listed in the database.

9. Conclusions

As we have seen, cross-orthographic searching is essential for intelligent Japanese information retrieval and query processing. However, the current lineup of first- and second-generation Japanese search engines is incapable of cross-orthographic searching, not to speak of the other, more advanced, retrieval technologies discussed in this paper. We have also seen that, because of the complexities and irregularities of the Japanese writing system, the implementation of intelligent retrieval technologies requires not only computational linguistic tools such as morphological analyzers, but also lexical databases fine-tuned to the needs of Japanese search engines.

Below is an outline of our vision for the future directions of third-generation Japanese search engine technology. The minimum requirements for what we shall refer to as a Level 1 Intelligent Japanese Search Engine (IJSE) are as follows:

A linguistically sophisticated Japanese Morphological Analyzer capable of segmenting a Japanese text stream into meaningful units such as lexemes.

The lack of sophisticated tools to cope with the complexities of the Japanese script places users of Japanese search engines and major portals such as e-commerce sites at a distinct disadvantage. The information retrieval industry in general, and the search engine industry in particular, is in urgent need of third-generation retrieval technology capable of meeting the challenges of intelligent Japanese searching.

The CJK Dictionary Institute finds itself in a unique position to provide the comprehensive, high-quality lexical resources and the software infrastructure required for building intelligent Japanese information retrieval technology.

Born in Germany in 1946, Jack Halpern lived in six countries and knows twelve languages. Fascinated by kanji while living in an Israeli kibbutz, he came to Japan in 1973, where he compiled the New Japanese-English Character Dictionary for sixteen years. He is a professional lexicographer/writer and lectures widely on Japanese culture, is winner of first prize in the International Speech Contest in Japanese, and is founder of the International Unicycling Federation.

Jack Halpern is currently the editor-in-chief of the Kanji Dictionary Publishing Society (KDPS), a non-profit organization that specializes in compiling kanji dictionaries, and the head of The CJK Dictionary Institute (CJKI), which specializes in CJK lexicography and the development of a comprehensive CJK database (DESK). He has also compiled the world’s first Unicode dictionary of CJK characters.

The principal activity of the CJKI is the development and continuous expansion of a comprehensive database that covers every aspect of how Chinese characters are used in CJK languages, including Cantonese. Advanced computational lexicography methodology has been used to compile and maintain a Unicode-based database that is serving as a source of data for:

DESK currently has over two million Japanese and about one million Chinese items, including detailed grammatical, phonological and semantic attributes for general vocabulary, technical terms, and hundreds of thousands of proper nouns. The single-character database covers every aspect of CJK characters, including frequency, phonology, radicals, character codes, and other attributes. See http://www.cjk.org/cjk/samples/ for a list of data resources.

The CJKI has become one of the world’s prime resources for CJK dictionary data, and is contributing to CJK information processing technology by providing software developers with high-quality lexical resources, as well as through its ongoing research activities and consulting services.