The Japanese Bigger Analogy Test Set (jBATS)

The word analogy task has been one of the standard benchmarks for word embeddings since the striking demonstration of “linguistic regularities” by Mikolov et al. (2013) [1]. Their key finding was that linear vector offsets appear to mirror linguistic relations and can be used to perform analogical reasoning.

For example, consider an analogy a : a' :: b : b' such as man : woman :: king : queen. Mikolov et al. proposed that the word b' can be found by adding the offset between the vectors of a' and a to the vector of b. In this case, the vector for queen should be the one closest to the result of the calculation king - man + woman.
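This offset computation (often called 3CosAdd) can be sketched with toy vectors. The embeddings below are invented purely for illustration; real word vectors are learned from corpora and have hundreds of dimensions:

```python
import numpy as np

# Toy embeddings, made up for illustration only.
emb = {
    "man":    np.array([0.9, 0.1, 0.0]),
    "woman":  np.array([0.1, 0.9, 0.0]),
    "king":   np.array([0.9, 0.1, 0.8]),
    "queen":  np.array([0.1, 0.9, 0.8]),
    "prince": np.array([0.9, 0.1, 0.9]),
}

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def three_cos_add(a, a_prime, b):
    """Answer a : a' :: b : ? by ranking all other words by cosine
    similarity to the offset vector b - a + a' (3CosAdd)."""
    target = emb[b] - emb[a] + emb[a_prime]
    # The three question words are excluded, as is standard for this task.
    candidates = [w for w in emb if w not in {a, a_prime, b}]
    return max(candidates, key=lambda w: cos(emb[w], target))

print(three_cos_add("man", "woman", "king"))  # queen
```

The answer is simply the nearest neighbour of the offset vector among all remaining vocabulary words.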

Mikolov et al. demonstrated their findings on the Google analogy dataset, which consists of 9 morphological and 5 semantic categories, with 20-70 unique word pairs per category: 8,869 semantic and 10,675 syntactic questions in total. However, this set was unbalanced, and some categories were overrepresented: in particular, the country:capital relation alone constituted over 50% of all word pairs among the semantic categories.
Gladkova et al. (2016) proposed BATS (the Bigger Analogy Test Set), which covers derivational and inflectional morphology as well as lexicographic and encyclopedic semantics in equal measure. Each of these four relation types consists of 10 categories, and each category is represented by 50 unique word pairs, giving a total of 99,200 questions [3] for the vector offset method.

What is jBATS?

jBATS was designed based on BATS and is, to the best of our knowledge, the first analogy test set for Japanese. Similarly to BATS, it features 4 linguistic relation types: (1) derivational morphology, (2) inflectional morphology, (3) lexicographic semantics, and (4) encyclopedic semantics.

jBATS was structured to be:

balanced and representative - similarly to BATS, jBATS offers 4 linguistic relation types, each of which consists of 10 categories featuring 50 distinct pairs (with the exception of the city:prefecture category, which contains 47 pairs, since there are only 47 prefectures). This gives a total of 97,712 questions for the vector offset method.

tokenization friendly - all the pairs were designed so that they can be used with MeCab-like tokenization. Pairs in which tokenization could introduce ambiguity were avoided.

alternative spellings - the correct answers are provided in both kanji and hiragana/katakana forms. For example, 出す is represented as both 出す and だす.

multiple correct answers - similarly to BATS, jBATS does not penalize the model for the complexity of human language. For example, 服, アパレル, and お召し物 are all listed, among others, as hypernyms of スカート.
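A scorer for such a set has to accept any of the listed forms. The snippet below sketches this idea; the data layout and key names are invented for illustration and do not reflect the actual jBATS file format. It also double-checks the question count quoted above:

```python
# Hypothetical layout: each question maps to a SET of acceptable answers,
# covering both multiple correct words and kanji/kana spelling variants.
# This is an illustration only, not the actual jBATS file format.
answers = {
    ("スカート", "hypernyms"): {"服", "アパレル", "お召し物"},
}

def is_correct(prediction, question):
    """A prediction counts as correct if it matches ANY acceptable answer."""
    return prediction in answers[question]

# Question count: 39 categories of 50 pairs plus city:prefecture with 47;
# each ordered pair of distinct word pairs in a category yields one question.
total = 39 * 50 * 49 + 47 * 46
print(total)  # 97712
```

This is why a model is never penalized for answering アパレル where 服 was also acceptable.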

Morphology (inflections)

  verbs
    I01: u-form > a-form (書く : 書か)
    I02: u-form > o-form (受ける : 受けよ)
    I03: u-form > e-form (起きる : 起きれ)
    I04: u-form > te-form (会う : 会っ)
    I05: a-form > o-form (書か : 書こ)
    I06: o-form > e-form (歌お : 歌え)
    I07: e-form > te-form (勝て : 勝っ)
  i-adjectives
    I08: i-form > ku-form (良い : 良く)
    I09: i-form > ta-form (良い : 良かっ)
    I10: ku-form > ta-form (良く : 良かっ)

Morphology (derivation)

  suffix
    D01: na-adj. + 化 (活性 : 活性化)
    D02: i-adj. + さ (良い : 良さ)
    D03: noun + 者 (消費 : 消費者)
    D04: noun + 会 (運動 : 運動会)
    D05: noun/na-adj. + 感 (存在 : 存在感)
    D06: noun/na-adj. + 性 (可能 : 可能性)
    D07: noun/na-adj. + 力 (影響 : 影響力)
  prefix
    D08: 不 + noun (十分 : 不十分)
    D09: 大 + noun/na-adj. (好き : 大好き)
  other
    D10: (in)transitive verb (落ちる : 落とす)

Semantics (lexicography)

  hypernyms
    L01: animals (カニ : 甲殻類 / こうかくるい / ...)
    L02: miscellaneous (椅子 : 家具 / かぐ / ...)
  hyponyms
    L03: miscellaneous (肉 : 牛肉 / ぎゅうにく / ...)
  meronyms
    L04: substance (バッグ : 革 / かわ / ...)
    L05: member (メンバー : クラブ / チーム / ...)
    L06: part-whole (魚 : フィン / 骨 / ...)
  synonyms
    L07: intensity (怖い : 恐ろしい / おそろしい / ...)
    L08: exact (言う : 述べる / のべる / ...)
  antonyms
    L09: gradable (強い : 弱い / よわい / ...)
    L10: binary (大きい : 小さい / ちいさい)

Semantics (encyclopedia)

  geography
    E01: capitals (ロンドン : イギリス / 英国 / えいこく / ...)
    E02: country:language (韓国 : 韓国語 / かんこくご)
    E03: city:prefecture (秩父 : 埼玉県)
  people
    E04: nationalities (ショパン : ポーランド人 / ポーランドじん)
    E05: occupation (アリストテレス : 哲学者 / てつがくしゃ / ...)
  other
    E06: company:product (日産 : 自動車 / じどうしゃ / ...)
    E07: onomatopoeia:feeling (イライラ : 嫌悪 / けんお / ...)
    E08: thing:color (あざ : 青 / あお / ...)
    E09: object:usage (ギター : 弾く / ひく / ...)
    E10: polite terms (言う : おっしゃる / ...)

Performance on jBATS

jBATS was initially used to evaluate the performance of subcharacter- and character-level models for Japanese [6]. These models take advantage of the information contained in Chinese characters, kanji (SG + kanji), and their components, called bushu (SG + kanji + bushu). The overall performance of both models was compared with the traditional Skip-Gram model (SG) and FastText.

Including the bushu information proved especially beneficial for the inflectional and derivational relations, where most tokens are written with a single kanji or a kanji affix related to the word's meaning. The smallest improvement was observed for the lexicographic semantics categories, where the traditional Skip-Gram model performed better in most cases. Semantic relations were also the most difficult to capture, which is consistent with the findings for English in Gladkova et al. (2016) [2] and Drozd et al. (2016) [4]. Moreover, the LRCos method [4] yielded better results overall than 3CosAdd, achieving up to over 36% higher accuracy, which was also shown for English in Drozd et al. (2016) [4]. The overall accuracy on the word analogy task (with the LRCos method) of all the models, including FastText, for different corpus sizes can be found in Karpinska et al. (2018) [6].
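As a rough illustration of how LRCos differs from 3CosAdd: instead of ranking candidates by similarity to an offset vector, LRCos trains a classifier on known words of the target class and scores each candidate by the product of its class probability and its cosine similarity to b. The sketch below uses a minimal NumPy-only logistic regression and toy vectors; all embeddings and word lists are invented for illustration, and real LRCos samples its negative examples from the whole vocabulary:

```python
import numpy as np

# Toy embeddings, made up for illustration only.
emb = {
    "man":      np.array([0.9, 0.1, 0.0]),
    "woman":    np.array([0.1, 0.9, 0.0]),
    "king":     np.array([0.9, 0.1, 0.8]),
    "queen":    np.array([0.1, 0.9, 0.8]),
    "prince":   np.array([0.9, 0.1, 0.3]),
    "princess": np.array([0.1, 0.9, 0.3]),
}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.5, steps=2000):
    """Minimal logistic regression trained by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        g = sigmoid(X @ w + b) - y          # prediction error per example
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def lrcos(a, a_prime, b, positives, negatives):
    """Answer a : a' :: b : ? LRCos-style: argmax P(target class) * cos(., b)."""
    X = np.array([emb[w] for w in positives + negatives])
    y = np.array([1.0] * len(positives) + [0.0] * len(negatives))
    w, b0 = train_logreg(X, y)
    def score(c):
        p = sigmoid(emb[c] @ w + b0)        # "is c in the target class?"
        sim = emb[c] @ emb[b] / (np.linalg.norm(emb[c]) * np.linalg.norm(emb[b]))
        return p * sim
    candidates = [c for c in emb if c not in {a, a_prime, b}]
    return max(candidates, key=score)

# Positives are the known a' words of the category; negatives are other words.
print(lrcos("man", "woman", "king",
            positives=["woman", "princess"], negatives=["man", "prince"]))
```

Because class membership and similarity are scored separately, a near neighbour of king that is not in the target class (e.g. prince) is filtered out by the classifier term.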