Experimentally, we show that our method works well with different state-of-the-art word vector models, using different kinds of semantic lexicons, and gives substantial improvements on a variety of benchmarks, while beating the current state-of-the-art approaches for incorporating semantic information in vector training, and trivially extends to multiple languages. We show that retrofitting gives consistent improvement in performance on evaluation benchmarks with different word vector lengths, and we show a qualitative visualization of the effect of retrofitting on word vector quality. The retrofitting tool is available at: https://github.com/mfaruqui/retrofitting.

2 Retrofitting with Semantic Lexicons

Let V = {w_1, ..., w_n} be a vocabulary, i.e., the set of word types, and let Ω be an ontology that encodes semantic relations between words in V. We represent Ω as an undirected graph (V, E) with one vertex for each word type and edges (w_i, w_j) ∈ E ⊆ V × V indicating a semantic relationship of interest. These relations differ for different semantic lexicons and are described later (§4).

The matrix Q̂ will be the collection of vector representations q̂_i ∈ ℝ^d, for each w_i ∈ V, learned using a standard data-driven technique, where d is the length of the word vectors. Our objective is to learn the matrix Q = (q_1, ..., q_n) such that the columns are both close (under a distance metric) to their counterparts in Q̂ and to adjacent vertices in Ω. Figure 1 shows a small word graph with such edge connections; white nodes are labeled with the Q vectors to be retrofitted (and correspond to V_Ω); shaded nodes are labeled with the corresponding vectors in Q̂, which are observed. The graph can be interpreted as a Markov random field (Kindermann and Snell, 1980).

[Figure 1: Word graph with edges between related words showing the observed (grey) and the inferred (white) word vector representations.]

The distance between a pair of vectors is defined to be the Euclidean distance. Since we want the inferred word vector to be close to the observed value q̂_i and close to its neighbors q_j, ∀j such that (i, j) ∈ E, the objective to be minimized becomes:

\Psi(Q) = \sum_{i=1}^{n} \left[ \alpha_i \lVert q_i - \hat{q}_i \rVert^2 + \sum_{(i,j) \in E} \beta_{ij} \lVert q_i - q_j \rVert^2 \right]

where α and β values control the relative strengths of associations (more details in §6.1).
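To make the objective concrete, here is a minimal NumPy sketch that evaluates Ψ(Q) for a given assignment of vectors. The array and edge-list representations, and the name retrofit_objective, are illustrative choices of ours, not part of the released tool:

```python
import numpy as np

def retrofit_objective(Q, Q_hat, edges, alpha, beta):
    """Evaluate the retrofitting objective Psi(Q).

    Q, Q_hat : (n, d) arrays of retrofitted and original vectors.
    edges    : iterable of (i, j) index pairs; assumed to list every
               pair contributing to the inner sum of the objective.
    alpha    : length-n array weighting fit to the original vectors.
    beta     : dict mapping (i, j) -> edge weight.
    """
    # sum_i alpha_i * ||q_i - q_hat_i||^2 : stay close to observed vectors
    fit = np.sum(alpha * np.sum((Q - Q_hat) ** 2, axis=1))
    # sum_(i,j) beta_ij * ||q_i - q_j||^2 : pull graph neighbours together
    smooth = sum(beta[(i, j)] * np.sum((Q[i] - Q[j]) ** 2)
                 for (i, j) in edges)
    return fit + smooth
```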

In this case, we first train the word vectors independent of the information in the semantic lexicons and then retrofit them. Ψ is convex in Q, and its solution can be found by solving a system of linear equations. To do so, we use an efficient iterative updating method (Bengio et al., 2006; Subramanya et al., 2010; Das and Petrov, 2011; Das and Smith, 2011). The vectors in Q are initialized to be equal to the vectors in Q̂. We take the first derivative of Ψ with respect to one q_i vector, and by equating it to zero arrive at the following online update:

q_i = \frac{\sum_{j:(i,j) \in E} \beta_{ij} q_j + \alpha_i \hat{q}_i}{\sum_{j:(i,j) \in E} \beta_{ij} + \alpha_i} \quad (1)

In practice, running this procedure for 10 iterations converges to changes in Euclidean distance of adjacent vertices of less than 10^{-2}. The retrofitting approach described above is modular; it can be applied to word vector representations obtained from any model, as the updates in Eq. 1 are agnostic to the original vector training model objective.
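The full procedure follows directly from Eq. 1: initialize Q to Q̂ and sweep the online update over the vocabulary for a fixed number of iterations. The sketch below is our reading of that procedure (the adjacency-list representation and function name are assumptions; the tool at the URL above is the reference implementation):

```python
import numpy as np

def retrofit(Q_hat, neighbours, alpha, beta, n_iters=10):
    """Retrofit word vectors with the online update of Eq. 1.

    Q_hat      : (n, d) array of pre-trained vectors (kept fixed).
    neighbours : list mapping word index i -> list of adjacent indices j.
    alpha      : length-n array; beta: dict (i, j) -> weight, assumed
                 to hold an entry for each directed pair.
    """
    Q = Q_hat.copy()  # vectors in Q are initialised to those in Q_hat
    for _ in range(n_iters):  # ~10 sweeps suffice in practice
        for i, adj in enumerate(neighbours):
            if not adj:  # words absent from the lexicon keep q_i = q_hat_i
                continue
            # Eq. 1: (sum_j beta_ij q_j + alpha_i q_hat_i)
            #        / (sum_j beta_ij + alpha_i)
            num = alpha[i] * Q_hat[i]
            den = alpha[i]
            for j in adj:
                num = num + beta[(i, j)] * Q[j]
                den += beta[(i, j)]
            Q[i] = num / den  # minimiser of Psi w.r.t. q_i, others fixed
    return Q
```

Because each update solves for the exact minimizer of Ψ in one coordinate block with the others held fixed, the sweep is a coordinate-descent pass on a convex objective, which is why the few-iteration convergence reported above is plausible.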
Semantic Lexicons during Learning. Our proposed approach is reminiscent of recent work on improving word vectors using lexical resources (Yu and Dredze, 2014; Bian et al., 2014; Xu et al., 2014), which alters the learning objective of the original vector training model with a prior (or a regularizer) that encourages semantically related vectors (in Ω) to be close together, except that our technique is applied as a second stage of learning. We describe the