With a field of this text field type, you can search documents without hassle whether you specify “International Monetary Fund” or “IMF”: documents containing either term will match.

Synonym search is certainly convenient. However, for users to enjoy these convenient search functions, an administrator has to provide them with a synonym dictionary (CSV file) as described above. New words are coined every day, and so are new synonyms. A synonym dictionary may be prepared with great effort by a person in charge, but it is often left unmaintained as time goes by or when that person's position is handed over.

That is why people long for automatic creation of synonym dictionaries. That demand drove me to write the system explained below. The system learns synonym knowledge from a “dictionary corpus” and outputs “original word – synonym” combinations of high similarity to a CSV file, which in turn can be applied to the SynonymFilter of Lucene/Solr as is.

A “dictionary corpus” is a corpus whose entries consist of a “keyword” and its “description”. An electronic dictionary is exactly such a corpus, and so is Wikipedia, which is familiar and easily accessible.

Let’s look at a method that uses the Japanese version of Wikipedia to acquire synonym knowledge automatically.

System Architecture

The following is the architecture of this system.

This system first registers the contents of the dictionary corpus in a Lucene index and then analyzes that index. Give the index “keyword” and “description” fields, and enable term vectors on the description field. With Solr, you should be able to prepare this portion easily without writing any program. From here on, I will focus on how to analyze the index in which the dictionary corpus is registered.
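As a sketch, the two fields might be declared as follows in Solr's schema.xml. The field type names here are assumptions rather than settings from the original system (text_ja is the Japanese text type shipped with Solr's example schema); the essential point is termVectors="true" on the description field.

```xml
<!-- keyword: the dictionary headword, indexed as a single untokenized string -->
<field name="keyword" type="string" indexed="true" stored="true"/>
<!-- description: the entry body; term vectors are stored for the analysis step -->
<field name="description" type="text_ja" indexed="true" stored="true" termVectors="true"/>
```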

To summarize, the analysis program repeats the following process for every keyword.

1. Take the next keyword tA and scan each word tR in its description field.
2. Check whether tR satisfies the acronym-like conditions described below; if it does, treat it as synonym candidate tB.
3. Calculate the semantic similarity S between tA and tB.
4. If similarity S is greater than an appropriate threshold value, write the pair (tA, tB) to a CSV file.

The following sections describe each step in detail.

Identifying Synonym Candidate tB

Synonym entries output by this system take the format “original word, abbreviation 1, abbreviation 2, …”, where “original word” is the string in the “keyword” field and each “abbreviation n” is an abbreviation of that keyword found in a “description” field. Japanese abbreviations are similar to English acronyms. Unlike English, however, Japanese has a great many abbreviations, as it has far more characters than English.

This system compares keyword tA with each word tR in the description field and verifies whether the following acronym-like conditions are met.

The first letters of tA and tR match.

tA does not contain tR as a contiguous substring.

All the characters that make up tR appear in tA, and in the same order.

The system discards tR and moves on to the next word if any one of these conditions is not satisfied. Even when all the conditions are met, however, the surface forms may have matched by coincidence. The system therefore treats such a tR only as a synonym candidate tB and proceeds to the next step, where its semantic similarity is calculated.
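The three conditions can be sketched in plain Java as follows. This is a minimal illustration written for this article; the class and method names are hypothetical, not taken from the original code.

```java
class AbbrevCheck {
    static boolean isAbbreviationCandidate(String tA, String tR) {
        if (tA.isEmpty() || tR.isEmpty()) return false;
        // Condition 1: the first letters of tA and tR match.
        if (tA.charAt(0) != tR.charAt(0)) return false;
        // Condition 2: tA does not contain tR as a contiguous substring.
        if (tA.contains(tR)) return false;
        // Condition 3: all characters of tR appear in tA in the same order,
        // i.e. tR is a subsequence of tA.
        int i = 0;
        for (int j = 0; j < tA.length() && i < tR.length(); j++) {
            if (tA.charAt(j) == tR.charAt(i)) i++;
        }
        return i == tR.length();
    }

    public static void main(String[] args) {
        System.out.println(isAbbreviationCandidate("入国管理局", "入管")); // true
        System.out.println(isAbbreviationCandidate("文房具", "文房"));     // false: contiguous substring of tA
    }
}
```

For example, 入管 passes: it shares the first character 入 with 入国管理局, is not a contiguous substring of it, and its characters 入 and 管 appear in order.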

Calculating Semantic Similarity S

The system next needs the semantic similarity S(tA, tB) between keyword tA and synonym candidate tB, but it cannot calculate S directly. Instead of S, the system calculates the similarity S*(AA, {AB}) between AA, the article of the keyword, and {AB}, the set of articles retrieved with synonym candidate tB. The article similarity S* is derived as the cosine similarity between the term vectors xA and xB.

Note that if {AB} = ∅, then S* = 1.
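As a sketch of this portion under simplified assumptions: in the real system the term vectors of the articles in {AB} come out of the Lucene index, but here they are simply passed in as Map objects of term → weight, and accumulated into a single articleVector. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ArticleVector {
    // Accumulate the term vectors of all articles in {AB} into one vector.
    // When {AB} is empty, the result is an empty vector, for which S* is
    // defined as 1 in the text above.
    static Map<String, Float> accumulate(List<Map<String, Float>> vectorsOfAB) {
        Map<String, Float> articleVector = new HashMap<>();
        for (Map<String, Float> x : vectorsOfAB)
            for (Map.Entry<String, Float> e : x.entrySet())
                articleVector.merge(e.getKey(), e.getValue(), Float::sum);
        return articleVector;
    }
}
```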

To calculate the set {AB} of articles in which synonym candidate tB appears, the system creates a TermQuery for the candidate on the description field fieldNameDesc and runs a search. When the search result is returned as topDocs, the system collects the matching articles, excluding AA, as the set {AB}. It does not keep {AB} itself, however, but converts each article into a term vector and accumulates them in articleVector. The system then calculates the cosine of originVector, the term vector of keyword article AA, and articleVector, the accumulated term vector of the article set {AB}.
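The cosine calculation can be sketched as below, with the term vectors represented as Map objects of term → weight as in the rest of this article. The class name is hypothetical; the empty-vector case implements the rule S* = 1 when {AB} = ∅.

```java
import java.util.Map;

class Cosine {
    static double cosine(Map<String, Float> originVector, Map<String, Float> articleVector) {
        // S* is defined as 1 when {AB} is empty, i.e. articleVector has no elements.
        if (articleVector.isEmpty()) return 1.0;
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Float> e : originVector.entrySet()) {
            Float w = articleVector.get(e.getKey());
            if (w != null) dot += e.getValue() * w; // shared terms contribute to the dot product
            normA += e.getValue() * e.getValue();
        }
        for (float w : articleVector.values()) normB += w * w;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

A vector compared with itself yields 1, and vectors with no shared terms yield 0.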

The return value of getFeatureVector is of type Map, so the weight for each String term is expressed as a Float. The stopWord argument receives synonym candidate tB, ensuring that the elements of term vector x_docId do not include tB. The size argument specifies the number of term vector elements: the system collects that many elements in descending order of weight (tf*idf). For this, it uses the class TermVectorEntityQueue, an extension of Lucene's PriorityQueue.
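The original TermVectorEntityQueue is not shown here, but what the size and stopWord handling amounts to can be sketched with java.util.PriorityQueue standing in for Lucene's. The class and method names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class TopTerms {
    static Map<String, Float> topTerms(Map<String, Float> weights, String stopWord, int size) {
        // Min-heap on weight: the lowest-weighted kept term is evicted first,
        // so only the `size` largest weights survive.
        PriorityQueue<Map.Entry<String, Float>> queue =
                new PriorityQueue<>((a, b) -> Float.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Float> e : weights.entrySet()) {
            if (e.getKey().equals(stopWord)) continue; // exclude synonym candidate tB
            queue.add(e);
            if (queue.size() > size) queue.poll(); // drop the lowest weight
        }
        Map<String, Float> result = new HashMap<>();
        for (Map.Entry<String, Float> e : queue) result.put(e.getKey(), e.getValue());
        return result;
    }
}
```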

Finally, if the calculated S* is larger than the threshold value (minScore), the system decides that keyword tA and synonym candidate tB are semantically similar and outputs the pair (tA, tB) to a CSV file.
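The pairs accepted for one keyword end up as a single CSV line in the “original word, abbreviation 1, abbreviation 2, …” format that SynonymFilter consumes. A minimal sketch of assembling such a line (the class and method names are hypothetical):

```java
import java.util.List;
import java.util.StringJoiner;

class SynonymCsv {
    // Build one line of the synonym CSV: "original word,abbreviation 1,abbreviation 2,..."
    static String toCsvLine(String keyword, List<String> synonyms) {
        StringJoiner line = new StringJoiner(",");
        line.add(keyword);
        for (String s : synonyms) line.add(s);
        return line.toString();
    }
}
```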

Examples of Acquired Synonym Knowledge

When this program was executed against the Japanese Wikipedia, it obtained about 11 thousand synonym entries from about 850 thousand items. The following are some Japanese synonyms actually obtained (English glosses added).

入国管理局, 入管 (Immigration Bureau)
文房具, 文具 (stationery)
社員食堂, 社食 (staff cafeteria)
国際連盟, 国連 (League of Nations)
リポビタンD, リポD (Lipovitan D)
…

The Japanese Wikipedia also has alphabetic entries, and the system obtained English synonym knowledge from them as well.

Applying to Other Languages and Corpora

This program is designed to be pluggable so that it can obtain synonym knowledge from any general “dictionary corpus”. This article described how to produce a synonym CSV file from the Japanese Wikipedia, but the method is applicable with little modification to Wikipedias in other languages, including English, which has the notion of acronyms. Of course, “dictionary corpora” other than Wikipedia, such as electronic dictionaries and encyclopedias as well as catalogs from the manufacturing industry, are applicable too, giving this program a wide range of applications.