Lucene 4 is Super Convenient for Developing NLP Tools

26/10/2012

Author: Koji Sekiguchi

The other day, I wrote a system that automatically acquires synonym knowledge from a dictionary corpus. A dictionary corpus is a collection of entries, each consisting of a “keyword” and its “description”. Put simply, it’s a dictionary. Familiar examples are electronic dictionaries and Wikipedia data. The combination of “item name” and “item description” on e-commerce (EC) sites can also be regarded as a dictionary corpus.

I originally wrote this system because I wanted to use Wikipedia to automatically create the synonyms.txt file used by the SynonymFilter of Lucene/Solr. The system outputs a CSV file that SynonymFilter can use, and the system itself also uses Lucene 4.0 internally.

I had always thought that Lucene 4.0 would be convenient for developing NLP tools, and building this system confirmed that impression.

The Lucene 4.0 classes that I used to develop this system are as follows:

IndexSearcher, TermQuery, TopDocs
This system calculates the similarity between a keyword and each of its synonym candidates, which are nouns extracted from keywords and their descriptions. If the similarity is greater than a threshold value, the system treats the candidate as a synonym of the keyword and outputs it to a CSV file.
But how do I calculate the similarity between a keyword and its synonym candidate? The system compares the keyword’s description Aa with the set of dictionary entry descriptions {Ab} that are written using the synonym candidate.
Thus, I have to find {Ab}, and this is where I used classes such as IndexSearcher, TermQuery, and TopDocs to search the description field for the synonym candidate.
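
The lookup of {Ab} can be pictured with a small plain-Java stand-in. This is only a sketch under assumptions, not the code of the system: the class DescriptionIndex and its methods are hypothetical names, lowercasing and whitespace splitting replace a real analyzer, and a HashMap plays the role that the Lucene index plays when a TermQuery on the description field is passed to IndexSearcher.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the Lucene index: maps each word of a
// description to the ids of the dictionary entries that contain it.
final class DescriptionIndex {
    private final List<String> keywords = new ArrayList<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Adds one dictionary entry (keyword + description) and returns its id.
    int add(String keyword, String description) {
        int id = keywords.size();
        keywords.add(keyword);
        // Assumption: lowercase + whitespace splitting instead of a real analyzer.
        for (String word : description.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(word, w -> new LinkedHashSet<>()).add(id);
        }
        return id;
    }

    // Returns the keywords of all entries whose description mentions the
    // candidate, i.e. the entries that make up {Ab}, in insertion order.
    List<String> entriesMentioning(String candidate) {
        List<String> hits = new ArrayList<>();
        Set<Integer> ids = postings.get(candidate.toLowerCase(Locale.ROOT));
        if (ids != null) {
            for (int id : ids) {
                hits.add(keywords.get(id));
            }
        }
        return hits;
    }
}
```

For example, after adding the entries (“Lucene”, “search library written in java”) and (“Solr”, “search server built on lucene”), calling entriesMentioning("search") returns both keywords, while entriesMentioning("lucene") returns only “Solr”.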

PriorityQueue
Next, I have to pick out “feature words” from Aa and {Ab} to calculate the similarity of the two. To do so, I select the N most important words to build a feature vector, using the TF*IDF of each word as its degree of importance. See the above SlideShare for the details. I use PriorityQueue to select the “N most important words”.
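
Selecting the N most important words with a priority queue can be sketched as follows. This is a minimal illustration under assumptions rather than the code of the system: it uses java.util.PriorityQueue from the JDK as a stand-in so that it runs without Lucene on the classpath, the class name TopFeatures and its select method are hypothetical, and the weights are assumed to have been computed beforehand.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper: keeps only the n heaviest words of a weight map.
final class TopFeatures {
    // Returns the n words with the highest weights, heaviest first.
    static List<String> select(Map<String, Double> weights, int n) {
        List<String> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Min-heap on weight: the head is always the weakest word kept so far.
        Comparator<Map.Entry<String, Double>> byWeight =
            (a, b) -> Double.compare(a.getValue(), b.getValue());
        PriorityQueue<Map.Entry<String, Double>> heap = new PriorityQueue<>(byWeight);
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // drop the weakest word so the heap never exceeds n
            }
        }
        // The heap yields the lightest word first, so reverse at the end.
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }
}
```

Because the heap never holds more than N entries, memory stays bounded by N no matter how many distinct words the descriptions contain. The order of words with equal weights is not defined in this sketch.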

DocsEnum, TotalHitCountCollector
I used TF*IDF as the weight for extracting the feature words above, and used DocsEnum.freq() to obtain TF. docFreq (the number of articles that include the synonym candidate), which is required to obtain IDF, is calculated by passing a TotalHitCountCollector to the search() method of IndexSearcher.
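
The weight itself is simple arithmetic once TF and docFreq are known. The sketch below assumes the plain variant tf × ln(numDocs / docFreq); the post does not state which TF*IDF variant the system uses, so the formula, the class name TfIdf, and the parameter numDocs (the total number of articles in the index) are assumptions made for illustration.

```java
// Hypothetical helper: computes a TF*IDF weight from raw counts.
final class TfIdf {
    // tf:      occurrences of the word in one description
    //          (the value the post obtains from DocsEnum.freq())
    // docFreq: number of articles that contain the word
    // numDocs: total number of articles in the index (assumed to be known)
    static double weight(int tf, int docFreq, int numDocs) {
        if (tf <= 0 || docFreq <= 0 || numDocs <= 0) {
            return 0.0; // the word is absent or the counts are unusable
        }
        // Assumed variant: tf * ln(numDocs / docFreq).
        return tf * Math.log((double) numDocs / docFreq);
    }
}
```

With this variant, a word that appears in every article gets a weight of 0, and for the same TF a rare word receives a larger weight than a common one, which is the property the feature word selection relies on.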

Terms, TermsEnum
I use these classes to search the “description” field for synonym candidates.

These are examples of how this system uses Lucene 4.0. I believe Lucene will be a great help to other NLP tool developers as well. In lexical knowledge acquisition by bootstrapping, for example, a cycle (1: pattern extraction, 2: pattern selection, 3: instance extraction, 4: instance selection) is used to obtain knowledge from a small number of seed instances. I believe that pattern extraction and instance extraction can be replaced with simple search tasks if you use Lucene for them.