Lucene Highlighter Tutorial with Example

This post explains how to implement search-term highlighting using Apache Lucene 5.1, along with example code. Users want to find what they are looking for in as little time as possible, so techniques that help them scan results quickly are important for a good search experience. Highlighting is one of those techniques.

Highlighter performs two functions:

It makes the terms that were part of the user's query bold in the search results, so that the user can identify matches and review each result quickly.

If the document text is long, Highlighter also selects the best fragment of text containing the search keywords, so that the user can read 2-3 lines of the document and decide whether following the link would help.
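To make these two functions concrete before bringing in Lucene, here is a small plain-Java sketch. The class `ToyHighlighter` and its methods are hypothetical names invented for illustration, and the fixed-size word windows are a simplifying assumption; this is not how Lucene's Highlighter is implemented.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy illustration only; NOT how Lucene's Highlighter works internally. */
public class ToyHighlighter {

    /** Function 1: wrap every word that matches a query term in <b>...</b>. */
    public static String boldTerms(String text, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        for (String word : text.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(queryTerms.contains(normalize(word)) ? "<b>" + word + "</b>" : word);
        }
        return out.toString();
    }

    /** Function 2: return the window of words holding the most distinct query terms. */
    public static String bestFragment(String text, Set<String> queryTerms, int windowSize) {
        String[] words = text.split(" ");
        int bestStart = 0;
        int bestScore = -1;
        for (int start = 0; start < words.length; start += windowSize) {
            int end = Math.min(start + windowSize, words.length);
            Set<String> found = new HashSet<>();
            for (int i = start; i < end; i++) {
                String word = normalize(words[i]);
                if (queryTerms.contains(word)) {
                    found.add(word);
                }
            }
            if (found.size() > bestScore) {
                bestScore = found.size();
                bestStart = start;
            }
        }
        int end = Math.min(bestStart + windowSize, words.length);
        return String.join(" ", Arrays.copyOfRange(words, bestStart, end));
    }

    /** Lowercase the word and strip punctuation so "Java," matches "java". */
    private static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("java", "inheritance"));
        String text = "This post is long. It covers many topics. "
                + "Inheritance in Java is explained with examples.";
        // Pick the best 8-word window first, then bold the query terms inside it.
        System.out.println(boldTerms(bestFragment(text, terms, 8), terms));
    }
}
```

Lucene does the same two jobs with proper analysis, scoring and fragmenting, which the rest of this post walks through.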

If you use Google, you must have noticed that it highlights the query keywords in bold and also selects a particular fragment of text from the description it has stored for each article, as shown below:

Notice that there are three query terms: java, inheritance and bitspedia. In the result, Google has highlighted these terms in the URL and description. In this article we want to achieve the same functionality using the Lucene search engine library.

Lucene Indexing Process

To search something using Apache Lucene, we first need to create an index of the data. We then run the search operation on that index. So let's first create an index of some data:

The Indexer constructor instantiates the IndexWriter object that is used to create the index. The Analyzer helps to create the right tokens, or keywords, from the given text. Without an Analyzer, IndexWriter can't create the index. For example, the index at the end of a book contains the keywords used in the book, so keywords must be identified before indexing can happen.

Apache Lucene provides different types of Analyzers and a mechanism to plug in custom ones. StandardAnalyzer extracts tokens from the text, lowercases them, and eliminates common words and punctuation, which makes it very helpful for common search cases.
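The book-index analogy and the analysis steps above (tokenize, lowercase, drop common words) can be illustrated with a plain-Java sketch. `ToyIndex` and its tiny stop-word list are hypothetical and exist only to show the idea; Lucene's StandardAnalyzer and index format are considerably more sophisticated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy analyzer plus inverted index; an illustration, not Lucene's implementation. */
public class ToyIndex {

    // Assumed stop-word list for this sketch only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "of", "and"));

    // keyword -> IDs of the documents containing it, like the index of a book
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    /** Lowercase the text, split on non-alphanumeric characters, drop stop words. */
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    /** Record every analyzed token of the text under the given document ID. */
    public void add(int docId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, key -> new TreeSet<>()).add(docId);
        }
    }

    /** Return the IDs of the documents that contain the term. */
    public Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(1, "Java in Action");
        index.add(2, "Lucene in Action");
        System.out.println(analyze("The Art of Java!"));
        System.out.println(index.lookup("action"));
    }
}
```

Because keywords are identified once at indexing time, a lookup never has to rescan the original text, which is why the Analyzer is mandatory for IndexWriter.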

The createIndex method actually creates the index using indexWriter and the data, given in the form of Document objects. Document is a class provided by Lucene; we create Document objects and pass them to the indexWriter object. Each Document can consist of multiple fields. I have added only one TextField to keep the example simple. Later we create the Indexer object and invoke its createIndex method to create the index. Here is how we would do the indexing using the code created above:
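A minimal sketch of the Indexer described above, together with its usage, might look like this. It assumes the Lucene 5.x API (lucene-core and lucene-analyzers-common on the classpath). The `titleDocument` helper, the `"index"` directory name and the sample titles are made up for illustration and are not taken from the original listing.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Indexer implements AutoCloseable {

    private final Directory directory;
    private final IndexWriter indexWriter;

    /** Opens (and overwrites) the index stored at the given directory path. */
    public Indexer(String indexDirectoryPath) throws IOException {
        directory = FSDirectory.open(Paths.get(indexDirectoryPath));
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        // Assumption: start from an empty index on every run.
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        indexWriter = new IndexWriter(directory, config);
    }

    /** Adds the documents to the index and commits them to disk. */
    public void createIndex(List<Document> documents) throws IOException {
        for (Document document : documents) {
            indexWriter.addDocument(document);
        }
        indexWriter.commit();
    }

    /** Hypothetical helper: a Document with a single stored, analyzed "title" field. */
    public static Document titleDocument(String title) {
        Document document = new Document();
        document.add(new TextField("title", title, Field.Store.YES));
        return document;
    }

    @Override
    public void close() throws IOException {
        indexWriter.close();
        directory.close();
    }

    public static void main(String[] args) throws IOException {
        // Sample titles are invented for this sketch.
        try (Indexer indexer = new Indexer("index")) {
            indexer.createIndex(Arrays.asList(
                    titleDocument("Java in Action"),
                    titleDocument("Lucene in Action")));
        }
    }
}
```

The field is stored (`Field.Store.YES`) because the highlighter later needs the original title text, not just the indexed tokens.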

Lucene Search Process

Let's make a Search component that we can use to search keywords in the index created above. The primary class used to search the index is IndexSearcher. We instantiate our Searcher by passing INDEX_DIRECTORY_PATH; it opens the index at that path and wraps it in an IndexSearcher. Then we can search the information placed in that index using keywords. The code below creates the IndexSearcher and exposes two methods: search (to search) and getDocument (to retrieve a specific document by ID).

The Searcher class constructor instantiates the IndexSearcher object on the index we created earlier. The search method receives a Query and an integer parameter that represents the maximum number of documents to retrieve. It returns document IDs along with their relevance scores, but not the actual Documents. IndexSearcher's doc method, which takes a document ID, is used to retrieve the actual Document.
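A sketch of such a Searcher, assuming the Lucene 5.x API, could look like the following. The `close` method is an addition of this sketch so that the underlying reader can be released; the rest follows the description above.

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class Searcher implements AutoCloseable {

    private final IndexReader indexReader;
    private final IndexSearcher indexSearcher;

    /** Opens the existing index at the given path and wraps it in an IndexSearcher. */
    public Searcher(String indexDirectoryPath) throws IOException {
        indexReader = DirectoryReader.open(FSDirectory.open(Paths.get(indexDirectoryPath)));
        indexSearcher = new IndexSearcher(indexReader);
    }

    /** Returns at most maxResults hits (document IDs plus relevance scores). */
    public TopDocs search(Query query, int maxResults) throws IOException {
        return indexSearcher.search(query, maxResults);
    }

    /** Loads the stored fields of the document with the given ID. */
    public Document getDocument(int docId) throws IOException {
        return indexSearcher.doc(docId);
    }

    @Override
    public void close() throws IOException {
        indexReader.close();
    }
}
```

Each hit in `TopDocs.scoreDocs` carries a `doc` ID and a `score`; passing that ID to `getDocument` gives back the stored "title" field.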

Let's prepare the IndexSearcher by adding another method to the LuceneHighlighter class:

In the above code, I created a Searcher object by passing INDEX_DIRECTORY_PATH. The QueryParser converts the query string into a Query object that Lucene understands. Three pieces of information are important from a querying perspective:

1. The Analyzer object, so that Lucene can analyze the query string
2. The name of the field to search, in our case "title"
3. The actual query string, "java action" in the code above
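The three inputs listed above come together in a few lines. This sketch assumes the classic QueryParser from the Lucene 5.x lucene-queryparser module; the class name `QueryParsingSketch` is hypothetical.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class QueryParsingSketch {

    /** Turns a user-typed query string into a Lucene Query on the given field. */
    public static Query parse(String fieldName, String queryString) throws ParseException {
        // 1. The analyzer; use the same kind that was used at indexing time.
        Analyzer analyzer = new StandardAnalyzer();
        // 2. The default field searched when the query names no field.
        QueryParser parser = new QueryParser(fieldName, analyzer);
        // 3. The actual query string.
        return parser.parse(queryString);
    }

    public static void main(String[] args) throws ParseException {
        Query query = parse("title", "java action");
        // Printing the Query shows how Lucene interpreted the string.
        System.out.println(query);
    }
}
```

Using the same analyzer for indexing and querying matters: if titles are lowercased when indexed, the query terms must be lowercased the same way or they will not match.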

We pass this query to the searcher's search method, which returns the TopDocs. By default, Apache Lucene sorts the returned results by relevance. We can change the sort order, which we will see in a different article. So far the code searches the titles but does not highlight the search keywords. Now we are ready to discuss the core objective, i.e. how to use Lucene Highlighter.

Lucene Highlighter

Let's add another method, highlightSearchKeywords, to our LuceneHighlighter class. It uses scoreDocs, the query and the other Lucene components that provide highlighting. Let's first see the code sample; then I will explain how it works and the purpose of the different classes used:

First, you must understand that Highlighter not only highlights keywords but also selects the best text fragment when the field value (e.g. "title") is large. In STEP A, I use the query object that I have already discussed. A Scorer is responsible for scoring a stream of tokens; the QueryScorer implementation scores text fragments by the number of unique query terms found. Then I create a Fragmenter, which breaks the text into multiple fragments for Highlighter to consider; Highlighter later chooses the best fragment to show. Finally, I create the Highlighter object using the Scorer and Fragmenter objects. The takeaway is that Highlighter chooses the best text fragment to show and also highlights the keywords in that fragment.

In STEP B, I create the IndexReader, which is used to read the index, as explained in the Searcher section above.

In STEP C, we do the actual work. I use the indexReader, scoreDoc, field name, document and analyzer to create the token stream that Highlighter uses (in addition to the actual content) to identify the best fragment of text. Here is the code snippet that runs the whole process.
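The three steps can be sketched end to end as follows. This is a self-contained approximation under stated assumptions, not the original listing: it targets the Lucene 5.1 API (lucene-core, lucene-analyzers-common, lucene-queryparser and lucene-highlighter with its dependencies), uses an in-memory RAMDirectory instead of INDEX_DIRECTORY_PATH so it runs without setup, and the class name, sample titles, fragment size of 40 and `<b>` tags are all choices made for illustration. `TokenSources.getAnyTokenStream` is available in 5.1 but is not guaranteed to exist in later releases.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.search.highlight.TokenSources;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class LuceneHighlighterSketch {

    /** Indexes the titles in memory, searches them, and returns the highlighted hits. */
    public static List<String> highlight(String queryString, String... titles) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        Directory directory = new RAMDirectory(); // assumption: in-memory index for the demo
        try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer))) {
            for (String title : titles) {
                Document document = new Document();
                document.add(new TextField("title", title, Field.Store.YES));
                writer.addDocument(document);
            }
        }
        Query query = new QueryParser("title", analyzer).parse(queryString);

        // STEP A: scorer, fragmenter and highlighter are all built from the query.
        QueryScorer scorer = new QueryScorer(query);
        Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 40);
        Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter("<b>", "</b>"), scorer);
        highlighter.setTextFragmenter(fragmenter);

        List<String> results = new ArrayList<>();
        // STEP B: open an IndexReader on the index and search it.
        try (IndexReader indexReader = DirectoryReader.open(directory)) {
            IndexSearcher searcher = new IndexSearcher(indexReader);
            for (ScoreDoc scoreDoc : searcher.search(query, 10).scoreDocs) {
                // STEP C: build a token stream for the field and pick the best fragment.
                Document document = searcher.doc(scoreDoc.doc);
                String title = document.get("title");
                TokenStream tokenStream = TokenSources.getAnyTokenStream(
                        indexReader, scoreDoc.doc, "title", document, analyzer);
                results.add(highlighter.getBestFragment(tokenStream, title));
            }
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        // Sample titles are invented for this sketch.
        for (String hit : highlight("java action", "Java in Action", "Lucene in Action")) {
            System.out.println(hit);
        }
    }
}
```

Note that `getBestFragment` returns null when none of the query terms occur in the text, so code that highlights fields other than the one searched should be prepared to fall back to the plain field value.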