Using Lucene in Grails

Apache Lucene is the leading open source search library and is used in many businesses, projects and products. Its sub-projects provide additional functionality, such as the Nutch web crawler and the Solr search server. This article gives an introduction to Lucene, a tutorial on three Grails Lucene plugins and a comparison between them.

Lucene

Apache Lucene Core provides a Java-based indexing and search implementation. In essence, the index is made up of documents, and each document is composed of fields.

Indexing

When you index a document, each field is processed by a tokenizer that breaks its text into terms. Figure 1 shows how a whitespace tokenizer would split “Mary had a little lamb” into the tokens: Mary, had, a, little, lamb.

Figure 1: Whitespace tokenization
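To make the tokenization step concrete, here is a minimal plain-Java sketch of what a whitespace tokenizer does. It is a stand-in written for this article, not Lucene's own tokenizer class, and the class name WhitespaceTokenizerSketch is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a plain-Java stand-in for a whitespace tokenizer.
// This is not Lucene's implementation; the class name is hypothetical.
final class WhitespaceTokenizerSketch {

    // Splits the text on runs of whitespace and returns the tokens in order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [Mary, had, a, little, lamb]
        System.out.println(tokenize("Mary had a little lamb"));
    }
}
```

Note that case and punctuation are left untouched at this stage; that is the job of the filters that follow.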

The terms may undergo further analysis through TokenFilter classes. Figure 2 shows a stop words filter removing common terms that would skew relevancy.

Figure 2: Stopwords
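A stop words filter can be pictured as a simple pass over the token stream. The sketch below is plain Java rather than a Lucene TokenFilter, and the stop list is a small hypothetical one chosen for the example; real analyzers ship their own lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: drops common words from a list of tokens.
final class StopWordFilterSketch {

    // Hypothetical stop list for this example, not Lucene's default set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Keeps every token whose lower-cased form is not in the stop list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Prints [Mary, had, little, lamb]
        System.out.println(removeStopWords(
                Arrays.asList("Mary", "had", "a", "little", "lamb")));
    }
}
```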

Figure 3 shows the PorterStemFilter applying the (English) Porter stemming algorithm to reduce tokens to their word stems, so that different forms of a word are treated as the same term. This TokenFilter needs to work on lower case input, so a LowerCaseFilter or LowerCaseTokenizer should be applied before the stemming.

Figure 3: Word stemming
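The ordering requirement is easier to see with a toy example. The sketch below uses a crude suffix stripper that is deliberately NOT the Porter algorithm (the suffix list is invented for illustration), but it shares the property that matters here: it only recognises lower case suffixes, so upper case input passes through unstemmed.

```java
import java.util.Locale;

// Illustrative only: a crude suffix stripper, NOT the Porter algorithm
// and not Lucene's PorterStemFilter.
final class LowerCaseThenStemSketch {

    // Hypothetical suffix list, checked in order; far simpler than Porter's rules.
    private static final String[] SUFFIXES = {"ment", "ing", "ed", "s"};

    // Strips the first matching lower-case suffix, keeping at least three characters.
    static String crudeStem(String token) {
        for (String suffix : SUFFIXES) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Lower-cases first, then stems.
    static String analyze(String token) {
        return crudeStem(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(analyze("Consignment"));   // consign
        System.out.println(analyze("CONSIGNED"));     // consign
        System.out.println(crudeStem("CONSIGNED"));   // CONSIGNED (suffix not recognised)
    }
}
```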

Lastly, the terms are then mapped to their documents as shown in Figure 4.

Figure 4: Term index
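Conceptually the term index is a map from each term to the set of documents that contain it. The following plain-Java sketch shows that shape; it ignores everything Lucene adds on top (segments, positions, frequencies), and the document contents are hypothetical, chosen only so that the lookups agree with the query examples later in the article.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: an in-memory term -> document ids map.
final class TermIndexSketch {

    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Records that the document contains each of the (already analyzed) terms.
    void add(int docId, List<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the ids of the documents containing the term, or an empty set.
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        TermIndexSketch index = new TermIndexSketch();
        // Hypothetical documents, not the actual contents of Figure 4.
        index.add(1, Arrays.asList("consign", "parcel"));
        index.add(4, Arrays.asList("consign", "ship"));
        index.add(7, Arrays.asList("consign", "return"));
        System.out.println(index.lookup("consign")); // [1, 4, 7]
        System.out.println(index.lookup("ship"));    // [4]
    }
}
```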

Index updates

Updates to index documents are handled as a delete operation followed by an add operation. Over time the index segments can become fragmented – this can be cured by running an optimization operation to pack the index.
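The delete-then-add behaviour can be sketched against a simple in-memory map. This is only a model of the idea: it does not represent Lucene's segments or the fragmentation that optimization addresses, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: models an update as a delete followed by an add.
final class UpdateAsDeleteThenAddSketch {

    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(int docId, List<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new HashSet<>()).add(docId);
        }
    }

    // Removes the document from every term, then drops terms left with no documents.
    void delete(int docId) {
        for (Set<Integer> ids : postings.values()) {
            ids.remove(Integer.valueOf(docId));
        }
        postings.values().removeIf(Set::isEmpty);
    }

    // There is no in-place edit: the old version goes, the new version is added.
    void update(int docId, List<String> newTerms) {
        delete(docId);
        add(docId, newTerms);
    }

    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        UpdateAsDeleteThenAddSketch index = new UpdateAsDeleteThenAddSketch();
        index.add(1, Arrays.asList("complete", "timesheet"));
        index.update(1, Arrays.asList("submit", "timesheet"));
        System.out.println(index.lookup("complete")); // []
        System.out.println(index.lookup("submit"));   // [1]
    }
}
```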

Querying

Queries in Lucene need to pass through the same analyzers that were used during indexing; otherwise identical terms might not match.

A single word query (TermQuery) requires one lookup in the term index to return the matching documents. For example, querying the term index shown in Figure 4 for ‘consignment’ would return documents 1, 4 and 7.

A two word query (BooleanQuery) requires Lucene to perform two lookups in the term index and then combine the resulting documents on either an AND or an OR basis (from an explicit or default operator). For example, querying the Figure 4 term index for ‘consign AND ship’ would return document 4, whereas ‘consign OR ship’ would return documents 1, 4 and 7.
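In set terms, AND is an intersection of the two lookup results and OR is a union. The sketch below shows this in plain Java; the two posting sets are assumptions inferred from the results quoted above (‘consign’ in documents 1, 4 and 7, ‘ship’ in document 4), and nothing here is Lucene's actual BooleanQuery implementation.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: AND and OR as set intersection and union.
final class BooleanQuerySketch {

    // Documents that appear in both lookup results.
    static SortedSet<Integer> and(Set<Integer> left, Set<Integer> right) {
        SortedSet<Integer> result = new TreeSet<>(left);
        result.retainAll(right);
        return result;
    }

    // Documents that appear in either lookup result.
    static SortedSet<Integer> or(Set<Integer> left, Set<Integer> right) {
        SortedSet<Integer> result = new TreeSet<>(left);
        result.addAll(right);
        return result;
    }

    public static void main(String[] args) {
        // Assumed postings, consistent with the examples in the text.
        Set<Integer> consign = new TreeSet<>(Arrays.asList(1, 4, 7));
        Set<Integer> ship = new TreeSet<>(Arrays.asList(4));
        System.out.println(and(consign, ship)); // [4]
        System.out.println(or(consign, ship));  // [1, 4, 7]
    }
}
```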

A phrase query is denoted by double quotes (e.g. "new york") and matches documents containing a particular sequence of terms. This relies on positional information in the index. Phrase slop, or proximity, is specified using the ~ operator followed by an integer that says how close together the terms in the phrase need to be. For example, "big banana"~5 would match documents containing “big banana”, “big green banana” and “big straight yellow banana”.
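A simplified way to think about slop for a two term phrase is the number of positions separating the terms. The sketch below only handles two terms appearing in order, which is an assumption made to keep it short; Lucene's real slop calculation is more general and also allows terms to be reordered at a cost.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: in-order, two-term phrase matching with slop.
final class PhraseSlopSketch {

    // True if 'second' follows 'first' with at most 'slop' tokens in between.
    static boolean matches(List<String> docTokens, String first, String second, int slop) {
        for (int i = 0; i < docTokens.size(); i++) {
            if (!docTokens.get(i).equals(first)) {
                continue;
            }
            for (int j = i + 1; j < docTokens.size(); j++) {
                if (docTokens.get(j).equals(second) && (j - i - 1) <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("big", "straight", "yellow", "banana");
        System.out.println(matches(doc, "big", "banana", 5)); // true
        System.out.println(matches(doc, "big", "banana", 0)); // false
    }
}
```

With a slop of 0 this behaves like an exact phrase match, which is why “big green banana” needs at least a slop of 1 in this model.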

By default results are returned in relevancy order, and the score is calculated by a formula (if you’re really interested it is in the JavaDoc for the Similarity class – http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/search/Similarity.html). The score can be influenced by boosting terms. e.g. the query ‘subject:lucene OR author:bramley^2’ would boost the score contribution of the author field two-fold for documents whose author fields contained ‘bramley’.

Tools

Before we get onto the practical application it is worth mentioning that Luke, the Lucene Index Toolbox, is an invaluable tool for allowing inspection, searching and browsing of Lucene indices.

It is available from http://code.google.com/p/luke/ – but be aware that you need to use the right version to match the version of Lucene that created your indices. This has been made easier and more obvious as the version numbers of Luke now correspond directly with Lucene versions (e.g. Luke 3.3.0 is for Lucene 3.3.0, however Luke 0.9.9.1 is based on Lucene 2.9.1).

Implementing in Grails

Rather than a Twitter-style application, we’ll build a simple To-Do list application with one domain class to represent the to-do Item, shown in Listing 1, which will use a generated controller and views (you may choose to enable scaffolding instead).

We’ll use this along with the Figure 5 test data in three similar applications to showcase the different Lucene-related Grails plugins. All the sample applications are available on GitHub at https://github.com/rbramley/GroovyMagLucene

Searchable plugin

The Searchable plugin makes use of Compass::GPS to index Hibernate domain objects using GORM/Hibernate lifecycle events. The plugin also provides a default search controller and view. The aim, which lives up to Marc Palmer’s recommendations, is to make full text searching of your domain objects as simple as possible.

The first step is to install the plugin using grails install-plugin searchable.

Once we’ve created the domain class (grails create-domain-class com.rbramley.todo.Item) and populated it with the code from Listing 1, we then add a static searchable property to the domain class: static searchable = [except: 'dateCreated']

Running the application, select the com.rbramley.todo.ItemController link from the home page, and enter the test data from Figure 5.

Return to the home page and then select the grails.plugin.searchable.SearchableController link.

Enter the query ‘timesheet’ in the search box then click the ‘Search’ button – this will give you the results as shown in Figure 6.

Figure 6: Searchable results

There you have it – quick and easy full text search on your domain model. Whilst the style may not match with our application, at least it gives us something to work from.

Note that the plugin stores the indices under ~/.grails/projects/<project-name>/searchable-index/<environment>

Solr plugin

Version 0.2 of the plugin was released in January 2010, is compatible with Grails 1.1+ and bundles an old 1.4 release of Solr and SolrJ. It would be nice to see a new version of this plugin with a current Solr release and updated for the newer dependency resolution mechanisms – if you’ve got time to help out you can fork it on GitHub at http://github.com/mbrevoort/grails-solr-plugin

Once you’ve installed the plugin using ‘grails install-plugin solr’, the ‘grails start-solr’ script will start up a default Solr instance using Jetty. This is located in ‘solr-home’ under the project working directory (based on the Grails build settings) using the example Solr schema and configuration. Whilst this example configuration is fine for evaluation, I (and Lucid Imagination) would strongly recommend you don’t use this configuration in production. For instance, you’ll need to modify this if you want to use the dismax handler (at which point you’re probably better off with a separate installation and version controlled configuration).

The plugin makes good use of convention (leveraging Solr dynamic fields) and meta-programming to add methods to domain classes.

How does it do auto-indexing?

If you check out the plugin’s doWithApplicationContext closure, you will see that a listener is registered for the post-insert, post-update and post-delete events.

What about search?

This is handled by the SolrService which uses the SolrJ client and constructs a simple query (here it would be nice to leverage dismax). However the plugin author (Mike Brevoort) has also helpfully provided the ability to construct your own advanced query and supply that to the SolrService.

Now we’ve uncovered how it works out of the box – it’s time to write some code so we can take it for a test drive.

As mentioned earlier, the plugin is installed with grails install-plugin solr.

Again we’ll use the domain class from Listing 1, but this time adding two properties: static enableSolrSearch = true and static solrAutoIndex = true.

Once the domain class is completed we can generate the controllers & views:

grails generate-all com.rbramley.todo.Item

We’ll add a search box to the main.gsp layout. This is shown in Listing 2 and styling has been left as an exercise for the reader (note you can supply an image for the search button). This submits to the search controller (created using grails create-controller com.rbramley.todo.Search – and populated by Listing 3) with the search.gsp displaying the results (Listing 4 shows a fragment).

Before you start the Grails application, you need to start Solr using grails start-solr.

Once Solr and then Grails have started, we can create an item using the Figure 5 test data, then search for it using ‘timesheet’.

Why are there no results?

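The plugin doc states that it will search against the default Solr field (which by default is called ‘text’). Our subject was indexed into the dynamic string field subject_s rather than into ‘text’, so a bare query for ‘timesheet’ has nothing to match.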

Trying again: subject_s:timesheet also returns no results…

So let’s try a phrase query: subject_s:"Complete timesheet" as shown in Figure 7. This one works because we’ve supplied the whole string to match: “The StrField type is not analyzed, but indexed/stored verbatim.”
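The difference between a verbatim string field and an analyzed text field can be modelled in a few lines of plain Java. This is a simplification for illustration: the analyzed case here only lower-cases and splits on whitespace, whereas Solr's text field type typically applies a longer analysis chain, and none of the names below come from Solr.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: why 'timesheet' misses a verbatim field but hits an analyzed one.
final class VerbatimVersusAnalyzedSketch {

    // Verbatim field: the entire stored value is the single indexed term.
    static boolean matchesVerbatim(String storedValue, String queryTerm) {
        return storedValue.equals(queryTerm);
    }

    // Analyzed field (simplified): value is lower-cased and split into terms.
    static boolean matchesAnalyzed(String storedValue, String queryTerm) {
        Set<String> terms = new HashSet<>(Arrays.asList(
                storedValue.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        return terms.contains(queryTerm.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String subject = "Complete timesheet";
        System.out.println(matchesVerbatim(subject, "timesheet"));          // false
        System.out.println(matchesVerbatim(subject, "Complete timesheet")); // true
        System.out.println(matchesAnalyzed(subject, "timesheet"));          // true
    }
}
```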

So to get the desired results we’ll need to do some Solr configuration, and we have the following options:

Map the fields to text type so that they are analyzed (can also be forced to text through annotations) and use copy field directives to copy them to the default text field

Configure a dismax handler (beyond the scope of this article) to have more flexible control over the fields that are queried by default

If we change the Solr schema, we’ll need to do a full re-index – or in our case with a test application in development mode, we can just delete the index files (typically in ~/.grails/<grails-version>/projects/GroovyMagSolr/solr-home/solr/data).

First we’ll modify Item.groovy to add import org.grails.solr.Solr and then add the Solr annotation to the String subject field e.g. @Solr(asText=true) – this will cause the field to be indexed as subject_t. We’d now be able to search for subject_t:timesheet and get a match – but this still doesn’t meet our usability requirements as it requires knowledge of the underlying document fields. We could use an explicit <copyField source="subject_t" dest="text"/> in the Solr schema.xml, however if you inspect the schema.xml you will see that the first 3000 characters of all ‘_t’ fields are copied to the ‘text’ field.

Now a search for ‘timesheet’ gives the results shown in Figure 9.

Figure 9: Solr simple query result

ElasticSearch plugin

ElasticSearch (http://www.elasticsearch.org/) is a new Lucene-based distributed, RESTful search engine. It was created by Shay Banon, who created Compass (used by the Searchable Plugin) and has worked on data grid technologies.

Version 0.2 of the Grails plugin uses ElasticSearch 0.15.2 – you can start with an embedded instance in development mode which stores the indices in the project source directory under ‘data’.

The first step (within a clean application) is to install the plugin:

grails install-plugin elasticsearch

Then create the domain class (grails create-domain-class com.rbramley.todo.Item) and fill in using the Listing 1 domain class code with the addition of Listing 5.

static searchable = {
    except = 'dateCreated'
}

Listing 5: ElasticSearch mapping

Once that is done and the controller and views generated (grails generate-all com.rbramley.todo.Item), we can then create our Search controller (grails create-controller com.rbramley.todo.Search) and complete it using Listing 6.

Having started up the application and entered the test data as shown in Figure 5, I hit a problem: the database record was created but the index record wasn’t, and the console filled up with repeated log entries until I stopped the application.

When I looked at the IndexRequestQueue source, the JavaDoc says “If indexing fails, all failed objects are retried. Still no support for max number of retries (todo)”.

The NullPointerException itself is ElasticSearch bug 795 – but the underlying cause seems to be that the JSON document didn’t meet the expectations of ElasticSearch for that type!

I got the plugin working by changing the domain class searchable mapping to only = 'subject' – then the results look identical to Figure 9.

In the time available I didn’t manage to track down the initial issue (I also tried with an external ElasticSearch instance and upgrading the ElasticSearch dependencies to 0.16). I’ll be in communication with the plugin authors…

Plugin comparison

So how do they compare? Well, this article has given a basic introduction to their usage for English text and hasn’t demonstrated advanced features or the distributed capabilities of the latter two.

The criteria for comparison are partially inspired by Marc Palmer’s views on plugins – a distilled form is “make it work, make it simple, make it magic”; we’ll also add scalability to this list.

Searchable

This is a well established plugin and works very well out of the box with simple domain classes and even relationships. It is simple to use, works as expected and provides a basic search page and controller which includes some administrative actions such as the ability to re-index all searchable domain classes.

I’ve occasionally encountered older versions throwing exceptions on start-up that could be rectified by removing the indices and restarting the application.

On the downside, due to the embedded nature of the indices, it is only really suitable for single-instance applications.

Solr

This plugin requires some configuration to get the best out of it, and it could benefit from some attention to bring it up to date. However, it does have reasonable documentation and some powerful features such as faceted search, spatial search and taglibs for facets and result links.

It should also be possible to index domain classes that use Mongo through the indexSolr() method that the plugin adds via the metaClass.

ElasticSearch

This plugin has great promise, although I encountered some issues relating to ElasticSearch’s expectations of the index document structure. I should also mention that the plugin documentation currently contains the warning “you should only use this plugin for testing”.

The main attraction of ElasticSearch is the real time search with a distributed and scalable nature.

As with the Solr plugin, it should also be possible to index Mongo-mapped domain classes using the index() method that the plugin adds via the metaClass.
