Do you know how your Lucene indexes are performing?

tl;dr: Plumbr now monitors Lucene indexes. Go grab your free trial and find out how your Elasticsearch/Solr/Liferay installation is doing with regard to full-text search performance. For those with an attention span longer than seven seconds, the motivation and design concepts behind our solution, along with a real-world use case, follow.

Plumbr is all about reducing the time spent troubleshooting performance issues. Our automated root cause detection means that whenever the user experience starts to degrade, Plumbr captures the required information from within the application. As a result, Plumbr users are shown the root cause, down to the single line in source code.

The list of technologies that can and will end up causing poor performance is long. The list of technologies monitored by Plumbr root cause detection is also growing each month. We are glad to announce that since March 2016, Plumbr is capable of monitoring the performance of Lucene indexes.

While analyzing the landscape before building support for this full-text search library, we were surprised twice.

The first surprise was that Lucene indexes are everywhere. The library is embedded in document management systems like Liferay and Alfresco and in generic search services like Elasticsearch and Solr. It is also found in many frequently deployed Java products, such as the Atlassian JIRA and Confluence stacks. So even if you have not directly embedded Lucene into your application, chances are that your infrastructure still relies on it.

The second surprise was the frequency of performance issues caused by poorly performing Lucene indexes. In addition to being deployed everywhere, each and every such deployment also contained performance issues. In all but the most trivial cases, whenever the monitored application made use of Lucene, it was bound to contain poorly configured indexes.

Having confidence in both the severity and the frequency of the issue at hand, we proceeded to build the monitoring solution.

Real-world example

Plumbr Lucene support is best illustrated via an example. Let me walk through a particular root cause captured by Plumbr during the beta testing phase:

As seen, the culprit responsible for 11,320 slow transactions was access to the org.apache.lucene.store.NRTCachingDirectory.doc(?,?) index with 742,599 documents in it. Considering that this particular index is in-memory and the size per document was approximately 21 KB, the full index would require about 16 GB of memory. Coupled with the fact that the JVM in question had access to just 8 GB of heap, part of the problem is already staring us right in the face. The index is simply too big to be cached, meaning that access to non-cached documents ends up loading them from the file system instead of the cache.
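
The arithmetic behind that estimate can be sketched in a few lines of Java. This is only a back-of-the-envelope check and not something Plumbr itself runs: the class and method names are made up for illustration, and it assumes that 21 KB means 21 × 1024 bytes and that the 8 GB heap means 8 × 1024³ bytes.

```java
// Hypothetical helper, not part of Plumbr or Lucene: a rough index footprint estimate.
final class IndexSizeEstimate {

    // Rough footprint: document count multiplied by the average document size, in bytes.
    // multiplyExact fails loudly on overflow instead of silently wrapping around.
    public static long estimateBytes(long documentCount, long avgDocumentBytes) {
        return Math.multiplyExact(documentCount, avgDocumentBytes);
    }

    // True only if the whole index could theoretically sit in the heap at once.
    // Real headroom is smaller, since the application needs heap for everything else too.
    public static boolean fitsInHeap(long indexBytes, long heapBytes) {
        return indexBytes <= heapBytes;
    }

    public static void main(String[] args) {
        long documents = 742_599L;                 // document count from the example above
        long avgDocumentBytes = 21L * 1024;        // assumption: ~21 KB per document
        long heapBytes = 8L * 1024 * 1024 * 1024;  // assumption: 8 GB heap, binary units

        long indexBytes = estimateBytes(documents, avgDocumentBytes);
        System.out.printf("Estimated index size: %.1f GB%n", indexBytes / 1e9);
        System.out.println("Fits in heap: " + fitsInHeap(indexBytes, heapBytes));
    }
}
```

Whichever unit convention you pick, the index comes out at roughly twice the available heap, so at best about half of the documents can be cached at any given time.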

Adding insult to injury, such cache misses combined with certain caching policies (LRU, for example) lay the foundation for cache thrashing. In such a situation, items are repeatedly loaded into and purged from the cache, which also adds a lot of GC pressure on the JVM.
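
To make the thrashing effect tangible, here is a minimal sketch of an LRU cache built on the JDK's LinkedHashMap. It is a toy model, not how Lucene or the monitored application caches documents: the class names, the tiny capacities and the strictly cyclic access pattern are all assumptions chosen to show the worst case.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of LRU thrashing; all names and sizes here are illustrative assumptions.
final class LruThrashingDemo {

    // Minimal LRU cache: an access-ordered LinkedHashMap that evicts its eldest entry
    // as soon as the configured capacity is exceeded.
    static final class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        LruCache(int capacity) {
            super(16, 0.75f, true); // true = access order, which is what makes it LRU
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity;
        }
    }

    // Reads documents 0..distinctDocs-1 in a loop, `rounds` times, and counts how many
    // reads miss the cache and would therefore have to go to the file system.
    public static int countMisses(int capacity, int distinctDocs, int rounds) {
        LruCache<Integer, String> cache = new LruCache<>(capacity);
        int misses = 0;
        for (int round = 0; round < rounds; round++) {
            for (int docId = 0; docId < distinctDocs; docId++) {
                if (cache.get(docId) == null) {
                    misses++;
                    cache.put(docId, "doc-" + docId); // stand-in for loading from disk
                }
            }
        }
        return misses;
    }

    public static void main(String[] args) {
        // The cache holds half of the working set, like an 8 GB heap facing a 16 GB index.
        System.out.println("Misses, cache too small: " + countMisses(8, 16, 10));
        // The cache holds the whole working set.
        System.out.println("Misses, cache large enough: " + countMisses(16, 16, 10));
    }
}
```

With a cyclic scan and a cache half the size of the working set, every entry is evicted just before it is needed again, so each read is a miss. Real access patterns are rarely this adversarial, but the closer they get, the more loading, purging and garbage the cache produces.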

A possible solution would need some insight into the data being cached. It is likely that not all of the 16 GB is used with equal frequency, so understanding the cached data and rethinking the caching strategy is a good place to start.

The second part of the information exposed by Plumbr is visible in the next section of the very same screenshot:

In addition to accessing a huge in-memory index, the code tries to load specific fields from the document by passing the “fieldsToLoad” parameter. Examples of the fields passed via the “fieldsToLoad” parameter include invoiceDate, invoiceNumber and txID, as seen above.

With the information extracted by Plumbr we can see that those fields are not indexed (“indexOptions”: “NONE”). As a result, fetching the entries slows down even more. As a solution, the loaded fields should be indexed with either DOCS_AND_FREQS or DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS, considering how frequently this index is used.

Take-away

Don’t get me wrong, I do believe Lucene is a rather well-crafted search library. But the sheer number of performance issues detected during our beta phase was all the proof I needed to confirm that the problem is both real and widespread.

Being proud of my craft, I can only recommend that if your applications make use of Lucene-based solutions, you take Plumbr out for a test drive and see how many of your users are actually impacted.

Regardless of whether you consume the index directly or via infrastructure components such as Solr or Elasticsearch, you will be zoomed right in to the root cause, exposing:

The impact poorly performing Lucene indexes have on your end users

The actual root cause, down to the single line in source code accessing the index

Information about the index accessed, including the index size, accessed fields, accessor methods and more