Nuxeo, Search and Lucene, Oh My!

Content Repository & Search

At Nuxeo, our job is to provide a content repository, a stack of associated services, and the assembly tools needed so that people can build their own content-centric business applications in a clean, maintainable and scalable way.

When it comes to scalability, the first challenge is usually not so much the storage volume but the ability to quickly execute search queries.

Indeed, most application screens rely heavily on queries: searching for documents under a given hierarchy, in a given state, or associated with a task.

As a result, pretty much all screens of the application are issuing one or several queries.

In addition, platform users can configure both the data structures and the screens, so the queries in use can vary a lot and become very complex.

Our History with Lucene

We've been building a content repository for more than 10 years, so we know about this search challenge. Since everything we do is open source, we have a long history with Apache Lucene.

Strangely enough, we started using Lucene at a time when our repository was running on Python/Zope. We used PyLucene to build an XML-RPC search service. This hybrid solution was a pain to set up (it required compiling Lucene to native code via GCJ), but once it was finally up and running the performance was just amazing.

When we moved the whole platform to Java, Lucene was logically one of the building blocks we re-used. At that time, our repository backend was Apache Jackrabbit, and since the built-in Lucene index was not enough to fulfill our requirements for complex queries, we integrated Compass Core, which provided a transactional layer on top of Lucene (and had an interesting lead developer). This Lucene integration was not so successful for us: we ended up with a lot of missed synchronizations and transaction deadlocks at the index level.

Because of these issues and Jackrabbit's limitations, we decided to completely rewrite the repository implementation, making it 100% SQL-based and therefore 100% ACID.

The result is a very reliable storage where everything relies on the SQL database.

However, there are some limitations in terms of queries:

full-text capabilities are poor in most databases

some queries cannot be made fast

scaling out is complex

So in 2013, we started implementing an Elasticsearch connector for the Nuxeo Platform with the goal of getting rid of these problems: spend less time on complex SQL tuning and focus on Elasticsearch index mapping instead.

Elasticsearch Integration Overview

Hybrid Storage

The idea was to build an additional index on top of the repository:

Nuxeo Repository keeps its primary index based on SQL

ACID but limited in power

Elasticsearch will provide an additional index

full featured index, but asynchronous

One of the main advantages of this approach is that queries are written once using NXQL and then, depending on the configuration, are executed by either the repository or the Elasticsearch index. On a per-query basis, either transactional behavior or search speed can be favored.
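To make the per-query choice concrete, here is a minimal sketch in Python. It is not the Nuxeo implementation: the `NamedQuery` type, the routing table and the two stand-in search functions are all hypothetical, and only illustrate the idea of one NXQL string being dispatched to one backend or the other by configuration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class NamedQuery:
    """A query written once in NXQL and identified by a name (hypothetical type)."""
    name: str
    nxql: str


Backend = Callable[[str], List[str]]


def make_router(backends: Dict[str, Backend],
                routing: Dict[str, str],
                default: str = "repository") -> Callable[[NamedQuery], Tuple[str, List[str]]]:
    """Return a function that runs a query on the backend configured for its name."""
    def run(query: NamedQuery) -> Tuple[str, List[str]]:
        backend_name = routing.get(query.name, default)
        return backend_name, backends[backend_name](query.nxql)
    return run


# Stand-ins for the real backends: they only report which index answered.
def repository_search(nxql: str) -> List[str]:
    return ["answered by the transactional SQL index"]


def elasticsearch_search(nxql: str) -> List[str]:
    return ["answered by the asynchronous Elasticsearch index"]


run = make_router(
    backends={"repository": repository_search, "elasticsearch": elasticsearch_search},
    # Assumed configuration: favor search speed for full-text search only.
    routing={"fulltext_search": "elasticsearch"},
)
```

With this configuration, a query named `fulltext_search` goes to Elasticsearch, while any query that is not listed falls back to the repository and keeps its transactional behavior.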

Later, when we introduced the MongoDB backend for the repository, this demarcation became synchronous vs asynchronous, but still, Elasticsearch remains the solution for blazing fast search.

How do we know that? Well, we did actual benchmarks!

Performance

As soon as we had a first version of the nuxeo-elasticsearch connector, we started running performance benchmarks.

One of the first tests we did compared the performance of the same NXQL query against the Nuxeo Repository's internal SQL index and against the Elasticsearch index.

The results were significant:

Response time is far smaller and remains stable under load.

Throughput increases linearly under load.

Happy with these first results, we finished the integration and did some more benchmarks.

One of the key performance aspects is re-indexing speed:

because it is useful for initial migration

because we have several use cases for recursive re-indexing

We tested re-indexing speed and measured a throughput of about 3,500 documents/s. This is fast enough that a full re-indexing would not be a problem. However, we saw that the bottleneck for re-indexing was not Elasticsearch but the repository's SQL backend. So we ran the same test again with the MongoDB backend and measured a re-indexing throughput of 10,000 documents/s.

We also tested Elasticsearch's capacity to scale out. To do so, we sent a heavy query load to a Nuxeo Repository through the REST API. Throughput peaked at about 3,000 queries/s. Adding a second Elasticsearch node then gave us about 6,000 queries/s, showing a linear scale-out capability.

During these tests, we leveraged the fact that Elasticsearch keeps a copy of the source JSON document stored inside the index: this allows us to off-load the query and the retrieval from the database to Elasticsearch.

This is magic!

Within the technical team, we were very impressed by the Elasticsearch technology and how well everything was working.

Then, the Solutions Architects team started playing with it and discovered that, in addition to high-end performance, Elasticsearch also brings a lot of new features (more on this later).

Finally, it was our clients' turn to discover Elasticsearch. For several of them, solving their performance or scalability concerns was as simple as saying "please activate the nuxeo-elasticsearch plugin!".

This worked so well that we made the nuxeo-elasticsearch plugin the default deployment option in Nuxeo 6.0 and also back-ported it to make it available for people using the previous LTS.

Technical Integration of Elasticsearch

We knew from our previous work on integrating Lucene that the main challenges we had to address were:

Handle security filtering

Keep the index synchronized with the repository

Mitigate eventual consistency side effects

Security Filtering

Inside the Nuxeo Repository, the documents are associated with security descriptors (ACLs) that are used to determine "who can do what" on the document. This is also true when it comes to viewing a document, so this also applies to search: search results must be filtered according to user credentials.

Doing post-filtering is not an option, as we would lose all the added value of Elasticsearch (speed, aggregates...).

In the past, we also tested solutions based on joining Lucene indexes together: this ended up being a pain, and since it was not compatible with Elasticsearch's distributed architecture anyway, this option was ruled out.

So, we decided to index the ACLs inside Elasticsearch. For that, we build a synthesis of the users and groups who have access to the document and we make it part of the JSON representation of the document.

Because the ACLs can be inherited, a change of security will usually trigger a recursive re-index, putting more pressure on the infrastructure that must keep the index in sync with the repository.
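As an illustration of this approach, the sketch below flattens inherited read permissions into a list stored on each indexed document, then builds a `terms` filter from the current user's principals. It is a simplified model under stated assumptions: the field name `acl_readers`, the path-keyed ACL table and the helper names are hypothetical, and deny entries or blocked inheritance are deliberately ignored.

```python
from typing import Dict, Iterable, List


def effective_readers(path: str, acls: Dict[str, Iterable[str]]) -> List[str]:
    """Collect the principals granted read access on a document or any ancestor."""
    readers = set()
    parts = path.strip("/").split("/")
    for depth in range(1, len(parts) + 1):
        ancestor = "/" + "/".join(parts[:depth])
        readers.update(acls.get(ancestor, ()))
    return sorted(readers)


def to_index_document(path: str, title: str, acls: Dict[str, Iterable[str]]) -> dict:
    """Build the JSON representation sent to the index, ACL synthesis included."""
    return {
        "path": path,
        "title": title,
        "acl_readers": effective_readers(path, acls),  # hypothetical field name
    }


def security_filter(user: str, groups: Iterable[str]) -> dict:
    """Elasticsearch `terms` filter: match documents readable by any principal."""
    return {"terms": {"acl_readers": [user, *groups]}}


def is_visible(doc: dict, user: str, groups: Iterable[str]) -> bool:
    """Local equivalent of the filter, useful to reason about its semantics."""
    return bool(set(doc["acl_readers"]) & {user, *groups})
```

Because the readers are resolved at indexing time, changing the ACL of a folder changes the `acl_readers` value of every descendant, which is why such a change leads to a recursive re-index.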

Keeping the Index in Sync

Our past experience with Lucene has led us to believe that trying to make Lucene transactional is a mistake.

So, we decided to accept this limitation and build a system that will make the indexing safe even if asynchronous and non-transactional.

The idea is simple:

collect events occurring during the transaction

de-duplicate them if needed

if the transaction on the repository commits: create an asynchronous job to update Elasticsearch

To make this safe, the asynchronous jobs must be persistent and retryable, but since the Nuxeo Platform uses Redis to back the processing queues, this is not an issue.
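The three steps above can be sketched as follows. This is a minimal illustration rather than the actual connector code: the `IndexingCollector` class and the job format are hypothetical, and the `enqueue` callable stands in for a persistent, retryable queue.

```python
from typing import Callable, Dict


class IndexingCollector:
    """Collects indexing events during one transaction (hypothetical helper)."""

    def __init__(self, enqueue: Callable[[dict], None]) -> None:
        self._enqueue = enqueue
        # doc_id -> whether a recursive re-index was requested.
        # A dict both de-duplicates and preserves the order of first occurrence.
        self._pending: Dict[str, bool] = {}

    def on_event(self, doc_id: str, recursive: bool = False) -> None:
        """Record an event; repeated events on the same document are merged."""
        self._pending[doc_id] = self._pending.get(doc_id, False) or recursive

    def after_completion(self, committed: bool) -> int:
        """Called when the transaction ends; schedule jobs only on commit."""
        pending, self._pending = self._pending, {}
        if not committed:
            return 0  # rollback: nothing changed in the repository, nothing to index
        for doc_id, recursive in pending.items():
            self._enqueue({"doc_id": doc_id, "recursive": recursive})
        return len(pending)
```

In this sketch a rolled-back transaction never produces an indexing job, and a document modified several times in the same transaction is indexed only once.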

Mitigating Consistency Issues

Search is executed against data structures that require some processing to build, which means it is best to batch up some documents before building one. So while Elasticsearch guarantees that documents can be retrieved by ID immediately, it may take a second before Lucene makes the same document visible to search.

This is actually a very good trade-off considering the speed it provides and the very short time interval during which problems may be visible.

Still, there are some cases where this can be an issue. If a user lists documents to process and validates one of them, they logically expect this document to disappear from the list. But when the list is built using Elasticsearch, there might be a small "refresh issue" because the index is not updated in real time.

To mitigate this issue, we added a "pseudo real time mode": when the indexing events are caused by a "UI Thread", the indexing job is run in the afterCompletion of the repository transaction and issues a refresh().

Results in Production

Some of our customers had bad past experiences with Compass/Lucene, and logically they were worried about index consistency when their first Nuxeo + Elasticsearch clusters went live.

So, we built some tools to check the Elasticsearch index and verify that it is really "in sync" with the repository. Having these tools was useful to make everyone more relaxed, and we did not find any real issue.

The feedback of our customers regarding Elasticsearch integration is very good.

Additional Benefits

Integrating Elasticsearch into the Nuxeo Platform gave us a way to really boost the performance of the platform.

However, we also quickly saw that Elasticsearch can provide us with more than raw performance.

Aggregates

Once we had integrated Elasticsearch as a backend for NXQL Queries, it became possible to leverage the aggregate feature.

To do so, we extended the PageProvider model (our model for named queries) so that aggregates can be defined on a query. The result is a powerful Faceted Search feature.
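For readers unfamiliar with aggregates, the sketch below shows how facets map to Elasticsearch `terms` aggregations: the search body gets an `aggs` section, and the response carries `buckets` of `key` / `doc_count` pairs. The helper names are hypothetical, and the facet field names in the usage note are assumptions about the index mapping, not a description of how PageProviders are implemented.

```python
from typing import Dict, List, Tuple


def search_body_with_facets(query: dict, facet_fields: Dict[str, str], size: int = 10) -> dict:
    """Attach one `terms` aggregation per facet to an Elasticsearch search body."""
    return {
        "query": query,
        "aggs": {
            name: {"terms": {"field": field, "size": size}}
            for name, field in facet_fields.items()
        },
    }


def facets_from_response(response: dict) -> Dict[str, List[Tuple[str, int]]]:
    """Turn the `aggregations` part of a search response into (value, count) pairs."""
    return {
        name: [(bucket["key"], bucket["doc_count"]) for bucket in agg["buckets"]]
        for name, agg in response.get("aggregations", {}).items()
    }
```

A call such as `search_body_with_facets({"match_all": {}}, {"by_creator": "dc:creator"})` produces a body that returns, next to the hits, the number of matching documents per creator, which is exactly what a faceted search screen needs to display.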

Advanced Indexing

Elasticsearch's indexing semantics are much richer than what we had inside a SQL database.

Audit Trail

The Audit Log is a typical nightmare for a SQL database: it is write-intensive and holds a huge amount of free-form entries.

Here we can use Elasticsearch as primary storage for what it does best: write once, search many. In addition, it allows for very cool queries mixing documents and history.

Statistics

This opens the way to real-time data analytics on the documents and Audit Log that are available in Elasticsearch.

For this, we added a Nuxeo Elasticsearch pass-through that basically re-exposes the Elasticsearch HTTP API on top of nuxeo-elasticsearch, the main goal being to integrate the security filtering.
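One way to picture what such a pass-through has to do is the sketch below: before forwarding a client's search body, wrap its query in a `bool` query whose `filter` clause restricts results to documents the caller may read. This is an assumption about the general technique, not the actual pass-through code; the function name and the `acl_readers` field are hypothetical, and the `bool`/`filter` form shown is the Elasticsearch 2.0+ syntax (1.x used a `filtered` query instead).

```python
from typing import Iterable


def secure_search_body(client_body: dict,
                       principals: Iterable[str],
                       acl_field: str = "acl_readers") -> dict:
    """Return a copy of the client's search body with a security filter applied."""
    secured = dict(client_body)  # shallow copy: leave the caller's body untouched
    user_query = secured.pop("query", {"match_all": {}})
    secured["query"] = {
        "bool": {
            "must": user_query,
            # Filter clauses restrict the result set without affecting scoring.
            "filter": {"terms": {acl_field: sorted(principals)}},
        }
    }
    return secured
```

Everything else in the body (aggregations, size, sorting) is passed along unchanged, so a dashboard only ever aggregates over documents the current user is allowed to see.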

We use this feature to provide configurable dashboards for workflows and searches.

About the Next Steps

Because we are very happy with the features we gained from integrating Elasticsearch, we want to go further.

The roadmap includes:

upgrade to Elasticsearch 2.0!

add more automatic de-normalization

use Elasticsearch to search through relations for example

leverage Elasticsearch percolator feature

be able to notify users when a change in the repository impacts them

integrate Shield

provide index level security for people using Nuxeo in a multi-tenant environment

Thierry Delprat joined Nuxeo in 2005 as Chief Technology Officer. As CTO, he guides the architectural development of the Nuxeo Platform including the adoption of Java as the platform for innovation.
Prior to joining Nuxeo, Thierry worked for over 7 years at Unilog, with progressively senior experience across different branches of the consulting company. He was also a technical architect at Cryo-Networks (infrastructure for online games), and has participated in start-up companies.
Thierry graduated from the Ecole Centrale de Nantes.