Thursday, November 3, 2011

Near-real-time readers with Lucene's SearcherManager and NRTManager

Last
time, I described the useful SearcherManager class,
coming in the next (3.5.0) Lucene release, to periodically reopen your
IndexSearcher when multiple threads need to share it.
This class presents a very
simple acquire/release API, hiding the
thread-safe complexities of opening and closing the
underlying IndexReaders.

But that example used a non near-real-time (NRT)
IndexReader, which has relatively high turnaround time
for index changes to become visible, since you must call
IndexWriter.commit first.

If you have access to the IndexWriter that's actively
changing the index (i.e., it's in the same JVM as your searchers), use
an NRT reader instead! NRT readers let you
decouple durability to hardware/OS crashes
from visibility of changes to a new IndexReader.
How frequently you commit (for durability) and how frequently you
reopen (to see new changes) become fully separate decisions.
This controlled
consistency model that Lucene exposes is a nice "best of both
worlds" blend between the
traditional immediate
and eventual
consistency models.

Since reopening an NRT reader bypasses the costly commit, and shares
some data structures directly in RAM instead of writing/reading
to/from files, it
provides extremely
fast turnaround time on making index changes visible to searchers.
Reopening as frequently as every 50 milliseconds, even under relatively
high indexing rates, is easily achievable on modern hardware.

Fortunately, it's trivial to use SearcherManager with NRT
readers: use the constructor that takes IndexWriter
instead of Directory:
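Here is a minimal sketch of that usage, assuming the 3.5.0 constructor signature (the RAMDirectory, field name and query below are illustrative only, not from the post):

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NRTSearcherManagerExample {
  public static void main(String[] args) throws IOException {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_35,
            new StandardAnalyzer(Version.LUCENE_35)));

    // NRT constructor: the source for new readers is the IndexWriter,
    // not the Directory (signature assumed from the 3.5.0 API).
    SearcherManager mgr = new SearcherManager(
        writer, true /* applyAllDeletes */,
        null /* SearcherWarmer */, null /* ExecutorService */);

    Document doc = new Document();
    doc.add(new Field("body", "hello world",
        Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);

    // No commit needed: maybeReopen pulls an NRT reader from the writer.
    mgr.maybeReopen();

    IndexSearcher s = mgr.acquire();
    try {
      int hits = s.search(new TermQuery(new Term("body", "hello")), 10).totalHits;
      System.out.println("hits=" + hits);  // the uncommitted doc is visible
    } finally {
      mgr.release(s);
    }
    mgr.close();
    writer.close();
  }
}
```

Note that IndexWriter.commit is never called, yet the search still sees the added document.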

This tells SearcherManager that its source for new
IndexReaders is the provided IndexWriter
instance (instead of a Directory instance). After that,
use the SearcherManager just as before.

Typically you'll set the applyAllDeletes boolean to
true, meaning each reopened reader is required to apply
all previous deletion operations (deleteDocuments
or updateDocument/s) up until that point.

Sometimes your usage won't require deletions to be applied. For
example, perhaps you index multiple versions of each document over
time, always deleting the older versions, yet during searching you
have some way to ignore the old versions. If that's the case, you can
pass applyAllDeletes=false instead. This will make the
turnaround time quite a bit faster, as the primary-key lookups
required to resolve deletes can be costly. However, if you're using
Lucene's trunk (to be eventually released as 4.0), another option is
to use MemoryCodec on your id field
to greatly
reduce the primary-key lookup time.

Note that some or even all of the previous deletes may still be
applied even if you pass false. Also, the pending
deletes are never lost if you pass false: they
remain buffered and will still eventually be applied.

If you have some searches that can tolerate unapplied deletes and
others that cannot, it's perfectly fine to create two
SearcherManagers, one applying deletes and one not.

If you pass a non-null ExecutorService, then each segment
in the index can be searched concurrently; this is a way to gain
concurrency within a single search request. Most applications do not
require this, because the concurrency across multiple searches is
sufficient. It's also not clear that this is effective in general as
it adds per-segment overhead, and the available concurrency is a
function of your index structure. Perversely, a fully optimized index
will have no concurrency! Most applications should pass
null.
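If you do want per-segment concurrency, the sketch below shows the idea (again assuming the 3.5.0 constructor; the pool size is arbitrary):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PerSegmentSearchExample {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(new RAMDirectory(),
        new IndexWriterConfig(Version.LUCENE_35,
            new StandardAnalyzer(Version.LUCENE_35)));

    // Opt-in intra-query concurrency: each segment of the index is
    // searched on this pool. Most applications should pass null here.
    ExecutorService es = Executors.newFixedThreadPool(4);
    SearcherManager mgr = new SearcherManager(
        writer, true /* applyAllDeletes */, null /* warmer */, es);

    // ... acquire()/release() and search as usual ...

    mgr.close();
    writer.close();
    es.shutdown();  // the manager does not own the pool; shut it down yourself
  }
}
```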

NRTManager

What if you want the fast turnaround time of NRT readers, but need
control over when specific index changes become visible to certain
searches? Use NRTManager!

NRTManager holds onto the IndexWriter
instance you provide and then exposes the same APIs for making index
changes (addDocument/s, updateDocument/s,
deleteDocuments). These methods forward to the
underlying IndexWriter, but then return a
generation token (a Java long) which you can
hold onto after making any given change. The generation only
increases over time, so if you make a group of changes, just keep the
generation returned from the last change you made.

Then, when a given search request requires certain changes to be
visible, pass that generation back to
NRTManager to obtain a searcher that's guaranteed to
reflect all changes for that generation.
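Sketched in code (hedged: this follows the 3.5-era API as described in this post; writer is your IndexWriter, alicePosts is a hypothetical list of Documents, and NRTManager was later replaced by ControlledRealTimeReopenThread in 4.x):

```java
// Route index changes through NRTManager so each change returns a generation.
NRTManager nrt = new NRTManager(writer, null /* SearcherWarmer */);

long gen = 0;
for (Document post : alicePosts) {   // alicePosts: hypothetical List<Document>
  gen = nrt.addDocument(post);       // generation only increases; keep the last
}

// Alice's search: block until a searcher covering her generation is available.
SearcherManager mgr = nrt.waitForGeneration(gen, true /* requireDeletes */);
IndexSearcher searcher = mgr.acquire();
try {
  // run Alice's search; it is guaranteed to see all her posts
} finally {
  mgr.release(searcher);
}

// Bob's search: no causal dependency, so just use the most recent searcher.
IndexSearcher bobSearcher = nrt.getSearcherManager(true).acquire();
```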

Here's one example use-case: let's say your site has a forum, and you
use Lucene to index and search all posts in the forum. Suddenly a
user, Alice, comes online and adds a new post; in your server, you
take the text from Alice's post and add it as a document to the index,
using
NRTManager.addDocument, saving the returned generation.
If she adds multiple posts, just keep the last generation.

Now, if Alice stops posting and runs a search, you'd like to ensure
her search covers all the posts she just made. Of course, if your
reopen time is fast enough (say once per second), unless Alice
types very quickly, any search she runs will already reflect
her posts.

But pretend for now you reopen relatively infrequently (say once every
5 or 10 seconds), and you need to be certain Alice's search covers her
posts, so you call NRTManager.waitForGeneration to obtain
the SearcherManager to use for searching. If the latest
searcher already covers the requested generation, the method returns
immediately. Otherwise, it blocks, requesting a reopen (see below),
until the required generation has become visible in a searcher, and
then returns it.

If some other user, say Bob, doesn't add any posts and runs a search,
you don't need to wait for Alice's generation to be visible when
obtaining the searcher, since it matters far less exactly when Alice's
changes become visible to Bob. There's (usually!) no
causal connection between Alice posting and Bob searching, so it's
fine for Bob to use the most recent searcher.

Another use-case is an index verifier, where you index a document and
then immediately search for it to perform end-to-end validation that
the document "made it" correctly into the index. That immediate
search must first wait for the returned generation to become
available.

The power of NRTManager is you have full control over
which searches must see the effects of which indexing changes; this is
a further improvement in Lucene's controlled consistency
model. NRTManager hides all the tricky details of
tracking generations.

But: don't abuse this! You may be tempted to always wait for the last
generation you indexed for all searches, but this would result in very
low search throughput on concurrent hardware since all searches would
bunch up, waiting for reopens. With proper usage, only a small subset
of searches should need to wait for a specific generation, like Alice;
the rest will simply use the most recent searcher, like Bob.

Managing reopens is a little trickier with NRTManager,
since you should reopen at higher frequency whenever a search is
waiting for a specific generation. To address this, there's the
useful NRTManagerReopenThread class; use it like this:
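Something like the following (the constructor argument order is assumed from the 3.5-era API; nrtManager is your NRTManager instance):

```java
// Reopen at least every 5.0 sec ordinarily, and within ~25 msec
// whenever a search is waiting for a specific generation.
double minStaleSec = 0.025;
double maxStaleSec = 5.0;

NRTManagerReopenThread reopenThread =
    new NRTManagerReopenThread(nrtManager, maxStaleSec, minStaleSec);
reopenThread.setName("NRT Reopen Thread");
reopenThread.setDaemon(true);
reopenThread.start();

// ... on application shutdown:
// reopenThread.close();
```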

The minStaleSec sets an upper bound on the time a user must wait
before the search can run. This bound applies whenever a search is waiting for
a specific generation (Alice, above), meaning the longest such a search
should have to wait is approximately 25 msec.

The maxStaleSec sets a lower bound on how frequently
reopens should occur. This is used for the periodic "ordinary"
reopens, when there is no request waiting for a specific generation
(Bob, above); this means any changes done to the index more than
approximately 5.0 seconds ago will be seen when Bob searches. Note
that these parameters are approximate targets and not hard guarantees
on the reader turnaround time. Be sure to eventually
call thread.close() when you are done reopening (for
example, when shutting down the application).

You are also free to use your own strategy for
calling maybeReopen; you don't have to
use NRTManagerReopenThread. Just remember that getting
it right, especially when searches are waiting for specific
generations, can be tricky!

38 comments:

I have tried the first approach, with SearcherManager and IndexWriter. However, the returned IndexSearcher doesn't return the documents that are not committed to the index. Did you forget to mention anything in this article about that?

Do I understand it right that, in general, if I use the NRT approach, i.e. passing the IndexWriter to the constructor of the IndexReader, I need a separate thread that periodically calls IndexWriter.commit to persist changes in case of shutdown/process kill/etc.?

I'm trying to implement the NRT approach using the 4.0 API. The flow of the application is the following:

1. User is registered => a new document is added to the index.
2. Just after the registration the user is redirected to a details page, which he may or may not complete.
3. If the user completes the details => the document should be updated.

As you can see there may be a very short period of time between creating the document and searching for the document ID in order to be able to update it. This is why we decided to call maybeRefresh() without deletion on the ReferenceManager (NRT implementation) after every addDocument().

Even though I always call for a refresh, in every single test the added Document is not visible when trying to update. Is there something I'm missing?

You should use SearcherManager.maybeRefreshBlocking if you want to wait until the refresh has completed.

But it sounds like NRTManager would be a better fit here, since it allows you to only wait for the one case where this user needs to update their document, ie you can ensure the searcher you get back will reflect a specific indexing change from the past.

I have one correction: "The minStaleSec sets an upper bound on how frequently reopens should occur." should be: "The minStaleSec sets an upper bound on the time someone is required to wait before his search goes through."

We're having an issue where updates are not being picked up by the IndexReader but I'm starting to think it might be related to our particular architecture. The reading is done by a Web app but index updates are done by a completely separate process (Indexer). Once that process is done we have a Unix script that cleans up the index directory (used by the web app) and copies over the new set of files generated by the indexer.

The way we're trying to handle this on the web app side is to have a scheduled thread that wakes up every 5 mins, grabs a reference to the SearchManager (the same SearchManager used during reading) and then calls manager.maybeReopen().

When debugging we see a new IndexReader being created. We just don't see the new documents added by the Indexer.

I hope this explanation is clear enough. Any pointers will be greatly appreciated.

Well, actually that was somebody else from my team. Whatever I put in my original comment can shed light on the big picture of the app we built. Like I mentioned before, reading and updating the indexes are processes done by 2 different apps running on different servers. This seems to be an atypical use case, as everywhere in Lucene's docs and forums the 'normal' usage seems to be the same code handling both operations. Like my coworker explains, the issue seems to be with the new index having the same filenames as the old one. This seems to cause the IndexReaders to point to the old segments.

Hmm, this (that you said above) is particularly troubling: "When debugging we see a new IndexReader being created. We just don't see the new documents added by the Indexer.".

If the index was newly built and copied over, and then the old IndexReader is reopened, and indeed a new instance was opened, yet you are still missing documents ... I think it must be that the documents you expected were not in fact indexed? Or, are you certain a new IndexReader was actually opened?

We were finally able to fix this issue. This is our understanding of the problem and how we fixed it:

- At the code level we were deleting all (previous) documents and adding a new set.
- But, at the OS level, we were deleting all files first, thinking this was actually a safe approach.

It turns out that because of this the new Index files ended up with the exact same name as the old one. When we copied over the files and the SearchManager loaded them up we were seeing that although a new IndexReader instance was being created, the underlying readers were still pointing to the 'old' index.

The way we fixed this is that we stopped deleting files but rather let Lucene take care of the whole thing. After that we started to see new files being created and while the old files were still there the SearchManager was now able to fetch the new set of documents.

Thanks first of all. Your blog posts are very useful when I hit a problem that is internal to Lucene.

Please can you help me understand the following line, which I took from the NRTManager class comment:

"You may want to create two NRTManagers, one that always applies deletes on refresh and one that does not. In this case you should use a single {@link NRTManager.TrackingIndexWriter} instance for both."

Does this mean the one with applyDeletes=true should be used by application code that mostly creates/updates the index, and the other one with applyDeletes=false should be used mainly to acquire() searchers, by the search threads in the application?

This is useful if you have some requests that must show all deletions (such as incoming user searches) and other requests where it doesn't matter (e.g. if you have some automation scripts that run searches looking for specific SKUs or something)... in that case you can simply make two NRTManager instances. This is a fairly esoteric use case, though, and I would start by just making a single instance that always applies deletes and sharing that across both use cases, until/unless you hit performance issues.

I am implementing NRT and found that from the 4.4.0 release onwards the Near Real Time Manager (org.apache.lucene.search.NRTManager) has been replaced by ControlledRealTimeReopenThread.

Please advise should I use ControlledRealTimeReopenThread as described at http://stackoverflow.com/questions/17993960/lucene-4-4-0-new-controlledrealtimereopenthread-sample-usage?answertab=votes#tab-top.

In Lucene 4.7.2, I think NRTManager is replaced with ControlledRealTimeReopenThread. As NRTManager is not available in the current release I am kind of confused. I am trying out ControlledRealTimeReopenThread but I am not sure whether it will be near-real-time. Can you provide an example of near-real-time search using ControlledRealTimeReopenThread, or should it not be used for near-real-time search?

Then, for each query, you determine whether it needs the "current" reader or it must wait for a specific indexing generation (because you want to ensure a certain indexing change is visible), when acquiring the searcher.

Hi Mike, I'm implementing NRTManager in C# using Lucene.Net.Contrib.Management.dll. I load all documents using an IndexWriter:

<<Directory d = new RAMDirectory();
indexWriter = new IndexWriter(d, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_CURRENT), !IndexReader.IndexExists(d), new Lucene.Net.Index.IndexWriter.MaxFieldLength(IndexWriter.DEFAULT_MAX_FIELD_LENGTH));>>

and then initialize the NRTManager with it: <<static NrtManager man = new NrtManager(indexWriter);>>

When I need to add a new entry to the manager I do this:

<<doc = new Document();
doc.Add(new Field("Info", newInfo, //dr["NickName"].ToString(),
    Field.Store.YES, Field.Index.ANALYZED));
// Write the Document to the catalog
man.AddDocument(doc);>>

At search I ALWAYS do this:

<<if (man.GetSearcherManager().MaybeReopen())
    man.GetSearcherManager().Acquire().Searcher.IndexReader.Reopen();>>

You should not need to call Searcher.IndexReader.Reopen like that, assuming the C# port is like Lucene's. A single call to .maybeRefresh will open a new NRT reader, if there are any changes.

Also, NRTManager (renamed / factored out a while back to ControlledRealTimeReopenThread in Lucene) is only needed when you have some threads that want a "real-time" reader and other threads that are OK with the current near-real-time reader.

Maybe you should simplify your test to just use an "ordinary" SearcherManager and see if the problem still happens?

If so, there must be a bug somewhere in tracking of changes in the C# IndexWriter...

About Me

Michael loves building software; he's been building search engines for more than a decade. In 1999 he co-founded iPhrase Technologies, a startup providing a user-centric enterprise search application, written primarily in Python and C. After IBM acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a committer in 2006 and PMC member in 2008. Michael has remained an active committer, helping to push Lucene to new places in recent years. He's co-author of Lucene in Action, 2nd edition. In his spare time Michael enjoys building his own computers, writing software to control his house (mostly in Python), encoding videos and tinkering with all sorts of other things.