But there is a serious yet often overlooked problem with this
approach. To see it, you have to put yourself in the shoes of a user.
Imagine Alice comes to your site, runs a search, and is looking
through the search results. Not satisfied, after a few seconds she
decides to refine that first search. Perhaps she drills down on one
of the nice facets you presented, or maybe she clicks to the next
page, or picks a different sort criterion (any follow-on action will
do). So a new search request is sent back to your server, including
the first search plus the requested change (drill down, next page,
change sort field, etc.).

How do you handle this follow-on search request? Just pull the latest
and greatest searcher from your SearcherManager or NRTManager and
search away, right?

Wrong!

If you do this, you risk a broken search experience for Alice, because
the new searcher may be different from the original searcher used for
Alice's first search request. The differences could be substantial,
if you had just opened a new searcher after updating a bunch of
documents. This means the results of Alice's follow-on search may
have shifted: facet counts are now off, hits are sorted differently so
some hits may be duplicated on the second page, or may be lost (if
they moved from page 2 to page 1), etc. If you use the new
searchAfter API (coming in Lucene 3.5.0) for efficient paging, the
risk is even greater!
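To see why paging is especially fragile, here is a minimal sketch of deep paging with searchAfter (assuming the Lucene 3.5 IndexSearcher.searchAfter(ScoreDoc, Query, int) method; the query and page size are just illustrative). Because the second call resumes relative to the last hit of the first page, it silently returns wrong results if a different (newer) searcher has re-sorted or changed the index in between:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Page 1: an ordinary search (searcher obtained elsewhere)
Query query = new TermQuery(new Term("body", "lucene"));
TopDocs firstPage = searcher.search(query, 10);

// Page 2: resume after the last hit of page 1. This only makes
// sense against the SAME searcher that produced firstPage.
ScoreDoc lastHit = firstPage.scoreDocs[firstPage.scoreDocs.length - 1];
TopDocs secondPage = searcher.searchAfter(lastHit, query, 10);
```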

Perversely, the frequent searcher reopening that you thought provides
such a great user experience by making all search results so fresh,
can in fact have just the opposite effect. Each reopen risks breaking
all current searches in your application; the more active
your site, the more searches you might break!

It's deadly to intentionally break a user's search experience: they
will (correctly) conclude your search is buggy, eroding their trust,
and then take their business to your competition.

It turns out, this is easy to fix! Instead of pulling the latest
searcher for every incoming search request, you should try to pull the
same searcher used for the initial search request in the session.
This way all follow-on searches see exactly the same index.

Fortunately, there's a new class coming in Lucene 3.5.0 that
simplifies this: SearcherLifetimeManager. The class is
agnostic to how you obtain the fresh searchers
(e.g., SearcherManager, NRTManager, or your
own custom source) used for an initial search.
Just like Lucene's other manager classes,
SearcherLifetimeManager is very easy to use. Create the manager once,
up front:

SearcherLifetimeManager mgr = new SearcherLifetimeManager();

Then, when a search request arrives, if it's an initial (not
follow-on) search, obtain the most current searcher in the usual way,
but then record this searcher:

long token = mgr.record(searcher);

The returned token uniquely identifies the specific
searcher; you must save it along with the user's search results, for
example by placing it in a hidden HTML form field.
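For instance, a servlet might write the token into the results page like this (a hypothetical sketch: the form field name and the response object are assumptions, not part of the Lucene API):

```java
// Record the searcher used for this initial search, then hand the
// token to the client so every follow-on request sends it back.
long token = mgr.record(searcher);
response.getWriter().println(
    "<input type=\"hidden\" name=\"searcherToken\" value=\"" + token + "\"/>");
```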

Later, when the user performs a follow-on search request, make sure
the original token is sent back to the server, and then
use it to obtain the same searcher:

IndexSearcher searcher = mgr.acquire(token);

As long as the original searcher is still available, the manager will
return it to you; be sure to release that searcher
(ideally in a finally clause).

It's possible the searcher is no longer available: for example, if
Alice ran a new search, but then got hungry, went off to a long lunch,
and finally returned and clicked "next page", the original searcher
will likely have been pruned!

You should gracefully handle this case, for example by notifying Alice
that the search had timed out and asking her to re-submit the original
search (which will then get the latest and greatest searcher).
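Putting these pieces together, a follow-on request handler might look like this sketch (the query and how you notify the user are application-specific assumptions; acquire returning null for a pruned searcher and release are the real SearcherLifetimeManager API):

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;

// Acquire the exact searcher used for the initial search, by token
IndexSearcher searcher = mgr.acquire(token);
if (searcher == null) {
  // The searcher was pruned: tell the user the search timed out and
  // ask her to re-submit it (which will record a fresh searcher).
} else {
  try {
    TopDocs hits = searcher.search(query, 10);
    // ... render the results page ...
  } finally {
    // Always release, even if rendering throws
    mgr.release(searcher);
  }
}
```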
Fortunately, you can reduce how often this happens, by controlling how
aggressively you prune old searchers:

mgr.prune(new PruneByAge(600.0));

This removes any searchers older than 10 minutes (you can also
implement a custom pruning strategy). You should call it from a
separate dedicated thread (not a searcher thread), ideally the same
thread that's periodically indexing changes and opening new searchers.
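One way to arrange this (my assumption; the post doesn't prescribe a mechanism) is a ScheduledExecutorService that prunes once a minute, removing searchers older than ten minutes:

```java
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.search.SearcherLifetimeManager;

ScheduledExecutorService scheduler =
    Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(new Runnable() {
  public void run() {
    try {
      // Drop searchers recorded more than 600 seconds ago
      mgr.prune(new SearcherLifetimeManager.PruneByAge(600.0));
    } catch (IOException e) {
      // Log and carry on; pruning will retry on the next tick
    }
  }
}, 1, 1, TimeUnit.MINUTES);
```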

Keeping many searchers around will necessarily tie up resources (open
file descriptors, RAM, index files on disk that
the IndexWriter would otherwise have deleted). However,
because the reopened searchers share sub-readers, the resource
consumption will generally be well contained, in proportion to how
many index changes occurred between each reopen. Just be sure to
use NRTCachingDirectory, to ensure you don't bump up
against open file descriptor limits on your operating system (this
also gives a good speedup in reopen turnaround time).

About Me

Michael loves building software; he's been building search engines for more than a decade. In 1999 he co-founded iPhrase Technologies, a startup providing a user-centric enterprise search application, written primarily in Python and C. After IBM acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a committer in 2006 and PMC member in 2008. Michael has remained an active committer, helping to push Lucene to new places in recent years. He's co-author of Lucene in Action, 2nd edition. In his spare time Michael enjoys building his own computers, writing software to control his house (mostly in Python), encoding videos and tinkering with all sorts of other things.