Sunday, December 30, 2012

Robert has created an exciting new highlighter for Lucene,
PostingsHighlighter, our third highlighter implementation
(Highlighter and FastVectorHighlighter are the existing ones). It
will be available starting in the upcoming 4.1 release.

Highlighting is crucial functionality in most search applications
since it's the first step of the hard-to-solve final inch problem,
i.e. getting the user not only to the best matching documents but
also to the best spot(s) within each document. The larger your
documents are, the more crucial it is that you address the final inch.
Ideally, your user interface would let the user click on each highlight
snippet to jump to where it occurs in the full document, or at least
scroll to the first snippet when the user clicks on the document link.
This is in general hard to solve: which application renders the
content depends on its mime-type (e.g., the browser renders HTML
itself, but embeds Acrobat Reader to render PDF).

Google's Chrome browser has an ingenious solution to the final inch
problem, when you use "Find..." to search the current web page: it
highlights the vertical scroll bar showing you where the matches are on
the page. You can then scroll to those locations, or, click on the
highlights in the scroll bar to jump there. Wonderful!

All Lucene highlighters require search-time access to the start and
end offsets per token, which are character offsets indicating
where in the original content that token started and ended. Analyzers
set these two integers per-token via the OffsetAttribute,
though some analyzers and token filters are known to mess up offsets
which will lead to incorrect highlights or exceptions during
highlighting. Highlighting while using SynonymFilter is
also problematic in certain cases, for example when a rule maps
multiple input tokens to multiple output tokens, because the Lucene
index doesn't
store the full token graph.
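
To see what these offsets look like, here's a minimal sketch (the
field name and sample text are arbitrary) that prints the start and
end offset StandardAnalyzer assigns to each token:

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.util.Version;

    public class ShowOffsets {
      public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);
        TokenStream ts = analyzer.tokenStream("body", new StringReader("The Quick brown fox"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          // Each token records the character range it covers in the original text:
          System.out.println(term + " -> [" + offsets.startOffset() + ", " + offsets.endOffset() + ")");
        }
        ts.end();
        ts.close();
      }
    }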

Unlike the existing highlighters, which rely on term-vectors or on
re-analysis of each matched document to obtain the per-token
offsets, PostingsHighlighter uses
the recently added postings offsets feature. To index postings
offsets you must set the field to be highlighted to use the
FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
option during indexing.
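
Concretely, that might look like this (a minimal sketch against the
4.1 APIs; the field is also stored here so the highlighter can
retrieve the original text at search time):

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.FieldInfo.IndexOptions;
    import org.apache.lucene.index.IndexWriter;

    static void addDoc(IndexWriter writer, String bodyText) throws IOException {
      FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
      // In addition to docs, freqs and positions, record each token's
      // start/end character offsets in the postings lists:
      offsetsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
      Document doc = new Document();
      doc.add(new Field("body", bodyText, offsetsType));
      writer.addDocument(doc);
    }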

It turns out postings offsets are a much more efficient way to store
offsets because the default codec (currently Lucene41)
does a good job compressing them: ~1.1 bytes per position, which
includes both start and end offset. In contrast, term vectors require
substantially more disk space (~7.8X for the 10 million document
English Wikipedia index), slow down indexing and merging, and are slow
to access at search time. A smaller index also means the "working
set" size, i.e. the net number of bytes that your search application
frequently hits from disk, will be smaller, so you'll need less RAM to
keep the index hot.

PostingsHighlighter uses
a BreakIterator
to find passages in the text; by default it breaks
using BreakIterator.getSentenceInstance. It then iterates in parallel
(merge sorting by offset) through the positions of all terms from the
query, coalescing those hits that occur in a single passage into
a Passage, and then scores each Passage
using a separate PassageScorer.
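
Putting it together, a minimal search-and-highlight sketch (assuming
the "body" field was indexed with offsets and stored, as above) looks
like this:

    import java.io.IOException;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.postingshighlight.PostingsHighlighter;

    static String[] highlight(IndexSearcher searcher, Query query) throws IOException {
      PostingsHighlighter highlighter = new PostingsHighlighter();
      TopDocs hits = searcher.search(query, 10);
      // Returns one snippet (the best-scoring passages) per hit:
      return highlighter.highlight("body", query, searcher, hits);
    }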

The scoring model is fun: it treats the single original document as
the whole corpus, and then scores individual passages as if they were
documents in this corpus. The default PassageScorer
uses BM25
scoring, biased with a normalization factor that favors passages
occurring closer to the start of the document, but it's pluggable so
you can implement your own scoring (and feel free to share if you find
an improvement!).

This new highlighter should be substantially faster than our existing
highlighters on a cold index (when the index doesn't fit entirely into
available RAM), as it does more sequential IO instead of seek-heavy
random access. Furthermore, the performance gains should grow as you
increase the number of top hits, and as your documents get larger.

One known limitation is that it can only highlight a single field at a
time, i.e. you cannot pass it N fields and have it pick the best
passages across all of them, though both existing highlighters have
the same limitation. The code is very new and may still have some
exciting bugs! This is why it's located under
Lucene's sandbox module.

If you are serious about highlighting in your search application (and
you should be!) then
PostingsHighlighter is well worth a look!

Sunday, December 9, 2012

Lucene's facet module, first appearing in the 3.4.0 release,
offers a powerful implementation, making it trivial to add a faceted
user interface to your search application. Shai Erera wrote up a
nice overview and worked through "getting started" examples in his
second post.

The facet module can compute the usual counts for each facet, but also
has advanced features such as aggregates other than hit count,
sampling (for better performance when there are many hits) and
complements aggregation (for better performance when the number of
hits is more than half of the index). All facets are hierarchical, so
the app is free to index an arbitrary tree structure for each
document. With the upcoming 4.1, the facet module
will fully
support near-real-time (NRT) search.
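
For example, indexing a document with a hierarchical date facet
might look roughly like this (a sketch against the 4.1-era facet
APIs; these classes moved around between releases, so check your
version's javadocs):

    import java.io.IOException;
    import java.util.Collections;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.facet.index.FacetFields;
    import org.apache.lucene.facet.taxonomy.CategoryPath;
    import org.apache.lucene.facet.taxonomy.TaxonomyWriter;
    import org.apache.lucene.index.IndexWriter;

    static void addDoc(IndexWriter writer, TaxonomyWriter taxoWriter,
                       String year, String month, String day) throws IOException {
      // FacetFields adds the drill-down terms and encoded facet data to the
      // document, while the taxonomy writer assigns each category an ordinal:
      FacetFields facetFields = new FacetFields(taxoWriter);
      Document doc = new Document();
      facetFields.addFields(doc, Collections.singletonList(
          new CategoryPath("Date", year, month, day)));
      // ... add the normal searchable fields here ...
      writer.addDocument(doc);
    }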

Lucene's nightly performance benchmarks

I was curious about the performance of faceted search, so I added date
facets, indexed as
year/month/day hierarchy, to
the nightly
Lucene benchmarks. Specifically I added faceting to
all TermQuerys that were already tested, and now we can
watch this graph to track our faceted search performance over time.
The date
field is the timestamp of the most recent revision of each
Wikipedia page.

Simple performance tests

I also ran some simple initial tests on a recent (5/2/2012)
English
Wikipedia export, which contains 30.2 GB of plain text
across 33.3 million documents. By default, faceted search retrieves
the counts of all facet values under the root node (years, in this
case).
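
In code, gathering those root-level counts looks something like this
sketch (again 4.1-era APIs; exact package names and constructors
changed across releases):

    import java.io.IOException;
    import java.util.List;

    import org.apache.lucene.facet.search.FacetsCollector;
    import org.apache.lucene.facet.search.params.CountFacetRequest;
    import org.apache.lucene.facet.search.params.FacetSearchParams;
    import org.apache.lucene.facet.search.results.FacetResult;
    import org.apache.lucene.facet.taxonomy.CategoryPath;
    import org.apache.lucene.facet.taxonomy.TaxonomyReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    static List<FacetResult> searchWithFacets(IndexSearcher searcher,
        IndexReader reader, TaxonomyReader taxoReader, Query query) throws IOException {
      // Count the top 10 values directly under the "Date" root (the years):
      FacetSearchParams fsp = new FacetSearchParams(
          new CountFacetRequest(new CategoryPath("Date"), 10));
      FacetsCollector fc = new FacetsCollector(fsp, reader, taxoReader);
      searcher.search(query, fc);
      return fc.getFacetResults();
    }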

It's interesting that 2012 has such a high count, even though this
export only includes the first five months and two days of 2012.
Wikipedia's pages are very actively edited!

The search index with facets grew only slightly (~2.3%, from 12.5 GB
to 12.8 GB) because of the additional indexed facet field. The
taxonomy index, which is a separate index used to map facets to fixed
integer codes, was tiny: only 120 KB. The more unique facet values
you have, the larger this index will be.

Next I compared search performance with and without faceting. A
simple TermQuery (party), matching just over
a million hits, was 51.2 queries per second (QPS) without facets and
3.4 QPS with facets. While this is a somewhat scary slowdown, it's
the worst case scenario: TermQuery is very cheap to
execute, and can easily match a large number of hits. The cost of
faceting is in proportion to the number of hits. It would be nice to
speed this up (patches welcome!).

I also tested a harder PhraseQuery ("the
village"), matching 194 K hits: 3.8 QPS without facets and 2.8
QPS with facets, which is less of a hit
because PhraseQuery takes more work to match each hit and
generally matches fewer hits.

Loading facet data in RAM

For the above results I used the facet defaults, where the
per-document facet values are left on disk during aggregation. If you
have enough RAM you can also load all facet values into RAM using
the CategoryListCache class. I tested this, and it gave
nice speedups: the TermQuery was 73% faster (to 6.0 QPS)
and the PhraseQuery was 19% faster.

However, there are downsides: it's time-consuming to initialize (4.6
seconds in my test), and not NRT-friendly, though this shouldn't be so
hard to fix (patches welcome!). It also required a substantial 1.9 GB
RAM, according to
Lucene's RamUsageEstimator.
We should be able to reduce this RAM usage by switching to
Lucene's fast
packed ints implementation from the current int[][] it uses today,
or by
using DocValues
to hold the per-document facet data. I just
opened LUCENE-4602
to explore DocValues and initial results look very promising.

Sampling

Next I tried sampling, where the facet module visits 1% of the hits
(by default) and only aggregates counts for those. In the default
mode, this sampling is used only to find the top N facet values, and
then a second pass computes the correct count for each of those
values. This is a good fit when the taxonomy is wide and flat, and
counts are pretty evenly distributed. I tested that, but results were
slower, because the date taxonomy is not wide and flat and has rather
lopsided counts (2012 has the majority of hits).

You can also skip the second pass and then present approximate counts
or a percentage value to the user. I tested that and saw sizable
gains: the TermQuery was 248% (2.5X) faster (to 12.2 QPS)
and the PhraseQuery was 29% faster (to 3.6 QPS). The
sampling is also quite configurable: you can set the min and max
sample sizes, the sample ratio, the threshold under which no sampling
should happen, etc.
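
Here's a rough sketch of that configuration (the sampling classes
live in the facet module's sampling package; the exact wiring into
the accumulator varies by version):

    import java.util.Random;

    import org.apache.lucene.facet.search.sampling.RandomSampler;
    import org.apache.lucene.facet.search.sampling.Sampler;
    import org.apache.lucene.facet.search.sampling.SamplingParams;

    static Sampler makeSampler() {
      SamplingParams params = new SamplingParams();
      params.setSampleRatio(0.01);        // visit ~1% of the hits
      params.setMinSampleSize(100);       // but never fewer than this
      params.setMaxSampleSize(10000);     // and never more than this
      params.setSamplingThreshold(10000); // below this many hits, don't sample at all
      return new RandomSampler(params, new Random(0));
    }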

Lucene's facet module makes it trivial to add facets to your search
application, and offers useful features like sampling, alternative
aggregates, complements, RAM caching, and fully customizable
interfaces for many aspects of faceting. I'm hopeful we can reduce
the RAM consumption for caching, and speed up the overall
performance, over time.
