Thursday, December 12, 2013

Fast range faceting using segment trees and the Java ASM library

In Lucene's facet module we recently added support
for dynamic
range faceting, to show how many hits match each of a dynamic set
of ranges. For example, the Updated drill-down in the
Lucene/Solr
issue search application uses range facets. Another
example is distance facets (< 1 km, < 2 km, etc.), where the distance
is dynamically computed based on the user's current location. Price
faceting might also use range facets, if the ranges cannot be
established during indexing.

To implement range faceting, for each hit, we first calculate the
value (the distance, the age, the price) to be aggregated, and then
lookup which ranges match that value and increment its counts. Today
we use a simple linear search through all ranges, which
has O(N) cost, where N is the number of ranges.

But this is inefficient!

Segment trees

There are fun data structures
like segment
trees and
interval
trees with O(log(N) + M) cost per lookup, where
M is the number of ranges that match the given value. I
chose to explore segment trees, as Lucene only requires looking up by
a single value (interval trees can also efficiently look up
all ranges overlapping a provided range) and also because all
the ranges are known up front (interval trees also support dynamically
adding or removing ranges).

If the ranges will never overlap, you can use a simple binary search;
Guava's
ImmutableRangeSet takes this approach. However,
Lucene's range faceting allows overlapping ranges so we can't do
that.

Segment trees are simple to visualize: you "project" all ranges down on
top of one another, creating a one-dimensional
Venn diagram,
to define the elementary intervals. This classifies the
entire range of numbers into a minimal number of distinct ranges, each
elementary interval, such that all points in each elementary interval
always match the same set of input ranges. The lookup process is then
a binary search to determine which elementary interval a point belongs
to, recording the matched ranges as you recurse down the tree.

Consider these ranges; the lower number is inclusive and the upper
number is exclusive:

0: 0 – 10
1: 0 – 20
2: 10 – 30
3: 15 – 50
4: 40 – 70

The elementary intervals (think Venn diagram!) are:

-∞ – 0
0 – 10
10 – 15
15 – 20
20 – 30
30 – 40
40 – 50
50 – 70
70 – ∞

Finally, you build a binary tree on top of the elementary ranges, and
then add output range indices to both internal nodes and the leaves of
that tree, necessary to prevent adversarial cases that would require
too much (O(N^2)) space. During lookup, as you walk down
the tree, you gather up the output ranges (indices) you encounter; for
our example, each elementary range is assigned the follow range
indices as outputs:

Some ranges correspond to 1 elementary interval, while other ranges
correspond to 2 or 3 or more, in general. Some, 2 in this example,
may have no matching input ranges.

Looking up matched ranges

I've pushed all sources described below
to new Google code
project; the code is still somewhat rough and exploratory, so
there are likely exciting bugs lurking, but it does seem to work: it
includes (passing!) tests and simple micro-benchmarks.

This worked correctly, but I realized there must be non-trivial
overhead for the recursion, checking for nulls, the for loop over the
output values, etc. Next, I tried switching to parallel arrays to
hold the binary tree
(ArrayLongRangeMultiSet),
where the left child of node N is at 2*N and the right child is at
2*N+1, but this turned out to be slower.

After that I tested a code specializing implementation, first by
creating dynamic Java source code from the binary tree. This eliminates the
recursion and creates a single simple method that uses a series of
if statements, specialized to the specific ranges, to do the binary
search and record the range indices. Here's the resulting specialized
code, compiled from the above ranges:

Finally, using the ASM library, I
compiled the tree directly to
specialized Java
bytecode, and this proved to be fastest (up to 2.5X faster in some
cases).

As a baseline, I also added the simple linear
search
method, LinearLongRangeMultiSet;
as long as you don't have too many ranges (say 10 or less), its
performance is often better than the Java segment tree.

The implementation also allows you to specify the allowed range of
input values (for example, maybe all values are >=0 in your usage),
which can save an if statement or two in the lookup method.

Counting all matched ranges

While the segment tree allows you to quickly look up all matching
ranges for a given value, after a nice tip from fellow Lucene
committee Robert Muir, we realized Lucene's range faceting does not
need to know the ranges for each value; instead, it only requires the
aggregated counts for each range in the end, after seeing many values.

This leads to an optimization: compute counts for each elementary
interval and then in the end, roll up those counts to get the count
for each range. This will only work for single-valued fields, since
for a multi-valued field you'd need to carefully never increment the
same range more than once per hit.

So based on that approach, I created a
new LongRangeCounter
abstract base class, and
the SimpleLongRangeCounter
Java implementation, and also the ASM specialized version, and the
results are indeed faster (~20 to 50%) than using the lookup method to
count; I'll use this approach with Lucene.

Segment trees are normally always "perfectly" balanced but one final
twist I explored was to use a training set of values to bias
the order of the if statements. For example, if your ranges cover a
tiny portion of the search space, as is the case for
the Updated drill-down, then it should be faster to use a
slightly unbalanced tree, by first checking if the value is less
than the maximum range. However, in testing, while there are some
cases where this "training" is a bit faster, often it's slower; I'm
not sure why.

Lucene

I haven't folded this into Lucene yet, but I plan to; all the
exploratory code lives in
the segment-trees
Google code project for now.

Results on the micro-benchmarks can be entirely different once the
implementations are folded into a "real" search application.
While ASM is a powerful way to
generate specialized code, and it gives sizable performance gains at
least in the micro-benchmarks, it is an added dependency and
complexity for ongoing development and many more developers know Java than
ASM. It may also confuse hotspot,
causing deoptimizations when there are multiple implementations for an
abstract base class. Furthermore, if there are many ranges, the
resulting specialized bytecode can be become quite large (but,
still O(N*log(N)) in size), which may cause other
problems. On balance I'm not sure the sizable performance gains
(on a micro-benchmark) warrant using ASM in Lucene's range faceting.

8 comments:

I looked at RangeAccumulator (4.6.0) and I see that indeed it scans through all the ranges (I thought it breaks on the first matching range). So while it indeed supports overlapping ranges, I wonder if the accumulator was specialized to non-overlapping ranges, and breaking after the first range that matches, if we gained anything? This is also true for the interval tree right? You won't need to rollup in the end?

We could specialize and break after the first one ... I think how much that would gain would depend on the data, i.e. how often do the "early" ranges match vs later ones. For use cases like Updated, the vast majority of hits don't match any ranges.

I'm not sure that a similar "count only elementary intervals and rollup in the end" approach can work with interval trees? They seem to require that you count every matching interval for each point you look up?

I suspect the simple binary search + rollup in the end should perform best, and doesn't need to be specialized for the overlapped / non-overlapped cases. Essentially, that nice optimization reduces the overlapping case back to a non-overlapped case (plus the rollup in the end to reconstitute the "original" ranges).

A very inspiring article, Michael! Thank you for your time and sharing.I'm on the process of learning Lucene as part of getting to know the tools of my trade, and I find it fascinating, the endless possibilities and applications of the technology. Articles like the one you wrote shows a glimpse of the amazing things that can be built for enterprise software, beyond the usual and boring (after a while) Spring/JEE Stack.

Hello Michael. I am starting to use Lucene more frequently in the projects. As my use case varies from 1 point to another, I am exploring more and more of Lucene. As such, I am finding it more fascinating and have started developing a strong affinity for Lucene.

As such I want to explore the source code of Lucene, so that I can explore and contribute to the Source Code. I would like to please let me know, how do I start exploring the lucene source code. I have already downloaded the 4.6 Lucene source code. Want to know the starting point and also how I can contribute to the project...

I might sound ambitious but nonetheless, as I said, Lucene seems to be very interesting and very addictive.

You can just poke around the source code you checked out, you could have a look at the many open Jira issues (bugs & new features & wishes, etc.), here: http://jirasearch.mikemccandless.com Find something simple/small to fix first and post a patch.

Have a look at the Wiki, e.g.: http://wiki.apache.org/lucene-java/HowToContribute

And subscribe to the dev list (send any email to dev-subscribe@lucene.apache.org) and see what's being discussed.

Subscribe To

About Me

Michael loves building software; he's been building search engines for more than a decade. In 1999 he co-founded iPhrase Technologies, a startup providing a user-centric enterprise search application, written primarily in Python and C. After IBM acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a committer in 2006 and PMC member in 2008. Michael has remained an active committer, helping to push Lucene to new places in recent years. He's co-author of Lucene in Action, 2nd edition. In his spare time Michael enjoys building his own computers, writing software to control his house (mostly in Python), encoding videos and tinkering with all sorts of other things.