The code works with a stock Lucene 4.3.0 JAR and default codec, and
has a trivial API: just call NativeSearch.search instead
of IndexSearcher.search.

A quick update: I've now optimized PhraseQuery as well:

Task          QPS base  StdDev base  QPS opt  StdDev opt  Speedup
HighPhrase         3.5       (2.7%)      6.5      (0.4%)    1.9 X
MedPhrase         27.1       (1.4%)     51.9      (0.3%)    1.9 X
LowPhrase          7.6       (1.7%)     16.4      (0.3%)    2.2 X

The ~2X speedup (~90% to ~119% faster) is nice!

Again, it's great to see reduced variance in the runtimes since
hotspot is mostly not an issue. It's odd that LowPhrase gets lower QPS
than MedPhrase: these queries look mis-labelled (I see the LowPhrase
queries getting more hits than MedPhrase!).

All changes have been pushed
to lucene-c-boost;
next I'd like to figure out how to get facets working.

Suggest, sometimes called auto-suggest, type-ahead search or
auto-complete, has been an essential search feature ever since Google
added it almost 5 years ago.

Lucene has a number of implementations;
I previously
described AnalyzingSuggester. Since
then, FuzzySuggester was also added, which extends
AnalyzingSuggester by also accepting mis-spelled inputs.

Here I describe our newest
suggester: AnalyzingInfixSuggester, now going through
iterations on
the LUCENE-4845
Jira issue.

Unlike the existing suggesters, which generally find suggestions whose
whole prefix matches the current user input, this suggester will find
matches of tokens anywhere in the user input and in the suggestion;
this is why it has Infix in its name.

The incoming characters can match not just the prefix of each
suggestion but also the prefix of any token within it.

Unlike the existing suggesters, this new suggester does not use a
specialized data-structure such
as FSTs.
Instead, it's an "ordinary" Lucene index under the hood, making use
of EdgeNGramTokenFilter to index the short prefixes of
each token, up to length 3 by default, for fast prefix querying.
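For example, an index-time analysis chain along these lines would produce those short prefixes. This is only a sketch of the n-gram idea, not the suggester's exact analysis chain (the real implementation also keeps longer prefixes searchable), and it assumes the Lucene 4.3 EdgeNGramTokenFilter constructor that takes a Side argument:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    // Sketch: index-time analyzer that emits 1-3 character prefixes of
    // every token, so short infix prefixes can be matched with cheap
    // term queries instead of slow wildcard-style scans.
    Analyzer indexAnalyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String field, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_43, reader);
        TokenStream stream = new LowerCaseFilter(Version.LUCENE_43, source);
        stream = new EdgeNGramTokenFilter(stream, EdgeNGramTokenFilter.Side.FRONT, 1, 3);
        return new TokenStreamComponents(source, stream);
      }
    };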

It also uses the
new index
sorter APIs to pre-sort all postings by suggested weight at index
time, and at lookup time uses a
custom Collector to stop after finding the first N
matching hits since these hits are the best matches when sorting by
weight. The lookup method lets you specify whether all terms must be
found, or any of the terms
(Jira search
requires all terms).
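As a rough sketch of what a lookup looks like from application code (the API is still iterating on the LUCENE-4845 patch, so treat the argument names and order here as assumptions):

    import java.util.List;
    import org.apache.lucene.search.suggest.Lookup.LookupResult;

    // Hypothetical usage based on the description above; the exact
    // signature on the patch may differ.
    List<LookupResult> results = suggester.lookup("top gu", 10,
        true,    // allTermsRequired: every query token must match (Jira search uses this)
        true);   // doHighlight: mark the matched tokens in each suggestion
    for (LookupResult result : results) {
      System.out.println(result.key + " (weight=" + result.value + ")");
    }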

Since the suggestions are sorted solely by weight, with no other
relevance criteria, this suggester is a good fit for applications that
have a strong a-priori weighting for each suggestion, such as a movie
search engine ranking suggestions by popularity, recency or a blend, for
each movie.
In Jira search I
rank each suggestion (Jira issue) by how recently it was updated.

Specifically, there is no penalty for suggestions with matching tokens
far from the beginning, which could mean the relevance is poor in some
cases; an alternative approach (patch is on the issue) uses FSTs
instead, which can require that the matched tokens are within the
first three tokens, for example. This would also be possible
with AnalyzingInfixSuggester using an index-time analyzer
that dropped all but the first three tokens.
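For example, something like this could work. It's a sketch: LimitTokenCountFilter is an existing Lucene filter, but wiring it into the suggester this way is untested:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.miscellaneous.LimitTokenCountFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    // Sketch: index-time analyzer that keeps only the first three
    // tokens of each suggestion, emulating the "matches must be near
    // the front" behavior of the FST-based alternative.
    Analyzer firstThreeTokens = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String field, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_43, reader);
        TokenStream stream = new LimitTokenCountFilter(source, 3);
        return new TokenStreamComponents(source, stream);
      }
    };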

One nice benefit of an index-based approach
is that AnalyzingInfixSuggester handles highlighting of the
matched tokens (e.g. rendering them in red),
which has
unfortunately proven difficult to provide with the FST-based
suggesters. Another benefit is, in theory, the suggester could
support near-real-time indexing, but I haven't exposed that in the
current patch and probably won't for some time (patches welcome!).

Performance is reasonable: somewhere
between AnalyzingSuggester
and FuzzySuggester, between 58 and 100 kQPS (details
on the issue).

Analysis fun

As with AnalyzingSuggester, AnalyzingInfixSuggester
lets you separately configure the index-time vs. search-time
analyzers. With Jira search, I enabled stop-word removal at index
time, but not at search time, so that a query like "or" would still
successfully find any suggestions containing words starting
with "or", rather than dropping the term entirely.
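A sketch of that configuration (assuming the suggester accepts separate index-time and query-time analyzers, as described on the issue):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.util.CharArraySet;
    import org.apache.lucene.util.Version;

    // Sketch: remove stop words at index time only. A partially typed
    // query like "or" is kept at search time, so it can still match
    // suggestions containing words that start with "or".
    Analyzer indexAnalyzer = new StandardAnalyzer(Version.LUCENE_43);  // default stop words removed
    Analyzer queryAnalyzer = new StandardAnalyzer(Version.LUCENE_43, CharArraySet.EMPTY_SET);  // keep everything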

Which suggester should you use for your application? Impossible to
say! You'll have to test each of Lucene's offerings and pick one.
Auto-suggest is an area where one-size-does-not-fit-all, so it's great
that Lucene is picking up a number of competing implementations.
Whichever you use,
please give us
feedback so we can further iterate and improve!

Wednesday, June 19, 2013

At the end of the day, when Lucene executes a query, after the initial
setup the true hot-spot is usually rather basic code that decodes
sequential blocks of integer docIDs, term frequencies and positions,
matches them (e.g. taking union or intersection
for BooleanQuery), computes a score for each hit and
finally saves the hit if it's competitive, during collection.
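In code, that hot loop has roughly this shape (an illustrative sketch of the flow just described, not Lucene's actual implementation):

    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.search.Scorer;

    // Illustrative sketch: step through the matching docIDs in order
    // and hand each to the collector, which scores it and keeps it
    // only if it is competitive.
    static void collect(Scorer scorer, Collector collector) throws java.io.IOException {
      collector.setScorer(scorer);
      int doc;
      while ((doc = scorer.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        collector.collect(doc);  // calls scorer.score() and saves the hit if competitive
      }
    }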

Even apparently complex queries like FuzzyQuery
or WildcardQuery go through a rewrite process that
reduces them to much simpler forms like BooleanQuery.
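For example (a sketch using Lucene's public rewrite API):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;

    // Sketch: rewriting a WildcardQuery against an open IndexReader
    // reduces it to a much simpler query (e.g. a BooleanQuery or
    // TermQuery over the matching terms) before any scoring happens.
    Query query = new WildcardQuery(new Term("body", "luc*"));
    Query rewritten = query.rewrite(reader);  // reader is an open IndexReader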

Lucene's hot-spots are so simple that optimizing them by porting them
to native C++ (via JNI) was too tempting!

So I did just that, creating
the lucene-c-boost
github project, and the resulting speedups are exciting:

Task           QPS base  StdDev base  QPS opt  StdDev opt  Speedup
AndHighLow        469.2       (0.9%)    316.0      (0.7%)    0.7 X
Fuzzy1             63.0       (3.3%)     62.9      (2.0%)    1.0 X
Fuzzy2             25.8       (3.1%)     37.9      (2.3%)    1.5 X
AndHighMed         50.4       (0.7%)    110.0      (0.9%)    2.2 X
OrHighNotLow       46.8       (5.6%)    106.3      (1.3%)    2.3 X
LowTerm           298.6       (1.8%)    691.4      (3.4%)    2.3 X
OrHighNotMed       34.0       (5.3%)     89.2      (1.3%)    2.6 X
OrHighNotHigh       5.0       (5.7%)     14.2      (0.8%)    2.8 X
Wildcard           17.2       (1.2%)     51.1      (9.5%)    3.0 X
AndHighHigh        21.9       (1.0%)     69.0      (1.0%)    3.5 X
OrHighMed          18.7       (5.7%)     59.6      (1.1%)    3.2 X
OrHighHigh          6.7       (5.7%)     21.5      (0.9%)    3.2 X
OrHighLow          15.7       (5.9%)     50.8      (1.2%)    3.2 X
MedTerm            69.8       (4.2%)    243.0      (2.2%)    3.5 X
OrNotHighHigh      13.3       (5.7%)     46.7      (1.4%)    3.5 X
OrNotHighMed       26.7       (5.8%)    115.8      (2.8%)    4.3 X
HighTerm           22.4       (4.2%)    109.2      (1.4%)    4.9 X
Prefix3            10.1       (1.1%)     55.5      (3.7%)    5.5 X
OrNotHighLow       62.9       (5.5%)    351.7      (9.3%)    5.6 X
IntNRQ              5.0       (1.4%)     38.7      (2.1%)    7.8 X

These results are on the full, multi-segment Wikipedia English index
with 33.3 M documents. Besides the amazing speedups, it's also nice
to see that the variance (StdDev column) is generally lower with the
optimized C++ version, because hotspot has (mostly) been taken out of
the equation.

The API is easy to use, and works with the default codec so you won't
have to re-index just to try it out: instead
of IndexSearcher.search, call
NativeSearch.search. If the query can be optimized, it
will be; otherwise it will seamlessly fall back to
IndexSearcher.search. It's fully decoupled from Lucene
and works with the stock Lucene 4.3.0 JAR, using Java's reflection
APIs to grab the necessary bits.
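A sketch of what that looks like in practice (NativeMMapDirectory and NativeSearch come from lucene-c-boost; I'm assuming the obvious constructor and method shapes here, so check the project's README for the exact signatures):

    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // Sketch: open the index with lucene-c-boost's NativeMMapDirectory,
    // then swap IndexSearcher.search for NativeSearch.search.
    // Unoptimizable queries silently fall back to the normal Java path.
    IndexReader reader = DirectoryReader.open(new NativeMMapDirectory(new File("/path/to/index")));
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs hits = NativeSearch.search(searcher, new TermQuery(new Term("body", "lucene")), 10);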

This is all very new code, and I'm sure there are plenty of exciting
bugs, but (after some fun debugging!) all Lucene core tests now pass
when using NativeSearch.search.

This is not a C++ port of Lucene

This code is definitely not a general C++ port of Lucene. Rather, it
implements a very narrow set of classes, specifically the common query
types. The implementations are not general-purpose: they hardwire
(specialize) specific code, removing all abstractions like Scorer,
DocsEnum, Collector, DocValuesProducer,
etc.

There are some major restrictions on when the optimizations will
apply:

Only tested on Linux and Intel CPU so far

Requires Lucene 4.3.x

Must use NativeMMapDirectory as
your Directory implementation, which maps
entire files into RAM (avoids the chunking that the Java-based
MMapDirectory must do)

Must use the default codec

Only sort by score is supported

None of the optimized implementations use advance:
first, this code is rather complex and will be quite a bit of work to port
to C++, and second, queries that benefit from advance are generally
quite fast already so we may as well leave them in Java

BooleanQuery is optimized, but only when all clauses
are TermQuery against the same field.
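For example, a query like this is the only BooleanQuery shape the native path currently handles (a sketch):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // Sketch: all clauses are TermQuerys against the same field
    // ("body"), so this query qualifies for the native optimization;
    // mixing fields or nesting other query types falls back to Java.
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("body", "fast")), BooleanClause.Occur.SHOULD);
    query.add(new TermQuery(new Term("body", "search")), BooleanClause.Occur.SHOULD);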

C++ is not faster than Java!

Not necessarily, anyway: before anyone goes off screaming how these
results "prove" Java is so much slower than C++, remember that this is
far from a "pure" C++ vs Java test. There are at least these three
separate changes mixed in:

Algorithmic changes. For example, lucene-c-boost sometimes uses
BooleanScorer where Lucene is
using BooleanScorer2. Really we need to fix Lucene
to do similar algorithmic changes (when they are faster). In
particular, all of the OrXX queries that include
a Not clause, as well as
IntNRQ in the above results, benefit from
algorithmic changes.

Code specialization. As described above, the implementations hardwire
specific code paths, removing abstractions like Scorer,
DocsEnum and Collector; "matching" specialized Java
code would presumably also see some of these gains.

The port from Java to C++ (via JNI) itself.

It's not at all clear how much of the gains are due to which part;
really I need to create the "matching" specialized Java sources to do
a more pure test.

This code is dangerous!

Specifically, whenever native C++ code is embedded in Java, there is
always the risk of all those fun problems with C++ that we Java
developers thought we left behind. For example, if there are bugs
(likely!), or even innocent API mis-use by the application such as
accidentally closing an IndexReader while other threads
are still using it, the process will hit
a Segmentation
Fault and the OS will destroy the JVM. There may also be memory
leaks! And, yes, the C++ sources even use
the goto statement.

Work in progress...

This is a work in progress and there are still many ideas to explore.
For example, Lucene 4.3.x's default
PostingsFormat stores big-endian longs, which means the
little-endian Intel CPU must do byte-swapping when decoding each
postings block, so one thing to try is a PostingsFormat
better optimized for the CPU at search time. Positional queries,
Filters and nested BooleanQuery are not yet optimized, nor are
certain configurations (e.g., fields that omit norms). Patches
welcome!

Nevertheless, initial results are very promising, and if you are
willing to risk the dangers in exchange for massive speedups please
give it a whirl and report back.

Sunday, June 9, 2013

Have you ever wanted to see what an FST built from your own inputs
and outputs looks like? Then today is your lucky day! I just built
a simple web
application that creates an FST from the input/output strings that
you enter.

If you just want a finite state automaton (no outputs) then enter only
inputs.

If all of your outputs
are non-negative
integers then the FST will use numeric outputs, where you sum up
the outputs as you traverse a path to get the final output.
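Under the hood the web app uses Lucene's FST API; here's a rough sketch of building an FST with summed numeric outputs (exact signatures vary a bit across 4.x releases, e.g. PositiveIntOutputs.getSingleton has taken a boolean argument in some versions, and the weights here are made up):

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.IntsRef;
    import org.apache.lucene.util.fst.Builder;
    import org.apache.lucene.util.fst.FST;
    import org.apache.lucene.util.fst.PositiveIntOutputs;
    import org.apache.lucene.util.fst.Util;

    // Sketch: build an FST mapping input strings to non-negative
    // integer outputs; the output for an input is the sum of the arc
    // outputs along its path.
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRef scratch = new IntsRef();
    builder.add(Util.toIntsRef(new BytesRef("star"), scratch), 7L);  // inputs must be added in sorted order
    builder.add(Util.toIntsRef(new BytesRef("stop"), scratch), 3L);
    FST<Long> fst = builder.finish();
    Long output = Util.get(fst, new BytesRef("star"));  // sums arc outputs along the path: 7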

Finally, if the outputs are non-numeric then they are treated as
strings, in which case you concatenate them as you traverse the path.

The red arcs are the ones with the NEXT optimization: these arcs do
not store a pointer to a node because their to-node is the very next
node in the FST. This is a good optimization: it generally results in
a large reduction of the FST size. The bolded arcs tell you the
next node is final; this is most interesting when a prefix of another
input is itself accepted. For example, with "star" as a prefix of a
longer input, the "r" arc is bolded, telling you that "star" is
accepted. Furthermore, the node following the "r" arc has a final
output, telling you the overall output for "star" is "abc".

The web app is a simple Python WSGI app; source code
is here.
It invokes a simple Java tool as a subprocess; source code (including
generics violations!)
is here.
