This Week in Elasticsearch and Apache Lucene - 2017-02-06

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

The new unified highlighter solves many of the problems that existed with previous highlighters, including accounting for gaps created by stopword filters. The highlighter can either analyze text directly (plain mode), use postings offsets (postings mode), or use term vectors (fvh mode). This should be your new go-to highlighter, although it is still missing a few features supported by other highlighters, such as an upper limit on fragment size, the ability to highlight a single field based on matches in multiple fields (matched_fields), and the ability to collapse contiguous highlights.
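For readers who want to try it, here is a minimal sketch of a search request body that asks for the unified highlighter through the highlight `type` setting. The field name `body`, the query text, and the helper name `unified_highlight_request` are assumptions made for illustration; the snippet only builds and prints the JSON and does not talk to a cluster.

```python
import json


def unified_highlight_request(field: str, query_text: str) -> dict:
    """Build a search body that highlights `field` with the unified highlighter.

    The helper name and arguments are hypothetical; only the JSON shape matters.
    """
    return {
        "query": {"match": {field: query_text}},
        "highlight": {
            "fields": {
                # "type" selects which highlighter implementation handles the field.
                field: {"type": "unified"}
            }
        },
    }


if __name__ == "__main__":
    # "body" is an assumed field name, not something taken from this post.
    print(json.dumps(unified_highlight_request("body", "quick brown fox"), indent=2))
```

The resulting JSON can be sent as the body of a regular search request against an index that contains the chosen field.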

Groundwork laid for operation-based recovery

The team has been hard at work on the sequence numbers project, and the first high-level feature landed last week. Sequence numbers are now used for operation-based recovery (as opposed to file-based sync). With operation-based recovery, a primary can bring a replica up to speed by streaming only the operations that happened while the replica was offline. This is a great advantage compared to file-based syncing, which potentially requires copying gigabytes of data. Operation-based recovery is only done if the relevant operations can be found in the primary's translog. At the moment the chances of this happening are small, and in practice it will only happen when a replica was temporarily offline. Future work will make this more likely, to the point where operation-based recovery becomes the standard.
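The decision described above can be sketched as a toy model. This is not Elasticsearch's implementation: the `Operation` class, the function names, and the idea of tracking the replica by a single "last applied sequence number" are simplifying assumptions chosen only to show why a complete run of operations in the translog makes file copying unnecessary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Operation:
    """Hypothetical translog entry; source=None stands for a delete."""
    seq_no: int
    doc_id: str
    source: Optional[dict]


def ops_for_recovery(translog: List[Operation],
                     replica_checkpoint: int,
                     primary_max_seq_no: int) -> Optional[List[Operation]]:
    """Return the operations the replica is missing, in sequence order.

    Returns None when the translog no longer holds every needed operation,
    in which case a caller would have to fall back to file-based sync.
    """
    by_seq = {op.seq_no: op for op in translog}
    needed = range(replica_checkpoint + 1, primary_max_seq_no + 1)
    if any(seq not in by_seq for seq in needed):
        return None
    return [by_seq[seq] for seq in needed]


def apply_ops(docs: Dict[str, dict], ops: List[Operation]) -> None:
    """Replay operations onto the replica's (toy) document store in place."""
    for op in ops:
        if op.source is None:
            docs.pop(op.doc_id, None)
        else:
            docs[op.doc_id] = op.source
```

In this model, a replica that was offline only briefly finds everything it needs in the translog and replays a handful of operations. A replica that was away longer gets `None` back, which corresponds to the file-based fallback.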

Lucene discussing optimisations for leading wildcards

Fast infix searching (*abc*) or even just leading wildcard searching (*abc) has come up on Lucene's users list in the past, but has never been implemented, in part because it's seen as an abuse of a search engine: really, you should do a good job tokenizing during indexing up front so that you don't need such costly sub-token operations at search time. But it's also partly because nobody had enough of an itch to actually do the work, that is until now! The initial patch on the issue is too invasive and very heap heavy (it uses suffix arrays), and it doesn't use the most efficient (yet complex) known approach for building suffix arrays. Still, the subsequent discussion is a nice demonstration of how healthy open source iterations unfold: one response is to fold the approach into a custom PostingsFormat, while another is to use FSTs to reduce the heavy heap usage. It's not clear how the issue will finish, but it's possible Lucene will soon offer a better solution for infix searches.
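To make the suffix-based idea concrete, here is a small sketch of infix lookup over a term dictionary. It is not the patch under discussion: it stores every suffix as a separate string and builds the index with a plain sort, which is simple but memory hungry, and the function names are made up for illustration. It does show why sorted suffixes turn an infix query into a prefix lookup.

```python
import bisect
from typing import List, Tuple


def build_suffix_index(terms: List[str]) -> List[Tuple[str, int]]:
    """Return sorted (suffix, term_id) pairs for every suffix of every term.

    Naive construction: materialising all suffixes costs memory quadratic
    in term length, unlike a real suffix array of integer offsets.
    """
    entries = [(term[i:], term_id)
               for term_id, term in enumerate(terms)
               for i in range(len(term))]
    entries.sort()
    return entries


def infix_search(index: List[Tuple[str, int]],
                 terms: List[str],
                 infix: str) -> List[str]:
    """Return the sorted terms containing `infix` as a substring."""
    if not infix:
        return sorted(set(terms))
    # A term contains `infix` exactly when one of its suffixes starts with it,
    # and all such suffixes sit next to each other in the sorted index.
    pos = bisect.bisect_left(index, (infix, -1))
    matched_ids = set()
    while pos < len(index) and index[pos][0].startswith(infix):
        matched_ids.add(index[pos][1])
        pos += 1
    return sorted(terms[term_id] for term_id in matched_ids)
```

Calling `infix_search(build_suffix_index(terms), terms, "bc")` finds the matching terms with one binary search plus a short scan, instead of testing every term in the dictionary.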