Blog dedicated to Elasticsearch Server Books series

ElasticSearch 0.90 – Similarities


The next ElasticSearch 0.90 feature we would like to discuss is again bound to what Lucene 4.0 introduced: the changes in the API of the classes responsible for the scoring formula. In addition to the changed API, Apache Lucene 4.0 introduced a few relevance calculation formulas that are available out of the box. Starting with ElasticSearch 0.90.0.Beta1, we can use those new scoring formulas as well.

Introduction

We won’t be talking about the API changes, because from the user’s perspective they don’t matter unless you want to develop your own custom Similarity class. Apart from scripts, we haven’t talked about writing custom plug-ins for ElasticSearch, and we will stick with that, at least for now. Let’s focus on what is available from the ElasticSearch user’s perspective.

Introduced Similarities

Apache Lucene 4.0 (and thus 4.1, on which ElasticSearch 0.90.0.Beta1 is based) introduced the following similarity implementations:

BM25 – Similarity class based on a probabilistic model that estimates the probability that a document is relevant to a given query. More information about this Similarity and the math behind it can be found on the Wikipedia page dedicated to it (http://en.wikipedia.org/wiki/Okapi_BM25). This Similarity can be used in ElasticSearch under the name BM25.

I know, that is a lot of not-so-clear information, but I promise I’ll stop now and get to the point: showing you how to use these in ElasticSearch.

Similarities Available in ElasticSearch

All the above-mentioned Similarities are available in ElasticSearch, but some of them require additional configuration. The TF/IDF and BM25 similarities can be used without any additional configuration, just by adding them to your field definition. The ones that require additional configuration are the last two, the DFR and IB similarities. We will show you how to configure both of them at the end of this post.

Mappings Once Again

Before continuing, let’s recall the mappings for the post type that were presented in the first chapter of the book.

Specifying Similarity on per-field Basis

In order to tell ElasticSearch that we want to use a similarity other than the default TF/IDF, we need to add the similarity property to our field definition. So, if we would like to use the BM25 similarity for our name field, we would have the following field definition:
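As a minimal sketch of such a definition: only the similarity property comes from the text above. The type, store and index attributes are assumptions about how the name field is mapped, and the fragment would sit inside the properties section of the type mapping.

```json
{
  "name": {
    "type": "string",
    "store": "yes",
    "index": "analyzed",
    "similarity": "BM25"
  }
}
```

Fields that do not specify the similarity property keep using the default TF/IDF similarity.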

DFR Similarity

To use the DFR similarity, we add a similarity section containing our similarity configuration to the index settings. Our configured DFR similarity will be available under the name esserverbook_dfr_similarity. The possible options for the basic_model property are: be, d, g, if, in, ine and p. The possible options for the after_effect property are no, b and l. The normalization can be no, h1, h2, h3 or z. In addition, for the h1 normalization we can specify the normalization.h1.c property, for h2 the normalization.h2.c property, for h3 the normalization.h3.c property, and for the z normalization the normalization.z.z property.
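Putting the options above together, an index creation body with a DFR similarity might look like the sketch below. The similarity name and the property names come from the text. The chosen values (g, l, h2 and the 2.0 constant) are only illustrative picks from the allowed options, and wrapping the section in settings and index is an assumption about how the index is created.

```json
{
  "settings": {
    "index": {
      "similarity": {
        "esserverbook_dfr_similarity": {
          "type": "DFR",
          "basic_model": "g",
          "after_effect": "l",
          "normalization": "h2",
          "normalization.h2.c": "2.0"
        }
      }
    }
  }
}
```

A field would then select it by setting its similarity property to esserverbook_dfr_similarity, in the same way as the built-in BM25 name.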

IB Similarity

Now let’s have a look at the IB similarity configuration. As with the DFR similarity, we need to add a similarity section containing our similarity configuration to the index settings.

Our configured IB similarity will be available under the name esserverbook_ib_similarity. The possible options for the distribution property are ll and spl. The possible options for the lambda property are df and ttf. The normalization can be no, h1, h2, h3 or z. As with DFR, for the h1 normalization we can specify the normalization.h1.c property, for h2 the normalization.h2.c property, for h3 the normalization.h3.c property, and for the z normalization the normalization.z.z property.
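For illustration, an IB similarity built from these options might be configured as in the sketch below. The similarity name and the property names are taken from the text. The concrete values (ll, df, z and the 0.25 constant) are arbitrary choices among the allowed options, and the surrounding settings and index wrappers are an assumption about the index creation request.

```json
{
  "settings": {
    "index": {
      "similarity": {
        "esserverbook_ib_similarity": {
          "type": "IB",
          "distribution": "ll",
          "lambda": "df",
          "normalization": "z",
          "normalization.z.z": "0.25"
        }
      }
    }
  }
}
```

As with DFR, a field picks this similarity up by using esserverbook_ib_similarity as the value of its similarity property.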