Can you dig it?

Understanding ElasticSearch Analyzers

Sep 14th, 2013

Sadly, lots of early Internet beer recipes aren’t in an easily digestible format; that is, these recipes are unstructured, intermixed lists of directions and ingredients, often originally composed in an email or forum post.

So while it’s hard to put these recipes into traditional data stores (ostensibly for easier searching), they’re perfect for ElasticSearch in their current form.
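Here’s a sketch of indexing one of these recipes (the index name recipes, the type beer, and everything in the ingredients string apart from the phrase about oranges and lemons are my assumptions; this requires a local ElasticSearch node on the default port):

```shell
# Index a hypothetical recipe document with the ID beer_1.
# The ingredients field is one big unstructured string, as if pasted
# straight out of an old email or forum post.
curl -X PUT 'http://localhost:9200/recipes/beer/beer_1' -d '{
  "name": "Summer Citrus Ale",
  "ingredients": "6 lbs light malt extract, 1 oz Cascade hops, the zest of two table oranges and two lemons, and a packet of ale yeast"
}'
```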

Note how the interesting part of a recipe JSON document, dubbed beer_1, is found in the ingredients field. This field is basically a big string of valuable text (you can imagine how this string was essentially the body of an email). So while the ingredients field is unstructured, it’s clearly something people will want to search on.

It’s a hot summer’s day and I’m thinking I’d like to make a beer with lemon as an ingredient (to be clear: I want to use lemon zest, which is obtained from a lemon peel). So naturally, I need to find (i.e. search for) a recipe with lemons in it.

If you look closely at the earlier code example (specifically, the beer_1 JSON document), you can see that the word “lemons” is in the text (i.e. “…two table oranges and two lemons…”). It turns out that, because of how ElasticSearch indexes values by default, a search for “lemon” doesn’t necessarily match; a search for “lemons”, however, does.
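A quick query-string search makes the point (a sketch, assuming the hypothetical recipes index above and a local node):

```shell
# Search the ingredients field for the exact term "lemons".
# The same query with "lemon" comes back with no hits.
curl 'http://localhost:9200/recipes/_search?q=ingredients:lemons'
```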

Lo and behold, the search for “lemons” returns a hit! But that’s inconvenient, to say the least. Basically, the words in the ingredients field are tokenized as is; hence, a search for “lemons” works while “lemon” doesn’t. Note: there are various mechanisms for searching, and a wildcard search on “lemon*” would have returned a result.

When a document is added into an ElasticSearch index, its fields are analyzed and converted into tokens. When you execute a search against an index, you search against those tokens. How ElasticSearch tokenizes a document is configurable.
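To see why the exact-token behavior bites, here’s a rough illustration of tokenize-then-match in Python (this is a simplification for intuition, not ElasticSearch’s actual implementation):

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens --
    roughly what a default analyzer does, with no stemming."""
    return re.findall(r"[a-z]+", text.lower())

def matches(query, text):
    """An exact-token search: the query term must appear
    verbatim among the indexed tokens."""
    return query.lower() in tokenize(text)

ingredients = "two table oranges and two lemons"
print(matches("lemons", ingredients))  # True: the literal token is there
print(matches("lemon", ingredients))   # False: "lemon" != "lemons"
```

Because the index stores the token “lemons” exactly as it appeared, the query term “lemon” has nothing to match against.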

There are different ElasticSearch analyzers available, from language analyzers that allow you to support non-English language searches to the snowball analyzer, which converts a word into its root, or stem (the process of reducing a word to its stem is called stemming), yielding a simpler token. For example, the snowball stem of “lemons” is “lemon”. Or, if the words “knocks” and “knocking” were in a snowball-analyzed document, both terms would be stemmed to “knock”.

You can change how documents are tokenized via the index mapping API like so:
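A sketch of that mapping call, again assuming the hypothetical recipes index and beer type from earlier:

```shell
# Map the ingredients field to use the snowball analyzer, so its
# tokens are stemmed at index time (and queries against it are
# stemmed the same way at search time).
curl -X PUT 'http://localhost:9200/recipes/beer/_mapping' -d '{
  "beer": {
    "properties": {
      "ingredients": {
        "type": "string",
        "analyzer": "snowball"
      }
    }
  }
}'
```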

Note how the above mapping specifies that the ingredients field will be analyzed via the snowball analyzer. Also note that you have to change the mapping of an index before you begin adding documents to it! So, in this case, I’ll need to drop the index, run the mapping call above, and then re-add those two recipes.

Now I can begin searching recipes for the ingredient “lemon” or “lemons”.
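With the snowball mapping in place, both forms of the word find the recipe (same assumed index and local node as before):

```shell
# Both queries hit beer_1 now, because "lemons" in the document and
# "lemon" in the query are each reduced to the same stem, "lemon".
curl 'http://localhost:9200/recipes/_search?q=ingredients:lemon'
curl 'http://localhost:9200/recipes/_search?q=ingredients:lemons'
```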

Keep in mind that snowballing can inadvertently make your search results less relevant. Longer words can be stemmed into more common but completely different words. For example, if you snowball a document that contains the word “sextant”, it is stemmed to “sex”. Thus, searches for “sextant” will also return documents that contain the word “sex” (and vice versa).

ElasticSearch puts a powerful search engine into your clutches; plus, with a little forethought into how a document’s contents are analyzed, you’ll make searches even more relevant.