Elasticsearch basics - Analyzers

July 24, 2013

Elasticsearch is a powerful open-source search engine built on Apache
Lucene. You can run all kinds of customized searches over huge amounts of
data by creating customized indexes. This post gives an overview of the
analysis module of elasticsearch.

Analyzers help you analyze your data. You need to analyze data both while creating indexes and while searching. You can inspect what your analyzers do using the Analyze API provided by elasticsearch.
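For example, you can run a whole analyzer over a snippet of text and see the tokens it produces (this uses the query-string form of the Analyze API from the 0.90-era releases current at the time of writing):

curl -XGET 'localhost:9200/_analyze?analyzer=standard' -d 'Learn Something New Today!'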

Creating indexes mainly involves three steps:

Pre-processing of raw text using char filters. These can be used to strip HTML tags, or you may define your own custom mapping. (I couldn't find a way to test char filters through the Analyze API; please leave a comment if you know of one.)

Example: You can use a char filter of type html_strip to strip out HTML tags.

A text like this:

<p> Learn Something New Today! which is <b>always</b> fun </p>

would get converted to:

Learn Something New Today! which is always fun
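To sketch how a char filter is wired into an analyzer, the settings below create an index with a custom analyzer that applies html_strip before the standard tokenizer. The index and analyzer names (my_index, html_analyzer) are illustrative, not anything the post defines:

curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard"
        }
      }
    }
  }
}'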

Tokenization of the pre-processed text using tokenizers. Tokenizers break the pre-processed text into tokens. There are different kinds of tokenizers available, and each of them splits the text into words differently. By default, elasticsearch uses the standard tokenizer.

The standard tokenizer normalizes the data. Note that it removes the ! from Today!

A pre-processed text like this:

Learn Something New Today! which is always fun

gets broken into the tokens:

Learn
Something
New
Today
which
is
always
fun

You can check this for yourself using the Analyze API mentioned above:

curl -XGET 'localhost:9200/_analyze?tokenizer=standard' -d 'Learn Something New Today! which is always fun'
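The response lists one JSON entry per token, along these lines (shape as returned by the 0.90-era Analyze API; I show only the first entry, and field values such as offsets will vary with your input):

{
  "tokens": [
    { "token": "Learn", "start_offset": 0, "end_offset": 5, "type": "<ALPHANUM>", "position": 1 },
    ...
  ]
}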