Analyzers are used for creating fulltext indexes. They take the content of a
field and split it into tokens, which are what is actually searched. Analyzers
filter, reorder, and/or transform the content of a field before it becomes the
final stream of tokens.

An analyzer consists of one tokenizer, zero or more token-filters, and zero or
more char-filters.

When field content is analyzed to become a stream of tokens, the char-filters
are applied first. They are used to filter special characters out of the stream
of characters that makes up the content.

Tokenizers split the possibly filtered stream of characters into tokens.

Token-filters can add tokens, delete tokens or transform them.
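
Taken together, a custom analyzer wires these three stages up in order. The
following is a minimal sketch, assuming CrateDB's CREATE ANALYZER syntax; the
analyzer name and the chosen tokenizer and filters are illustrative, not
prescriptive::

    CREATE ANALYZER myanalyzer (
        TOKENIZER whitespace,       -- splits the char-filtered stream at whitespace
        TOKEN_FILTERS (
            lowercase               -- transforms every token to lower case
        ),
        CHAR_FILTERS (
            html_strip              -- applied first: strips HTML markup from the raw content
        )
    );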

With these elements in place, analyzers provide fine-grained control over
building the token stream used for fulltext search. For example, you can use
language-specific analyzers, tokenizers, and token-filters to get proper search
results for data provided in a certain language.

The built-in analyzers, tokenizers, token-filters, and char-filters are listed
below. They can be used as they are or can be extended.
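
For instance, extending a built-in analyzer with different parameters might
look like the following hedged sketch, which assumes the snowball analyzer
accepts a language parameter::

    CREATE ANALYZER german_snowball EXTENDS snowball WITH (
        language = 'german'
    );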

The fingerprint analyzer implements a fingerprinting algorithm which is used by
the OpenRefine project to assist in clustering. Input text is lowercased,
normalized to remove extended characters, sorted, deduplicated, and
concatenated into a single token. If a stopword list is configured, stop words
will also be removed. It uses the standard tokenizer and the following
token-filters: lowercase, asciifolding, fingerprint, and stop.
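
A hedged usage sketch, assuming the analyzer is registered under the name
fingerprint and that a fulltext index can reference it by name; the resulting
token is shown as a comment::

    CREATE TABLE notes (
        body TEXT INDEX USING FULLTEXT WITH (analyzer = 'fingerprint')
    );

    -- Analyzing 'Yesterday, all my troubles seemed so far away.'
    -- produces the single token:
    --   'all away far my seemed so troubles yesterday'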

The tokenizer of type standard provides grammar-based tokenization and is a
good choice for most European-language documents. It implements the Unicode
Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
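
A hedged sketch of a custom analyzer built on this tokenizer; type and
max_token_length follow common parameter naming, everything else is
illustrative::

    CREATE ANALYZER my_standard (
        TOKENIZER std_tok WITH (
            type = 'standard',
            max_token_length = 255    -- tokens longer than this are split
        ),
        TOKEN_FILTERS (
            lowercase
        )
    );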

This tokenizer is a grammar-based tokenizer that is good for English-language
documents. It has heuristics for the special treatment of acronyms, company
names, email addresses, and internet host names. However, these rules don't
always work, and the tokenizer doesn't work well for most languages other than
English.

The input to the stemming filter must already be in lower case, so you will
need to use the Lower Case Token Filter or the Lower Case Tokenizer earlier in
the analyzer chain in order for this to work properly. For example, when using
a custom analyzer, make sure the lowercase filter comes before the porterStem
filter in the list of filters, as shown in the sketch below.
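
A hedged sketch of such a custom analyzer; the registered name of the stemming
filter is assumed to be porter_stem, and the crucial point is only the filter
order::

    CREATE ANALYZER my_stemming (
        TOKENIZER standard,
        TOKEN_FILTERS (
            lowercase,       -- must come first: the stemmer expects lower-case input
            porter_stem      -- assumed registered name of the Porter stemming filter
        )
    );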

Basic support for Hunspell stemming. Hunspell dictionaries are picked up from
the dedicated directory <path.conf>/hunspell. Each dictionary is expected to
have its own directory, named after its associated locale (language), holding
both the *.aff and *.dic files (all of which are picked up automatically).
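
A hedged sketch, assuming a dictionary installed under
<path.conf>/hunspell/en_US and a locale parameter on the filter; the filter
name en_stemmer is a placeholder::

    CREATE ANALYZER en_hunspell (
        TOKENIZER standard,
        TOKEN_FILTERS (
            lowercase,
            en_stemmer WITH (
                type = 'hunspell',
                locale = 'en_US'   -- resolves to <path.conf>/hunspell/en_US/*.aff and *.dic
            )
        )
    );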

Splits tokens at a delimiter (default |) into the real token being indexed and
a payload that is stored additionally in the index. For example,
Trillian|65535 will be indexed as the token Trillian with 65535 as its payload.
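
A hedged configuration sketch; the type name delimited_payload and the
encoding parameter are assumptions based on common usage::

    CREATE ANALYZER payload_analyzer (
        TOKENIZER whitespace,
        TOKEN_FILTERS (
            my_payload WITH (
                type = 'delimited_payload',   -- assumed registered type name
                delimiter = '|',              -- split point between token and payload
                encoding = 'int'              -- store 65535 as an integer payload
            )
        )
    );

    -- 'Trillian|65535' is indexed as the token 'Trillian' with 65535 as payload.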

Whether or not to fill empty buckets with the value of the first non-empty
bucket to their circular right. Only takes effect if hash_set_size is equal
to one. Defaults to true if bucket_count is greater than one; otherwise
false.
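
This option describes the rotation-fill behaviour of a MinHash-style token
filter; in the following hedged sketch, the type name min_hash and the
parameter name with_rotation are assumptions::

    CREATE ANALYZER minhash_analyzer (
        TOKENIZER standard,
        TOKEN_FILTERS (
            my_minhash WITH (
                type = 'min_hash',       -- assumed registered type name
                bucket_count = 512,
                hash_set_size = 1,
                with_rotation = true     -- fill empty buckets from their circular right
            )
        )
    );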

Emits a single token, which is useful for fingerprinting a body of text and/or
providing a token that can be clustered on. It does this by sorting the
tokens, deduplicating them, and then concatenating them back into a single
token.
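
A hedged sketch of an analyzer using this filter, with the transformation it
performs shown as a comment::

    CREATE ANALYZER fingerprinting (
        TOKENIZER standard,
        TOKEN_FILTERS (
            lowercase,
            fingerprint    -- sorts, deduplicates, and concatenates the tokens
        )
    );

    -- 'the quick quick fox' becomes the single token 'fox quick the'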