Elasticsearch Tokenizers – Word Oriented Tokenizers

A tokenizer breaks a stream of characters up into individual tokens (characters, words…) and outputs a stream of tokens. A tokenizer can also record the order or position of each term (for phrase and word-proximity queries), and the start and end character offsets of the original word each term represents (for highlighting search snippets).

In this tutorial, we’re going to look at some Word Oriented Tokenizers, which tokenize full text into individual words.

Max Token Length

We can configure the maximum token length (max_token_length, defaults to 255).
If a token exceeds this length, it is split at max_token_length intervals.
For example, if we set max_token_length to 4, QUICK is split into QUIC and K.
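The interval splitting above can be sketched in a few lines of Python (the helper name is ours, not the actual Lucene implementation):

```python
# Hypothetical sketch of how an over-long token is split at
# max_token_length intervals: cut it into consecutive chunks
# of at most max_token_length characters.
def split_long_token(token, max_token_length=255):
    return [token[i:i + max_token_length]
            for i in range(0, len(token), max_token_length)]

print(split_long_token("QUICK", max_token_length=4))  # ['QUIC', 'K']
```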

3. Letter Tokenizer

The letter tokenizer breaks text into terms whenever it encounters a character which is NOT a letter.
It works well for most European languages, but poorly for some Asian languages, where words are not separated by spaces.

For example:

POST _analyze
{
  "tokenizer": "letter",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The term dog's is separated into dog and s, and the number 2 is dropped:

[ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]
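The letter tokenizer's behaviour can be approximated with a simple regular expression that emits maximal runs of letters (Elasticsearch uses the full Unicode letter classes; [A-Za-z] is a simplification that suffices for this English example):

```python
import re

# Rough approximation of the letter tokenizer: every non-letter
# character is a token boundary, so we just collect runs of letters.
def letter_tokenize(text):
    return re.findall(r"[A-Za-z]+", text)

tokens = letter_tokenize("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
# ['The', 'QUICK', 'Brown', 'Foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

Note how the digit 2, the hyphen, and the apostrophe all act as boundaries, which is exactly why dog's comes out as dog and s.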

4. Lowercase Tokenizer

The lowercase tokenizer works like the letter tokenizer, but it also lowercases all terms:

POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Terms:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
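Conceptually this is just letter tokenization followed by lowercasing each term (Lucene does it in a single pass; we split it into two steps here only for clarity):

```python
import re

# Sketch of the lowercase tokenizer: letter tokenization (approximated
# with [A-Za-z] runs) plus lowercasing of every emitted term.
def lowercase_tokenize(text):
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
```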

5. UAX URL Email Tokenizer

The uax_url_email tokenizer is like the standard tokenizer, but it recognises URLs and email addresses as single tokens.

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Contact us via contact@grokonez.com"
}

Terms:

[ Contact, us, via, contact@grokonez.com ]
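The key idea can be illustrated with a toy regex that tries to match an email address first, so it survives as one token, and only then falls back to plain words. The real tokenizer implements the full UAX #29 rules plus URL/email grammars; this is only a sketch:

```python
import re

# Toy illustration of uax_url_email behaviour: the email alternative
# is tried first, so "contact@grokonez.com" is kept as a single token
# instead of being split at "@" and "." like a standard word.
def uax_like_tokenize(text):
    return re.findall(r"[\w.]+@[\w.]+|\w+", text)

print(uax_like_tokenize("Contact us via contact@grokonez.com"))
# ['Contact', 'us', 'via', 'contact@grokonez.com']
```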

Max Token Length

We can configure the maximum token length (max_token_length, defaults to 255).
If a token exceeds this length, it is split at max_token_length intervals.

PUT jsa_index_max_length_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_analyzer": {
          "tokenizer": "jsa_tokenizer"
        }
      },
      "tokenizer": {
        "jsa_tokenizer": {
          "type": "standard",
          "max_token_length": 7
        }
      }
    }
  }
}

POST jsa_index_max_length_test/_analyze
{
  "analyzer": "jsa_analyzer",
  "text": "Contact us via contact@grokonez.com"
}

Terms (the standard tokenizer splits the email at @, then grokonez.com exceeds 7 characters and is chunked at 7-character intervals):

[ Contact, us, via, contact, grokone, z.com ]

6. Classic Tokenizer

The classic tokenizer is good for English language documents. It has heuristics for special treatment of acronyms, company names, email addresses, and internet host names:

– It splits words at most punctuation characters, removing punctuation. However, a dot that’s not followed by whitespace is considered part of a token (jsa. com -> [ jsa, com ], jsa.com -> [ jsa.com ])
– It splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split (Java-Sample -> [ Java, Sample ], Java9-Sample -> [ Java9-Sample ])
– It recognizes email addresses and internet hostnames as one token (like uax_url_email)

However, these rules don’t work well for most languages other than English.
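The hyphen heuristic from the list above can be sketched like this (the helper name is hypothetical; the real tokenizer applies this logic as part of its grammar, not as a separate step):

```python
# Illustrative sketch of the classic tokenizer's hyphen rule:
# a hyphenated token is split at the hyphen, unless it contains a
# digit, in which case it is treated as a product number and kept whole.
def split_hyphenated(token):
    if any(ch.isdigit() for ch in token):
        return [token]          # product number: keep as one token
    return token.split("-")     # plain words: split at the hyphen

print(split_hyphenated("Java-Sample"))   # ['Java', 'Sample']
print(split_hyphenated("Java9-Sample"))  # ['Java9-Sample']
```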

Max Token Length

We can configure maximum token length (max_token_length – Defaults to 255).
If a token exceeds this length, it is split at max_token_length intervals.
