"tokenize" is the process that extracts zero or more tokens from a
text. There are some "tokenize" methods.

For example, Hello World is tokenized into the following tokens by
the bigram tokenize method:

He

el

ll

lo

o_ (_ means a white space)

_W (_ means a white space)

Wo

or

rl

ld

In the above example, 10 tokens are extracted from the single text Hello World.

For example, Hello World is tokenized into the following tokens by
the white-space-separate tokenize method:

Hello

World

In the above example, 2 tokens are extracted from the single text Hello World.
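The two tokenize methods above can be sketched in Python. This is a toy illustration of the idea, not Groonga's actual implementation; the function names are ours:

```python
def bigram_tokenize(text):
    """Extract every pair of adjacent characters as a token."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def whitespace_tokenize(text):
    """Split the text on white space."""
    return text.split()

print(bigram_tokenize("Hello World"))
# 10 tokens: ['He', 'el', 'll', 'lo', 'o ', ' W', 'Wo', 'or', 'rl', 'ld']
print(whitespace_tokenize("Hello World"))
# 2 tokens: ['Hello', 'World']
```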

A token is used as a search key. You can find indexed documents only by
tokens that were extracted by the tokenize method in use. For example,
you can find Hello World by ll with the bigram tokenize method, but
you can't find Hello World by ll with the white-space-separate tokenize
method, because the white-space-separate tokenize method doesn't extract
an ll token. It just extracts the Hello and World tokens.
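Why the token set determines what is findable can be sketched with a toy membership test: a full-text index only matches a query that equals an indexed token. This is illustrative code (the bigram helper is our own sketch, not Groonga's):

```python
def bigram_tokenize(text):
    """Toy bigram tokenizer: every pair of adjacent characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

# A document is findable by a query only if that query was extracted
# as a token when the document was indexed.
bigram_tokens = set(bigram_tokenize("Hello World"))
whitespace_tokens = set("Hello World".split())

print("ll" in bigram_tokens)      # True: bigram extracts an ll token
print("ll" in whitespace_tokens)  # False: only Hello and World exist
```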

For example, we can find both Hello World and A or B by or with the
bigram tokenize method. Hello World is noise for people who want to
search for "logical or". It means that precision is decreased, but
recall is increased.

We can find only A or B by or with the white-space-separate tokenize
method, because World is tokenized into the single token World with
the white-space-separate tokenize method. It means that precision is
increased for people who want to search for "logical or", but recall is
decreased because Hello World, which contains or, isn't found.
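The precision/recall trade-off above can be demonstrated with a toy search over the two texts. This is a sketch using our own helper functions, not Groonga's matching logic:

```python
def bigram_tokenize(text):
    """Toy bigram tokenizer: every pair of adjacent characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

docs = ["Hello World", "A or B"]

# Bigram: both texts contain an "or" token, so both match.
# High recall, low precision: "Hello World" is noise for a
# "logical or" search.
bigram_hits = [d for d in docs if "or" in bigram_tokenize(d)]

# White-space-separate: only "A or B" has an "or" token.
# High precision, low recall.
ws_hits = [d for d in docs if "or" in d.split()]

print(bigram_hits)  # ['Hello World', 'A or B']
print(ws_hits)      # ['A or B']
```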

TokenBigram is a bigram-based tokenizer. It's recommended for most
cases.

The bigram tokenize method tokenizes a text into tokens of two adjacent
characters. For example, Hello is tokenized into the following tokens:

He

el

ll

lo

The bigram tokenize method is good for recall because you can find all
texts by a query that consists of two or more characters.

In general, you can't find all texts by a query that consists of a
single character, because no one-character token exists. But in Groonga
you can find all texts even by a one-character query, because Groonga
finds tokens that start with the query by predictive search. For
example, Groonga can find the ll and lo tokens by the query l.
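The predictive-search idea can be sketched as a prefix scan over the token list. Real implementations use a trie or patricia trie rather than a linear scan; this toy version only shows the matching rule:

```python
def bigram_tokenize(text):
    """Toy bigram tokenizer: every pair of adjacent characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

tokens = sorted(set(bigram_tokenize("Hello")))

# A one-character query matches every token that starts with it.
query = "l"
matches = [t for t in tokens if t.startswith(query)]
print(matches)  # ['ll', 'lo']
```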

The bigram tokenize method isn't good for precision because you can find
texts that include the query inside a word. For example, you can find
World by or. This is more of a problem for ASCII-only languages than
for non-ASCII languages. TokenBigram has a solution for this problem,
described below.

TokenBigram behaves differently depending on whether it is used with a
normalizer.

If no normalizer is used, TokenBigram uses the pure bigram tokenize
method (all tokens except the last one have two characters):
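The pure bigram method can be sketched as emitting a token at every character offset, so the final token has only one character. This is a toy sketch of the rule stated above, not Groonga's implementation:

```python
def pure_bigram_tokenize(text):
    """Emit a token at every offset; only the last token is one character."""
    return [text[i:i + 2] for i in range(len(text))]

print(pure_bigram_tokenize("Hello"))
# ['He', 'el', 'll', 'lo', 'o'] -- the final token has a single character
```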

TokenBigramSplitSymbolAlpha is similar to TokenBigram. The
difference between them is symbol and alphabet handling:
TokenBigramSplitSymbolAlpha tokenizes symbols and alphabetic
characters by the bigram tokenize method:

TokenBigramSplitSymbolAlphaDigit is similar to
TokenBigram. The difference between them is symbol, alphabet
and digit handling: TokenBigramSplitSymbolAlphaDigit tokenizes
symbols, alphabetic characters and digits by the bigram tokenize
method. It means that all characters are tokenized by the bigram
tokenize method:
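The difference between the TokenBigram variants can be sketched as a character-class-aware tokenizer: runs of characters whose class is not bigram-split are kept as whole tokens, while the selected classes are split into bigrams. This is a simplified toy model (it skips the trailing one-character token and normalization), and the function and parameter names are our own, not Groonga's API:

```python
import re

def tokenize(text, split_classes):
    """Toy sketch: bigram-split runs whose class is in split_classes;
    keep other runs (e.g. whole words) as single tokens.
    Classes are 'alpha', 'digit' and 'symbol'; white space is skipped."""
    tokens = []
    for run in re.findall(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]+", text):
        if run[0].isalpha():
            cls = "alpha"
        elif run[0].isdigit():
            cls = "digit"
        else:
            cls = "symbol"
        if cls in split_classes and len(run) > 1:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            tokens.append(run)
    return tokens

text = "Hello123!"
# TokenBigram-like behavior (with a normalizer): runs kept whole.
print(tokenize(text, set()))
# ['Hello', '123', '!']
# TokenBigramSplitSymbolAlphaDigit-like: every class bigram-split.
print(tokenize(text, {"alpha", "digit", "symbol"}))
# ['He', 'el', 'll', 'lo', '12', '23', '!']
```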