What are Cloudant Analyzers?

Analyzers are settings which define how to recognize terms within text. This can be helpful if you need to index multiple languages.
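As a rough illustration of where this setting lives, here is a minimal sketch of a search index design document built in Python. The document id, index name and field names (lyrics, by_text, text_en, text_ja) are made up for this example; only the "analyzer" and "index" keys and the "perfield" form follow the Cloudant Search documentation as I understand it.

```python
import json

# JavaScript index function stored as a string inside the design document.
# The field names text_en and text_ja are hypothetical.
INDEX_FUNCTION = """function (doc) {
  if (doc.text_en) { index("text_en", doc.text_en); }
  if (doc.text_ja) { index("text_ja", doc.text_ja); }
}"""


def search_design_doc(analyzer):
    """Build a design document whose search index uses the given analyzer.

    `analyzer` is either an analyzer name ("standard", "keyword", ...)
    or a per-field definition (a dict with "name": "perfield").
    """
    return {
        "_id": "_design/lyrics",  # hypothetical design document id
        "indexes": {
            "by_text": {  # hypothetical index name
                "analyzer": analyzer,
                "index": INDEX_FUNCTION,
            }
        },
    }


# One analyzer for every field.
single = search_design_doc("standard")

# A different analyzer per field, which is the multi-language case.
per_field = search_design_doc({
    "name": "perfield",
    "default": "english",
    "fields": {"text_ja": "cjk"},
})

print(json.dumps(per_field, indent=2, ensure_ascii=False))
```

The per-field form is probably the one to reach for when a single database holds text in more than one language.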

I decided to try some of the analyzers and put my results here. The test text mixes two completely different languages (English and Japanese) with a few other symbols, which I think is enough to find out how the different analyzers behave.

Experiment

We're caught in a trap\n
I can't walk out\n
Because I love you too much baby♬.\n
Sent by nacho@email.com at 2016-07-07 18:00:29 +0900 ★\n
はまった罠から\n
出られないんだ\n
ほんとに君に首ったけなんだ♬.\n
２０１６年７月７日１８時０分２９秒にnacho@email.jpより ★

I added new lines \n, ♬★ marks, an email and a timestamp to make it more interesting ;-]
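One way to reproduce this kind of experiment is Cloudant's _search_analyze endpoint, which takes an analyzer name and a text and returns the tokens. The sketch below only builds the requests; the account URL is a placeholder, authentication is left out, and the shortened test text is just for illustration.

```python
import json
import urllib.request

# Analyzers mentioned in this post.
ANALYZERS = ["standard", "classic", "email", "whitespace", "keyword",
             "english", "spanish", "cjk", "arabic"]

# Shortened version of the test text, for illustration only.
TEST_TEXT = (
    "We're caught in a trap\n"
    "Sent by nacho@email.com at 2016-07-07 18:00:29 +0900 ★\n"
    "はまった罠から\n"
)


def build_request(account_url, analyzer, text):
    """Build (but do not send) a POST request to /_search_analyze."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        account_url.rstrip("/") + "/_search_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(account_url, analyzer, text):
    """Send the request and return the token list (needs real credentials)."""
    with urllib.request.urlopen(build_request(account_url, analyzer, text)) as resp:
        return json.load(resp)["tokens"]


# Placeholder account URL; replace it with a real one before calling analyze().
requests = [build_request("https://ACCOUNT.cloudant.com", name, TEST_TEXT)
            for name in ANALYZERS]
print(requests[0].full_url)
```

Calling analyze() once per name in ANALYZERS should give one token list per analyzer, which is enough to compare them side by side.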

Commentaries

The only noticeable difference between standard and classic is the treatment of "2016-07-07" and emails. standard got the emails wrong.

email looks like standard but gets the emails right.

whitespace looks like a good option when the text has symbols (e.g. ♬, ★). They are still there as tokens!
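To see why the symbols survive, here is a tiny approximation of what a whitespace analyzer does, using Python's str.split(). This is only my stand-in for the real analyzer, not Cloudant's implementation, but the idea (split on whitespace, touch nothing else) is the same.

```python
def whitespace_tokens(text):
    """Split on runs of whitespace and leave everything else untouched."""
    return text.split()


line = "Sent by nacho@email.com at 2016-07-07 18:00:29 +0900 ★\n"
print(whitespace_tokens(line))
# ['Sent', 'by', 'nacho@email.com', 'at', '2016-07-07', '18:00:29', '+0900', '★']

print(whitespace_tokens("Because I love you too much baby♬.\n")[-1])
# baby♬.
```

Note that a symbol only becomes its own token when it is surrounded by whitespace; otherwise it stays glued to the neighbouring word, as in "baby♬.".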

keyword produces just one token. I think it would be useful for exact-match searches.

None of the analyzers, not even keyword, preserved the newline character \n.

The same string can appear more than once in the token list (e.g. "i" and "★").

I find it funny that "we're" became "we'r" when english was used. Also, "baby" became "babi".
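My guess is that this comes from stemming: as far as I know, the english analyzer applies a Porter-style stemmer, and one of Porter's rules (step 1c) turns a final "y" into "i" when the rest of the word contains a vowel. The sketch below implements only that single rule, as an assumption about what is going on, not the full stemmer.

```python
VOWELS = set("aeiou")


def is_vowel(word, i):
    """Porter's vowel test: a, e, i, o, u, or 'y' right after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return True
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)


def step_1c(word):
    """Porter step 1c only: final 'y' becomes 'i' if the stem has a vowel."""
    if word.endswith("y"):
        stem = word[:-1]
        if any(is_vowel(stem, i) for i in range(len(stem))):
            return stem + "i"
    return word


print(step_1c("baby"))  # babi
print(step_1c("sky"))   # sky (the stem "sk" has no vowel)
```

The point of the rule is that "baby" and "babies" end up with the same stem, so a search for one can match the other.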

With english, Japanese words were tokenized character by character. That makes it really hard (if it is possible at all) to do a useful search in Japanese.

I am depressed that spanish didn't get my name "nacho" right. For some reason it became "nach". I guess it is trying to get the root of the word, since there are also nacha, nachito, nachita, nachos, nachas, etc.

cjk (Chinese/Japanese/Korean) completely broke the Japanese words. I know Japanese is hard to parse, but cjk is not helping at all here.
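As far as I understand it, the Lucene CJK analyzer behind cjk does not look for word boundaries at all; it emits overlapping pairs of characters (bigrams). The sketch below is my own approximation of that idea on one line of the test text. The real analyzer does more (it also handles Latin text and normalizes character width), so the actual tokens may differ.

```python
def cjk_bigrams(text):
    """Return overlapping two-character tokens for a run of CJK characters."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(cjk_bigrams("はまった罠から"))
# ['はま', 'まっ', 'った', 'た罠', '罠か', 'から']
```

Only some of those pairs happen to line up with real words, which would explain why the result looks so broken to a reader.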

I have no idea why I tried arabic. I have literally zero knowledge of the language, so I can't comment on it.