Custom analyzer for fulltext search in Neo4j

We have already blogged about fulltext search available in Neo4j 3.5.
The list of available analyzers covers many languages and fits various use cases.
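If you are curious which analyzers your Neo4j 3.5 instance ships with, you can list them with a built-in procedure:

```cypher
// Lists all analyzers available for fulltext indexes, with a short description
CALL db.index.fulltext.listAvailableAnalyzers();
```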
However, once you expose the search to real users, they will start pointing out edge cases and complaining that the search is not Google-like.

Speakers of languages that use accents in their written form quite often leave the accents out.
There are various reasons for this; the most common are:

- historical: different character encodings used to cause problems, and users find it hard to change their habits
- a different default keyboard layout is in use (e.g. en_US), and switching the layout just for a search keyword is annoying
- the accented letters are in the top keyboard row and are slightly harder to reach, reducing WPM/CPM (words per minute, characters per minute)

A common complaint among such users is that the search doesn’t ignore the accents.
Let’s look at an example with Czech names and the provided Czech analyzer. We will create some sample data and a fulltext index for the name property.
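A minimal sketch of the setup, using hypothetical Person nodes and an index name of our choosing (the `names` index and the sample names are illustrative, not from the original project):

```cypher
// Sample data: Person nodes with accented Czech names
CREATE (:Person {name: "Antonín Dvořák"}),
       (:Person {name: "Božena Němcová"}),
       (:Person {name: "Karel Čapek"});

// Fulltext index on the name property, using the built-in "czech" analyzer
CALL db.index.fulltext.createNodeIndex("names", ["Person"], ["name"], {analyzer: "czech"});

// Searching with accents works as expected...
CALL db.index.fulltext.queryNodes("names", "Dvořák");

// ...but the accent-less query a real user is likely to type finds nothing
CALL db.index.fulltext.queryNodes("names", "Dvorak");
```

The last query illustrates the complaint: the stock analyzer indexes the accented tokens verbatim, so an unaccented search term never matches them.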

What is missing is a step which would remove the accents.
Lucene already provides classes for this step, such as ASCIIFoldingFilter or ICUFoldingFilter (from the lucene-analyzers-icu package).
Because the CzechStemFilter expects tokens with accents, we will add the folding filter as the last step.
The new custom CzechAnalyzer will look as follows:
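A sketch of what such an analyzer could look like, assuming the Lucene version bundled with Neo4j 3.5; the class name FoldingCzechAnalyzer is our own, and the filter chain mirrors Lucene's stock CzechAnalyzer with the folding step appended:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.cz.CzechAnalyzer;
import org.apache.lucene.analysis.cz.CzechStemFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Hypothetical accent-folding variant of Lucene's CzechAnalyzer.
// The chain is the same as in the stock analyzer, with ASCIIFoldingFilter
// appended AFTER CzechStemFilter, because the stemmer expects accented tokens.
public class FoldingCzechAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        result = new StopFilter(result, CzechAnalyzer.getDefaultStopSet());
        result = new CzechStemFilter(result);     // stemming needs accents intact
        result = new ASCIIFoldingFilter(result);  // strip accents as the last step
        return new TokenStreamComponents(source, result);
    }
}
```

To make Neo4j pick the analyzer up, it still has to be wrapped in an AnalyzerProvider implementation and registered via the ServiceLoader mechanism (a provider entry under META-INF/services on the database's classpath); the index is then created with the provider's analyzer name instead of "czech".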

Conclusion

The modification of the CzechAnalyzer was rather simple, but the same approach can be used to address a wide range of use cases.
You can check out the whole example project on GitHub,
or drop us a line if you need help with more sophisticated requirements.