Full functionality of this feature is available only in midPoint 3.8 and later. PolyString normalization has been part of midPoint almost since the beginning, but it was not configurable before that release.

Introduction

PolyString is a flexible data structure that serves several purposes. One of them is to allow convenient text search in an international environment. The basic idea is that if a user searches for naiveboy123, then the users naïveBOY123 and NaiveBoy123 are found as well. The other effect is that once someone has registered the username naiveboy123, no other user can register naïveBOY123, NaiveBoy123 or NAIVEboy123.

This functionality is sometimes achieved by using the native full-text search capabilities of the database. However, those database features are often difficult to configure and they may also be somewhat expensive. Most importantly, they are specific to individual databases and there is no practical standard. As midPoint supports several databases, it would be very difficult to support those capabilities in all of them. Therefore midPoint uses a much simpler approach.

PolyString stores the value in two different forms:

Original form (orig): the text that was entered by the user. This may contain international characters, any amount of whitespace and so on. E.g. "Coup D'état".

Normalized form (norm): the text that was simplified, cleaned up, transformed to a canonical form or otherwise prepared for storage. E.g. "coup detat".

Both orig and norm forms are stored in the database. The orig form is used for the vast majority of purposes: displaying the value, editing the value and so on. However, when it comes to searching or a uniqueness check, the norm value is used. Searching works like this:

A careless user enters " Coup d`etat " into the search input field.

MidPoint normalizes the value. The result is "coup detat".

MidPoint looks in the database for all entries whose norm value equals "coup detat".

An entry is found. The orig part of the matching entry is displayed: "Coup D'état".
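The search steps above, together with the uniqueness check, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the normalize function is a stand-in for the default normalizer (not midPoint's actual code), and the Repository class is a hypothetical in-memory substitute for the database.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Optional


def normalize(text: str) -> str:
    """Assumed stand-in for the default normalizer, not midPoint's actual code."""
    s = unicodedata.normalize("NFKD", text.strip()).lower()
    s = re.sub(r"[^a-z0-9\s]", "", s)      # drop accents, quotes, punctuation
    return re.sub(r"\s+", " ", s).strip()  # collapse runs of whitespace


@dataclass(frozen=True)
class PolyString:
    orig: str  # text as entered by the user
    norm: str  # normalized form used for matching

    @classmethod
    def of(cls, text: str) -> "PolyString":
        return cls(orig=text, norm=normalize(text))


class Repository:
    """Hypothetical in-memory store that matches entries by their norm value."""

    def __init__(self) -> None:
        self._by_norm = {}

    def add(self, text: str) -> PolyString:
        value = PolyString.of(text)
        if value.norm in self._by_norm:
            # Uniqueness is checked on norm, so near-duplicates are rejected
            existing = self._by_norm[value.norm].orig
            raise ValueError(f"{text!r} conflicts with {existing!r}")
        self._by_norm[value.norm] = value
        return value

    def search(self, query: str) -> Optional[str]:
        # Normalize the query, match on norm, return the orig form
        hit = self._by_norm.get(normalize(query))
        return hit.orig if hit else None


repo = Repository()
entry = repo.add("Coup D'état")
print(repo.search(" Coup d`etat "))  # orig form of the matching entry
print(repo.search("coup d etat"))    # None, because the norm values differ
```

Matching on a precomputed norm value needs nothing more than an ordinary equality comparison, which is why the approach is portable across databases.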

Of course, this capability is somewhat limited. Searching for "coup d etat" is unlikely to provide any meaningful results. There is no synonym matching or stemming, so it makes no sense to search for "putsch". But this is a nice, simple and elegant method for many practical use cases.

Normalizers

The effectiveness of the method depends heavily on the normalization algorithm. Given the right algorithm, the system will work flawlessly. However, a wrong normalization algorithm may cause a lot of problems. Therefore, since midPoint 3.8 the normalization algorithm is configurable and there is an option for a completely custom normalization algorithm.

All normalization algorithms bundled with midPoint go through the same set of steps:

Trimming (trim): removing whitespace at the start and the end of the string.

Decomposition (nfkd): composed characters (such as é) are decomposed into their constituent parts (the base letter e and a combining acute accent). Unicode Normalization Form Compatibility Decomposition (NFKD) is used for this purpose.

Lowercase transform (lowercase): all characters are transformed to their lowercase equivalents.
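The steps above can be sketched as a small pipeline using Python's standard unicodedata module. The keep_alphanumeric function and its position in the pipeline are assumptions made for illustration; they approximate an alphanumeric core algorithm and are not taken from midPoint's code.

```python
import re
import unicodedata


def trim(s: str) -> str:
    # Remove whitespace at the start and the end of the string
    return s.strip()


def nfkd(s: str) -> str:
    # Decompose composed characters into base letter plus combining marks
    return unicodedata.normalize("NFKD", s)


def lowercase(s: str) -> str:
    return s.lower()


def keep_alphanumeric(s: str) -> str:
    # Assumed core step: keep Latin letters, digits and whitespace only,
    # which also drops the combining marks produced by nfkd
    return re.sub(r"[^A-Za-z0-9\s]", "", s)


def normalize(s: str) -> str:
    for step in (trim, nfkd, keep_alphanumeric, lowercase):
        s = step(s)
    return s


decomposed = nfkd("\u00e9")  # precomposed é
print([unicodedata.name(c) for c in decomposed])
print(normalize("  Coup D'\u00e9tat "))
```

Decomposition has to run before the alphanumeric filter: without it, the precomposed é would be removed as a whole instead of being reduced to e.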

All these steps are applied by default, but individual steps can be disabled in the normalizer configuration, so the behavior of the normalizer can be customized. There are also three options for the core normalization algorithm:

| Normalizer class | Description | Example transforms (default configuration) |
|---|---|---|
| AlphanumericPolyStringNormalizer (default) | Keeps only (Latin) alphanumeric characters. Due to NFKD decomposition, composed national characters are converted to their base Latin characters. | |

Normalizer Configuration

Individual processing steps can be turned off by setting the corresponding elements (trim, nfkd, trimWhitespace, lowercase) to false. A normalizer can be selected by placing its class name in the className element. The className element may also contain the fully-qualified class name of a custom normalizer implementation (note: this functionality is EXPERIMENTAL).
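The effect of switching individual steps off can be illustrated with a short Python sketch. The NormalizerConfig class is hypothetical; its fields merely mirror the element names above. The alphanumeric filter and the interpretation of trimWhitespace as collapsing runs of whitespace into a single space are assumptions, not midPoint's actual implementation.

```python
import re
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizerConfig:
    # Hypothetical mirror of the configuration elements; all steps on by default
    trim: bool = True
    nfkd: bool = True
    trim_whitespace: bool = True
    lowercase: bool = True


def normalize(text: str, cfg: NormalizerConfig = NormalizerConfig()) -> str:
    s = text
    if cfg.trim:
        s = s.strip()
    if cfg.nfkd:
        s = unicodedata.normalize("NFKD", s)
    # Assumed alphanumeric core step: keep Latin letters, digits, whitespace
    s = re.sub(r"[^A-Za-z0-9\s]", "", s)
    if cfg.trim_whitespace:
        # Assumption: runs of whitespace collapse into a single space
        s = re.sub(r"\s+", " ", s).strip()
    if cfg.lowercase:
        s = s.lower()
    return s


text = "  Coup   D'\u00e9tat "
print(normalize(text))
print(normalize(text, NormalizerConfig(lowercase=False)))
print(normalize(text, NormalizerConfig(nfkd=False)))
```

In this sketch, disabling nfkd makes the alphanumeric filter drop the precomposed é entirely, which shows how strongly the individual steps interact.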

Normalizers are initialized at system startup. The mechanism for handling changes of the normalizer configuration at runtime is very limited, therefore all midPoint nodes must be restarted if the normalizer configuration is changed. Also, normalizer reconfiguration affects only values that are created or updated after the configuration change. Existing values in the repository are unaffected. Therefore, for the change to take full effect, all the data needs to be updated (e.g. by exporting and re-importing the data).
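The following Python sketch illustrates why existing values need to be updated. Both normalizers and the stored list are hypothetical stand-ins: the "old" configuration only trims and lowercases, the "new" one additionally decomposes characters and drops combining marks, and recomputing norm in a loop stands in for the export and re-import.

```python
import unicodedata


def old_normalize(text: str) -> str:
    # Hypothetical earlier configuration: trim and lowercase only
    return text.strip().lower()


def new_normalize(text: str) -> str:
    # Hypothetical new configuration: also decompose and drop combining marks
    decomposed = unicodedata.normalize("NFKD", text.strip())
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


# Entry stored while the old configuration was active
stored = [{"orig": "Caf\u00e9", "norm": old_normalize("Caf\u00e9")}]


def search(query: str, normalizer) -> list:
    wanted = normalizer(query)
    return [e["orig"] for e in stored if e["norm"] == wanted]


# After reconfiguration, queries use the new normalizer but the stored
# norm value was produced by the old one, so the entry is not found
before = search("cafe", new_normalize)

# Recompute norm for every stored value (stands in for export and re-import)
for entry in stored:
    entry["norm"] = new_normalize(entry["orig"])

after = search("cafe", new_normalize)
print(before, after)
```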