By the way: is there some equivalent to FoldString, especially "MAP_PRECOMPOSED" and "MAP_COMPOSITE"? Neither StringInfo nor TextInfo provide such a function, or?

My answer was:

The .NET Framework has something even better than FoldString here -- I'll post on it tomorrow....

But I got busy this weekend and never got around to posting the answer to the question. Sorry about that! I'll do it now (I hope Jochen did not give up on me in the interim!).

The description of FoldString from the Platform SDK: The FoldString function maps one string to another, performing a specified transformation option.

There are many different supported transformations:

MAP_FOLDCZONE: Fold compatibility zone characters into standard Unicode equivalents. For information about compatibility zone characters, see the following Remarks section.

MAP_FOLDDIGITS: Map all digits to Unicode characters 0 through 9.

MAP_PRECOMPOSED: Map accented characters to precomposed characters, in which the accent and base character are combined into a single character value. This value cannot be combined with MAP_COMPOSITE.

MAP_COMPOSITE: Map accented characters to composite characters, in which the accent and base character are represented by two character values. This value cannot be combined with MAP_PRECOMPOSED.

MAP_EXPAND_LIGATURES: Expand all ligature characters so that they are represented by their two-character equivalent. For example, the ligature 'æ' expands to the two characters 'a' and 'e'. This value cannot be combined with MAP_PRECOMPOSED or MAP_COMPOSITE.

Digit folding functionality is covered by the methods I described in CharUnicodeInfo, especially GetDecimalDigitValue. Some of the other methods will do an even fuller job, supporting many of the non-decimal digit numbers, which FoldString never handled....
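A minimal sketch of digit folding built on GetDecimalDigitValue might look like the following. FoldDigits is a made-up helper name, not a Framework method, and the sketch walks UTF-16 code units one at a time, so it assumes the digits of interest are in the Basic Multilingual Plane.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DigitFolding
{
    // Hypothetical helper (not part of the Framework): replaces every character
    // that Unicode classifies as a decimal digit with its ASCII counterpart.
    public static string FoldDigits(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // GetDecimalDigitValue returns -1 when c is not a decimal digit.
            int value = CharUnicodeInfo.GetDecimalDigitValue(c);
            sb.Append(value >= 0 ? (char)('0' + value) : c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // ARABIC-INDIC DIGITS ONE, TWO, THREE followed by FULLWIDTH DIGIT FOUR.
        string mixed = "\u0661\u0662\u0663\uFF14";
        Console.WriteLine(FoldDigits(mixed)); // prints 1234
    }
}
```

Characters that are not decimal digits pass through unchanged, which keeps the helper safe to run over ordinary text.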

The ligature expansion functionality does not really have an equivalent right now, though ligatures do work well in comparisons, whenever they need to.

But the other three mapping types see new life in Whidbey, with tables that cover the Unicode 4.0 version of normalization, as described in UAX #15, UNICODE NORMALIZATION FORMS.

How does it work? Well, in the Whidbey release of the .NET Framework, two new methods were added to System.String:

bool IsNormalized(NormalizationForm normalizationForm)

string Normalize(NormalizationForm normalizationForm)

The functionality of the methods is obvious enough from the names -- the first checks if the string is in a specified normalization form, and the second puts it in a specified form.
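As a rough illustration of the two methods, here is a minimal sketch. The class name NormalizeDemo is made up for the example, and pairing Form C with MAP_PRECOMPOSED and Form D with MAP_COMPOSITE is only the closest analogy, not an exact equivalence.

```csharp
using System;
using System.Text;

public static class NormalizeDemo
{
    public static void Main()
    {
        // The same letter written two ways: precomposed (U+00E9) and
        // composite ('e' followed by U+0301 COMBINING ACUTE ACCENT).
        string precomposed = "\u00E9";
        string composite = "e\u0301";

        Console.WriteLine(precomposed.IsNormalized(NormalizationForm.FormC)); // True
        Console.WriteLine(composite.IsNormalized(NormalizationForm.FormC));   // False

        // Roughly the MAP_PRECOMPOSED direction: composite -> precomposed.
        string toC = composite.Normalize(NormalizationForm.FormC);
        // Roughly the MAP_COMPOSITE direction: precomposed -> composite.
        string toD = precomposed.Normalize(NormalizationForm.FormD);

        // The == operator on strings is an ordinal comparison.
        Console.WriteLine(toC == precomposed); // True
        Console.WriteLine(toD == composite);   // True
    }
}
```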

In fact the only real difference is that FoldString only does part of the job, because the FoldString tables do not have all of the mappings that are in Unicode, a point I discussed previously. But these normalization methods do. So you can do all the mapping you need to in order to take equivalent forms of the same string and put them into one consistent form.

Since the "default" form used in most situations is Form C, there are also overloads of the two methods with no NormalizationForm parameter that use Form C automatically. In many cases, that is the one you may want to use. Making Form C the "default" normalization form is not an arbitrary decision -- almost all of the keyboards that ship in Windows input text in Form C already (though of course keyboards created by MSKLC, being user-created, can be in whatever form).
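A short sketch of the parameterless overloads, assuming nothing beyond the behavior described above (DefaultFormDemo is just an illustrative class name):

```csharp
using System;
using System.Text;

public static class DefaultFormDemo
{
    public static void Main()
    {
        string composite = "e\u0301"; // 'e' + COMBINING ACUTE ACCENT

        // With no argument, both methods work in terms of Form C.
        Console.WriteLine(composite.IsNormalized()); // False
        Console.WriteLine(composite.Normalize() == composite.Normalize(NormalizationForm.FormC)); // True
        Console.WriteLine(composite.Normalize().IsNormalized()); // True
    }
}
```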

Another thing to keep in mind is that text may not be in any of these forms -- for example an arbitrary string like õĥµ¨ (U+00f5 U+0068 U+0302 U+00b5 U+00a8). This string combines a precomposed character, a composite character, and two characters with compatibility decompositions (the MICRO SIGN and the DIAERESIS). It is therefore not in any one form at all. Thus IsNormalized returns false for this string in every form. But it can be normalized to give the appropriate result for each of them.
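One way to see this for the sample string is to run it through all four forms and dump the resulting code units. This is only a sketch: FormsDemo and ToCodeUnits are hypothetical names introduced for the example.

```csharp
using System;
using System.Text;

public static class FormsDemo
{
    // Hypothetical helper: renders a string as space-separated UTF-16 code units.
    public static string ToCodeUnits(string s)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            // Cast to int so the X4 (hexadecimal) format specifier applies.
            sb.AppendFormat("U+{0:X4}", (int)c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // The sample string from the text, spelled out as escapes.
        string text = "\u00F5\u0068\u0302\u00B5\u00A8";
        NormalizationForm[] forms = {
            NormalizationForm.FormC, NormalizationForm.FormD,
            NormalizationForm.FormKC, NormalizationForm.FormKD
        };

        foreach (NormalizationForm form in forms)
        {
            Console.WriteLine("{0,-6} IsNormalized={1}  ->  {2}",
                form, text.IsNormalized(form), ToCodeUnits(text.Normalize(form)));
        }
    }
}
```

Going by the Unicode character data, the normalized results should be U+00F5 U+0125 U+00B5 U+00A8 for Form C, U+006F U+0303 U+0068 U+0302 U+00B5 U+00A8 for Form D, U+00F5 U+0125 U+03BC U+0020 U+0308 for Form KC, and U+006F U+0303 U+0068 U+0302 U+03BC U+0020 U+0308 for Form KD.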


Ideally they would always compare as being equal even if the forms are different, but this is definitely not a 100% of the time result, as I pointed out a few months ago when I answered the question

Normalization and Microsoft -- whats the story?

Therefore normalization is the one way you can use to make sure that you will always get the right comparison, especially in some cases that may not ever be fully supported in comparison, like "ﷺ" (U+fdfa, a.k.a. ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM), which decomposes to: