Monday, April 4, 2016

This March 2016 I was honored to be the author of the monthly scripting competition at powershell.org. For the contest, I came up with a scenario where a system administrator was tasked with using PowerShell to check a given path and identify all the files whose names had letters (not symbols nor numbers) in the Latin-1 Supplement character block.

This scenario came in two versions: one for beginners, where competitors were allowed to write a one-liner, and one for experts, where I expected people to write a tool (in the form of an advanced function) to do the job.

In both cases I expected people to focus on understanding how regular expression engines use the Unicode character set and to use the best possible syntax to solve the puzzle. That's why I explicitly asked competitors to work with the Latin-1 Supplement character block: that was the key clue that should have pushed people to learn that Unicode is so large a character set that it has been split up into categories, and that using these categories in your regular expressions makes them more robust.

1 - OF IMPRACTICAL SOLUTIONS

Let's start by looking at some of the sample answers we got, which are not exactly what I expected:

All these answers are, to varying degrees, impractical to maintain and error-prone for a simple reason: the code points used in the code are not human-readable, so a simple typo can break the code without raising alerts.

There is also a problem of subjectivity, where each competitor decided to use different code points: 00D6 or 00FF or 00F7, to name a few examples.

So the question is: how do you decide which code points to use, and how could you have taken advantage of Unicode categories in your regular expression to write a solid answer to this puzzle?

To answer this question I will first walk you through the Unicode model and see how it is structured. You can think of Unicode as a database, maintained by an international consortium, which stores all the characters in all the existing languages.

New versions are released to reflect major changes, since new writing systems are discovered periodically and new glyphs (which are graphical representations of characters) have to be added: look for instance at those found on the 4,000-year-old Phaistos Disc:

2 - UNICODE VERSIONS, PLANES AND CODE POINTS

The first version of Unicode dates back to 1991, and since then a number of versions have followed:

1.0.0 - October 1991

2.0.0 - July 1996

3.0.0 - September 1999

4.0.0 - April 2003

5.0.0 - July 2006

6.0.0 - October 2010

7.0.0 - June 2014

8.0.0 - June 2015

The latest version, 8.0, defines a code space of 1,114,112 code points in the range 0 hex to 10FFFF hex:

(10FFFF)base16 = (1114111)base10
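We can double-check that arithmetic directly in PowerShell:

```powershell
# Convert the highest Unicode code point from hexadecimal to decimal
[Convert]::ToInt32('10FFFF', 16)   # 1114111
```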

Concerning the Windows world, the .NET Framework 4.5 conforms to the Unicode 6.0 standard, which dates from 2010, while previous versions conform to the Unicode 5.0 standard, as you can read here.

Each code point is referred to by writing "U+" followed by its hexadecimal number, where U stands for Unicode. So U+10FFFF refers to the last code point in the database.

All these code points are divided into seventeen planes, each with 65,536 elements. The first three planes are named respectively:

Basic Multilingual Plane, or BMP

Supplementary Multilingual Plane, or SMP

Supplementary Ideographic Plane, or SIP

The BMP, whose extent corresponds exactly to an unsigned 16-bit integer ([uint16]::MaxValue = 65535), covers Latin, African and Asian languages as well as a good number of symbols. So languages like English, Spanish, Italian, Russian, Greek, Ethiopic, Arabic and CJK (which stands for Chinese, Japanese and Korean) have code points assigned in this plane.

These code points are expressed as four-digit hexadecimal values, from 0000 to FFFF. So, for instance:

U+0058 is the code point for the Latin capital X

U+03A9 is the code point for the Greek capital letter Omega

U+221A is the code point for the square root symbol

U+0040 is the code point for the Commercial At symbol

U+9999 is the code point for the Han character meaning 'fragrant, sweet smelling, incense'

U+0033 is the code point for the digit three

So, the letters, digits and symbols we widely use all have their code points in the Unicode database.

The .NET Framework uses the System.Char structure to represent a Unicode character.

3.1 - HOW TO CONVERT A GLYPH TO A UNICODE CODE POINT

There is a simple way in PowerShell to find the code point of a given glyph.

First you have to take the given character and find its numeric value, using typecasting on the fly:

$char = 'X'
[int][char]$char

This is the equivalent of the ORD function you have in many other languages (Delphi, PHP, Perl, etc.). Then, using the format operator with the X format string, convert it to hexadecimal:

'{0:X4}' -f [int][char]$char

Since each Unicode code point is referred to with a U+, we just have to add it to our string through concatenation:

'U+{0:X4}' -f [int][char]$char

3.2 - HOW TO CONVERT A UNICODE CODE POINT TO A GLYPH

Now, if you want to get the glyph of a given code point, you have to reverse your code:

First you have to ask PowerShell to call ToInt32 to convert the hex value (base-16) to a decimal:

[Convert]::ToInt32('0058', 16)

Then a step is required to cast the decimal to a char:

[Convert]::ToChar([Convert]::ToInt32('0058', 16))

So, if we go back to the examples we saw before, we can use a loop to convert all the four-digit hex values of the Basic Multilingual Plane to their corresponding glyphs.
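As a minimal sketch (using the sample code points from above), such a loop could look like this:

```powershell
# Convert a handful of four-digit hex code points to their glyphs
'0058', '03A9', '221A', '0040', '9999', '0033' | ForEach-Object {
    'U+{0} -> {1}' -f $_, [Convert]::ToChar([Convert]::ToInt32($_, 16))
}
```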

3.3 - OF CODE POINTS BEYOND THE BASIC MULTILINGUAL PLANE

At this point it is interesting to know that Unicode adopts UTF-16 as the standard encoding for everything inside the Basic Multilingual Plane, since, as we have seen, most living languages have all (or at least most) of their glyphs within the range 0 - 65535.

For characters beyond the first Unicode plane, that is, those whose code point is greater than 65535 and hence can't fit in a 16-bit integer (a word), we can use two encodings: UTF-32 or 16-bit surrogate pairs. The latter is a mechanism where a glyph is represented by a first (high) surrogate 16-bit code value in the range U+D800 to U+DBFF and a second (low) surrogate 16-bit code value in the range U+DC00 to U+DFFF. Using this mechanism, UTF-16 can support all 1,114,112 potential Unicode characters (2^16 * 17 planes).
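We can see the surrogate pair mechanism at work in PowerShell. This sketch shows that the MUSICAL SYMBOL G CLEF (U+1D11E, which we will meet again below) is stored as two 16-bit code units:

```powershell
# U+1D11E is beyond the BMP, so UTF-16 stores it as a surrogate pair
$glyph = [char]::ConvertFromUtf32(0x1D11E)
$glyph.Length                                        # 2 (two [char] code units)
'{0:X4} {1:X4}' -f [int]$glyph[0], [int]$glyph[1]    # D834 DD1E
```

D834 is the high surrogate and DD1E the low one, both falling in the ranges described above.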

In any case the Windows console is not capable of showing non-BMP glyphs, even when a font like Code2001 is installed. Let's see this in practice. In the example below I am outputting the glyph for the commercial at (which is in the BMP) starting from its UTF-32 representation, using the ConvertFromUtf32 method:

[char]::ConvertFromUtf32(0x00000040)
@

In this other example below I am trying hard to print to screen the glyph for the MUSICAL SYMBOL G CLEF, which was added in Unicode 3.1 and belongs to the Supplementary Multilingual Plane, but I am only able to get a square box (which is used for all characters for which the font does not have a glyph):

[char]::ConvertFromUtf32(0x0001D11E)
𝄞

Now that you are confident with code points, it is time to step up your game and get an understanding of some Unicode properties which are useful for solving our puzzle: General Category, Script and Block.

4.1 - UNICODE PROPERTIES: GENERAL CATEGORY

Each code point is kind of an object that has a property named General Category. The seven major categories are: Letter, Mark, Number, Punctuation, Symbol, Separator, and Other.

Within these seven categories, there are the following subdivisions:

{L} or {Letter}

{Ll} or {Lowercase_Letter}

{Lu} or {Uppercase_Letter}

{Lt} or {Titlecase_Letter}

{L&} or {Cased_Letter}

{Lm} or {Modifier_Letter}

{Lo} or {Other_Letter}

{M} or {Mark}

{Mn} or {Non_Spacing_Mark}

{Mc} or {Spacing_Combining_Mark}

{Me} or {Enclosing_Mark}

{Z} or {Separator}

{Zs} or {Space_Separator}

{Zl} or {Line_Separator}

{Zp} or {Paragraph_Separator}

{S} or {Symbol}

{Sm} or {Math_Symbol}

{Sc} or {Currency_Symbol}

{Sk} or {Modifier_Symbol}

{So} or {Other_Symbol}

{N} or {Number}

{Nd} or {Decimal_Digit_Number}

{Nl} or {Letter_Number}

{No} or {Other_Number}

{P} or {Punctuation}

{Pd} or {Dash_Punctuation}

{Ps} or {Open_Punctuation}

{Pe} or {Close_Punctuation}

{Pi} or {Initial_Punctuation}

{Pf} or {Final_Punctuation}

{Pc} or {Connector_Punctuation}

{Po} or {Other_Punctuation}

{C} or {Other}

{Cc} or {Control}

{Cf} or {Format}

{Co} or {Private_Use}

{Cs} or {Surrogate}

{Cn} or {Unassigned}

The Char.GetUnicodeCategory and CharUnicodeInfo.GetUnicodeCategory methods can be used to return the General Category property of a char.
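For example:

```powershell
# Char.GetUnicodeCategory returns a value of the UnicodeCategory enumeration
[char]::GetUnicodeCategory('é')   # LowercaseLetter
[char]::GetUnicodeCategory('3')   # DecimalDigitNumber
[char]::GetUnicodeCategory('@')   # OtherPunctuation
```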

As you can see, Unicode also brings interesting possibilities to regular expressions. Once you know that each Unicode character belongs to a certain category, you can match a single character against a category with \p (in lowercase) in your regular expression:
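For instance, both of these return True, since é belongs to the Letter category and, more precisely, to its Lowercase_Letter subdivision:

```powershell
# é is a letter
'é' -match '\p{L}'    # True
# é is, more precisely, a lowercase letter
'é' -match '\p{Ll}'   # True
```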

You can also match a single character not belonging to a category with \P (uppercase):

#X is not a digit
'X' -match "(\P{N})"
True
#3 is not a letter
3 -match "(\P{L})"
True

4.2 - UNICODE PROPERTIES: SCRIPT AND BLOCK

Other useful properties of a character are Script and Block: each character belongs to a Script and to a Block.

A Script is a group of code points defining a given human writing system, so we can generally think of a script as a language. Though many scripts (like Cherokee, Lao or Thai) correspond to a single natural language, others (like Latin) are common to multiple languages (Italian, French, English...). Code points in a Script can be scattered and don't necessarily form a contiguous range.

The list of the existing Scripts is kept by the Unicode Consortium in the Unicode Character Database (UCD), which consists of a number of textual data files listing Unicode character properties and related data.

5.1 - HOW TO GET THE UNICODE SCRIPT FOR A CHARACTER

Now, to see what Script a char belongs to, I simply have to find its numeric value, then check whether it falls within one of the Script's code point ranges (converted from hex to decimal) and return the Script name:
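A minimal sketch of such a lookup could look like the function below; the name Get-UnicodeScript and the handful of hard-coded ranges (a tiny excerpt of the ranges listed in the UCD Scripts.txt file) are illustrative assumptions, not the exact code from the contest:

```powershell
# Sketch: map a character to a Script name by code point range lookup.
# The ranges below are a tiny, simplified excerpt of UCD Scripts.txt.
function Get-UnicodeScript {
    param([char]$Character)
    $ranges = @(
        @{ Script = 'Latin';    Start = 0x0041; End = 0x005A },
        @{ Script = 'Latin';    Start = 0x0061; End = 0x007A },
        @{ Script = 'Latin';    Start = 0x00C0; End = 0x00D6 },
        @{ Script = 'Latin';    Start = 0x00D8; End = 0x00F6 },
        @{ Script = 'Latin';    Start = 0x00F8; End = 0x00FF },
        @{ Script = 'Greek';    Start = 0x0370; End = 0x03FF },
        @{ Script = 'Cyrillic'; Start = 0x0400; End = 0x04FF }
    )
    $codePoint = [int]$Character
    foreach ($range in $ranges) {
        if ($codePoint -ge $range.Start -and $codePoint -le $range.End) {
            return $range.Script
        }
    }
    return 'Unknown'
}

Get-UnicodeScript 'é'   # Latin
Get-UnicodeScript 'Ω'   # Greek
```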

As you can see, in a few lines of code we added the ability to compare a character against a Unicode Script name, which is something the .NET regex engine does not support out of the box.

5.2 - HOW TO GET THE UNICODE BLOCK FOR A CHARACTER

The next step is to see how we can determine which Block a given character belongs to. This is easier than getting the Script because, while .NET doesn't support regex matches against Script names, it natively supports running matches against Block names.

Just remember to prepend 'Is' to the Block name: not all Unicode regex engines use the same syntax to match Unicode blocks and, while Perl uses the «\p{InBlock}» syntax, .NET uses «\p{IsBlock}» instead:
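A couple of quick examples, both of which return True:

```powershell
# é (U+00E9) sits in the Latin-1 Supplement block
'é' -match '\p{IsLatin-1Supplement}'   # True
# X (U+0058) sits in the Basic Latin block
'X' -match '\p{IsBasicLatin}'          # True
```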

Now that we are proficient with Unicode in our regexes, let's see how we could have easily solved the puzzle.

I asked to detect all filenames that had letters (not symbols nor numbers) in the Latin-1 Supplement character block.

The Latin-1 Supplement is the second Unicode block in the Basic Multilingual Plane. It ranges from U+0080 (decimal 128) to U+00FF (decimal 255) and contains 64 code points in the Latin Script and 64 code points in the Common Script. Basically it contains some currency symbols (Yen, Pound), a few math signs (multiplication, division) and the lowercase and uppercase Latin letters that have diacritics.

What's a diacritic you ask? The answer comes from Wikipedia:

Diacritic /daɪ.əˈkrɪtɪk/ – also diacritical mark, diacritical point, or diacritical sign – is a glyph added to a letter, or basic glyph. The term derives from the Greek διακριτικός (diakritikós, "distinguishing"), which is composed of the ancient Greek διά (diá, "through") and κρίνω (krínō, "to separate"). Diacritic is primarily an adjective, though sometimes used as a noun, whereas diacritical is only ever an adjective. Some diacritical marks, such as the acute ( ´ ) and grave ( ` ), are often called accents. Diacritical marks may appear above or below a letter, or in some other position such as within the letter or between two letters. The main use of diacritical marks in the Latin script is to change the sound-values of the letters to which they are added.

Since a Unicode Block exists listing all of the diacritical marks, they can be shown with a one-liner:
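That block is Combining Diacritical Marks (U+0300 to U+036F), so one possible one-liner (a sketch, not necessarily the exact one from the original post) is:

```powershell
# List every code point in the Combining Diacritical Marks block
0x0300..0x036F | ForEach-Object { '{0} U+{1:X4}' -f [char]$_, $_ }
```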

Subjectivity is gone! At the same time, I asked to include only the filenames containing Letters from that Unicode Block, not Symbols nor Digits. Here's where the General Category property we saw above comes to the rescue: I can force the regex engine to include all letters (\p{L}), and exclude digits (\P{N}), punctuation (\P{P}), symbols (\P{S}) and separators (\P{Z}).

Concerning the expression, I am using here a positive lookahead assertion, (?=), which is a non-consuming regular expression construct. I can do this as many times as I want, and it will act as a logical "and" between the different categories I am passing to \p or \P.
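Putting the lookaheads described above together, the full expression looks like this:

```powershell
# True: é is a letter inside the Latin-1 Supplement block
'é' -match "(?=\p{IsLatin-1Supplement})(?=\p{L})(?=\P{N})(?=\P{P})(?=\P{S})(?=\P{Z})"
# False: ÷ (U+00F7) is in the block, but it is a math symbol, not a letter
'÷' -match "(?=\p{IsLatin-1Supplement})(?=\p{L})(?=\P{N})(?=\P{P})(?=\P{S})(?=\P{Z})"
```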

7 - THE SOLUTION IS...

For sure this can be shortened to

'é' -match "(?=\p{IsLatin-1Supplement})(?=\p{L})"

since there are no code points which are at the same time letters and numbers, or letters and symbols, etc.

To sum it up, getting a list of all Latin letters with diacritics is as simple as typing the following line:
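One possible one-liner (a sketch: it walks the block's range, U+0080 to U+00FF, and keeps only the code points whose General Category is Letter):

```powershell
# All Letter-category characters in the Latin-1 Supplement block
0x0080..0x00FF | ForEach-Object { [char]$_ } | Where-Object { $_ -match '\p{L}' }
```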

I hope you enjoyed this explanation. If you are a Unicode guru and you find something incorrect, do not hesitate to drop a comment and I'll update the post. Thanks again to PowerShell.org for giving me the chance to be part of a larger community.