Pages

Friday, January 29, 2010

Unicode is a messy thing. You wouldn't think that something so trivial as representing all the text in all the languages in all the world, keeping track of the relationships between various related symbols and characters, and creating a clear delineation between languages and dialects would be such a hassle, but apparently it is. Pause for a second while I look for a decent way to signal my sarcasm.

Unicode text doesn't have a single format. In fact, the same string of text can be represented in a number of different normalization forms. These normalization forms can optimize for byte-width by combining character codes with modifiers like accents into a single symbol, or breaking these complex characters into a sequence of a character and modifiers. Current Unicode normalization forms are called NFC, NFKC, NFD, and NFKD.

The Parrot design documents speak of a new, special normalization form that can be used internally to Parrot to implement some of our internal string algorithms. We call this new normalization form "NFG" because it normalizes over individual graphemes. In NFG, each grapheme (which is a fancy word for a single symbol) corresponds to a single sequence of characters and composing modifers. In other normalization forms these sequences don't need to be unique.

NFG has some interesting properties that make it useful internally to Parrot as an intermediate form for string operations. Being able to convert strings from multiple encodings and charsets into a single unique character sequence could be a big help in a lot of ways. At the moment when two strings are combined together, Parrot tries to convert directly from the more restrictive encoding/charset to the less restrictive one. For N different encoding/charset combinations, we have potentially N2 such transformations. This is non-ideal.

With NFG we only need 2N transformations: One to convert each encoding/charset combo to and from NFG. Strings are converted to NFG, composed together, and then converted out into whatever format they are needed in. It's really a nice system on paper and I have high hopes that it would be a big benefit to Parrot in terms of scalability and performance.

Parrot needs two things to satisfy the letter of PDD28: A comprehensive string subsystem cleanup and refactor, and implementation of the new NFG normalization form. The later is probably more suitable for a GSoC project, while the former would be a good job for a prospective student to start looking at now to become familiarized with the system.