Tuesday, 8 February 2011

Sometimes we want to convert customer-entered data to XML. Sometimes we want to use it for an element name. Obviously it'll need some sanitising, so what should we escape? The XML specification (a W3C Recommendation, not an RFC) is a wee bit twisty on this question. Start tags are built on a Name, which it defines roughly, i.e. ignoring so-called combining characters and extenders, like this:

    Name ::= (Letter | '_' | ':') (Letter | Digit | '.' | '-' | '_' | ':')*

That's similar to the definition of an identifier in many languages, but with the addition of a few specific punctuation marks. The twist comes when you consider namespaces. The XML Names recommendation states that these assign a meaning to names containing colon characters, and that authors should therefore not use the colon in XML names except for namespace purposes. Even though XML processors must still accept the colon as a valid name character, as per the above syntax, it gives off the odour of a practice to avoid. So we go with this:

    Name ::= (Letter | '_') (Letter | Digit | '.' | '-' | '_')*

That's right. Our element names start with a letter or underscore, then continue with any number of these, possibly in combination with digits, periods, and hyphens. To put it another way (inexactly, but in practice acceptably, for my purpose): an element name is any nonempty sequence of word characters (letters, numbers, underscores), periods, and hyphens; and it must start with either a letter or an underscore.

In the interests of localization, rather than the parochial a-zA-Z_0-9, we should use the Regex word character class \w to represent, erm, word characters. That just leaves the period and hyphen to be mopped up in the main sequence. Similarly, when it comes to specifying the initial letter, rather than a-zA-Z, we should use the letter class \p{L} built for just this purpose:
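Here is a minimal sketch of how those two patterns might be put to work. The class name XmlNameSanitiser, the method ToElementName, and the choice of an underscore as the replacement character are all my own assumptions for illustration; only the character classes come from the discussion above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlNameSanitiser
{
    // First pattern: anything that is NOT a hyphen, a period, or a word character.
    private static readonly Regex InvalidChar = new Regex(@"[^-.\w]");

    // Second pattern: the name must begin with a letter or an underscore.
    private static readonly Regex ValidStart = new Regex(@"^[\p{L}_]");

    // Hypothetical helper: swaps every invalid character for an underscore
    // (an assumed replacement), then prefixes an underscore if the result
    // does not start with a letter or underscore. Empty input yields "_".
    public static string ToElementName(string input)
    {
        string name = InvalidChar.Replace(input ?? string.Empty, "_");
        return ValidStart.IsMatch(name) ? name : "_" + name;
    }
}
```

With this sketch, ToElementName("2nd-line.address") gives "_2nd-line.address", and ToElementName("Order Total") gives "Order_Total". Like the rule it implements, it is inexact: \w in .NET admits a few characters, such as combining marks, that we agreed to ignore above.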

A point to note about the first pattern [^-.\w] is that neither the hyphen nor the period need be escaped. Within brackets, the period represents itself, rather than being a wildcard; and the hyphen is similarly literal (as opposed to indicating a range) when it appears as the first item in a set, which here means immediately after the negating caret.
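If you'd rather see that than take my word for it, the following small check exercises both behaviours. The class and method names are invented for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketDemo
{
    // Inside brackets the period matches a literal period only, not any character.
    public static bool PeriodIsLiteral() =>
        Regex.IsMatch(".", "[.]") && !Regex.IsMatch("a", "[.]");

    // A leading hyphen is literal, so [-ac] does not match "b";
    // between two characters it forms a range, so [a-c] does.
    public static bool LeadingHyphenIsLiteral() =>
        Regex.IsMatch("-", "[-ac]")
        && !Regex.IsMatch("b", "[-ac]")
        && Regex.IsMatch("b", "[a-c]");
}
```

Both methods should return true.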

Other Useful Character Classes

Why yes, there are some others, I'm glad you asked. These two are probably the droids you're looking for: \p{Lu} for uppercase letters, and \p{Ll} for their lowercase comrades. For the full story about Character Classes in C#, go to http://msdn.microsoft.com/en-us/library/20bw873z.aspx.
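As a quick illustration of those two classes, here is a sketch that counts uppercase and lowercase letters in a string. CaseDemo and its methods are hypothetical names of mine, not anything from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaseDemo
{
    // Counts characters in the Unicode "Letter, uppercase" category.
    public static int CountUpper(string s) => Regex.Matches(s, @"\p{Lu}").Count;

    // Counts characters in the Unicode "Letter, lowercase" category.
    public static int CountLower(string s) => Regex.Matches(s, @"\p{Ll}").Count;
}
```

For "\u00C9lan Vital" (that's Élan with a capital E-acute), CountUpper returns 2 and CountLower returns 7, so the accented capital is counted along with the plain V.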
