Paraesthesia: .NET Development and Some Pictures of My Cat

JavaScript and Unicode Character Validation

I’m struggling right now with the fact that JavaScript/ECMAScript
doesn’t allow for Unicode character classes in regular expressions. For
example, if I want to set up a client-side JavaScript validation
expression on a numeric field, I’d want to do something like ^\d+$ as
my regular expression, right? Match one or more digits?

The problem is that in JavaScript, \d expands out to [0-9], which
technically isn’t all of the digits, if you think about all of the other
alphabets out there that exist and don’t use 0 through 9 to indicate
numbers.
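To make that concrete, here is a quick sketch you can paste into a browser console or Node. The variable names are mine, purely for illustration, and the Arabic-Indic digits are just one example of decimal digits outside 0 through 9.

```javascript
// \d in a JavaScript regular expression only matches the ASCII digits 0-9.
const digitsOnly = /^\d+$/;

const asciiDigits = "12345";
// Arabic-Indic digits one, two, three (U+0661, U+0662, U+0663)
const arabicIndicDigits = "\u0661\u0662\u0663";

console.log(digitsOnly.test(asciiDigits));       // true
console.log(digitsOnly.test(arabicIndicDigits)); // false
```

Both strings are "all digits" as far as Unicode is concerned, but only the first one passes validation.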

In .NET, they solve this by mapping to Unicode character classes. So
\d maps to \p{Nd}, which is the Unicode character class for digits.
Much more global, right? So how do you do that on the client side?

Well, I figure you have to expand the character classes on the server
side and then feed those to the client. JavaScript lets you specify a
character by its four-digit hexadecimal code, so you can say like
\uFFFF or whatever to name a particular character. So you need to
take \d and expand it to the full set of Unicode digit characters.
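As a small sketch of what that looks like on the client, here is a hand-written character class covering just two digit ranges by code point. The name twoScripts is made up for this example, and the class is deliberately incomplete.

```javascript
// A character class listing two digit ranges by code point:
// U+0030-U+0039 (ASCII 0-9) and U+0660-U+0669 (Arabic-Indic digits).
const twoScripts = /^[\u0030-\u0039\u0660-\u0669]+$/;

console.log(twoScripts.test("42"));           // true
console.log(twoScripts.test("\u0664\u0662")); // true (Arabic-Indic four, two)
console.log(twoScripts.test("4a"));           // false
```

Writing every range out by hand like this doesn't scale, which is why the expansion is better generated than typed.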

Using \d as our example, a C# snippet that expands the digits looks
like this:

    // Requires: using System;

    static void Main(string[] args)
    {
        string Nd = UnicodeExpansion(System.Globalization.UnicodeCategory.DecimalDigitNumber);
        Console.WriteLine(Nd);
        Console.ReadLine();
    }

    /// <summary>
    /// Expands a Unicode character set into an ECMAScript compatible character
    /// range string.
    /// </summary>
    /// <param name="category">
    /// The Unicode character category to expand.
    /// </param>
    /// <returns>
    /// A <see cref="System.String" /> that can be used in an ECMAScript regular
    /// expression.
    /// </returns>
    /// <remarks>
    /// <para>
    /// ECMAScript (JavaScript) does not inherently understand Unicode in regular
    /// expressions, which results in incorrect validation when using character
    /// classes (\w, \s, \d, etc.).
    /// </para>
    /// <para>
    /// This method expands a <see cref="System.Globalization.UnicodeCategory" />
    /// into a string that can be used in an ECMAScript regular expression. For
    /// example, the category <see cref="System.Globalization.UnicodeCategory.LetterNumber" />
    /// expands to <c>\u2160-\u2183\u3007\u3021-\u3029\u3038-\u303a</c>.
    /// </para>
    /// </remarks>
    public static string UnicodeExpansion(System.Globalization.UnicodeCategory category)
    {
        // The fully expanded block of characters
        string expansion = "";

        // Low-end of the character block
        int blockLow = -1;

        // High-end of the character block
        int blockHigh = -1;

        for (int charVal = 0; charVal <= Char.MaxValue; charVal++)
        {
            // Get the category of the current character
            System.Globalization.UnicodeCategory charCat =
                Char.GetUnicodeCategory(Convert.ToChar(charVal));

            // Ignore characters that don't match the category.
            if (charCat != category)
            {
                continue;
            }

            if (blockLow == -1)
            {
                // Handle the very first block
                blockLow = charVal;
                blockHigh = charVal;
            }
            else if (blockHigh + 1 != charVal)
            {
                // charVal skipped some characters, so the current block is
                // complete; write it to the expansion string
                if (blockLow == blockHigh)
                {
                    // This is a one-character block
                    expansion += String.Format(@"\u{0:x4}", blockLow);
                }
                else
                {
                    // This is a multi-char block
                    expansion += String.Format(@"\u{0:x4}-\u{1:x4}", blockLow, blockHigh);
                }

                // Start a new block
                blockLow = charVal;
                blockHigh = charVal;
            }
            else
            {
                // We're still in the same block; increment the high end of the block.
                blockHigh = charVal;
            }
        }

        // The last block is never written inside the loop, so write it now
        // (unless no character matched the category at all).
        if (blockLow != -1)
        {
            if (blockLow == blockHigh)
            {
                // This is a one-character block
                expansion += String.Format(@"\u{0:x4}", blockLow);
            }
            else
            {
                // This is a multi-char block
                expansion += String.Format(@"\u{0:x4}-\u{1:x4}", blockLow, blockHigh);
            }
        }

        return expansion;
    }

Which means that rather than ^\d+$ to validate, you’d use
^[\u0030-\u0039\u0660-\u0669\u06f0-\u06f9\u0966-\u096f\u09e6-\u09ef\u0a66-\u0a6f\u0ae6-\u0aef\u0b66-\u0b6f\u0be7-\u0bef\u0c66-\u0c6f\u0ce6-\u0cef\u0d66-\u0d6f\u0e50-\u0e59\u0ed0-\u0ed9\u0f20-\u0f29\u1040-\u1049\u1369-\u1371\u17e0-\u17e9\u1810-\u1819\uff10-\uff19]+$.
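On the client, the expansion string can be dropped straight into a RegExp. This is only a sketch: buildClassValidator and ndExpansion are names I made up, and the string below is cut down to three of the ranges above to keep it readable. How the server actually hands the string to the page is up to you.

```javascript
// Hypothetical: an expansion string as the server might emit it,
// shortened here to three of the ranges from the full digit expansion.
const ndExpansion = "\\u0030-\\u0039\\u0660-\\u0669\\uff10-\\uff19";

// Wraps an expansion string in a "one or more of these" validation pattern.
function buildClassValidator(expansion) {
  return new RegExp("^[" + expansion + "]+$");
}

const isDigits = buildClassValidator(ndExpansion);

console.log(isDigits.test("2005"));                     // true
console.log(isDigits.test("\uff12\uff10\uff10\uff15")); // true (fullwidth 2005)
console.log(isDigits.test("20x5"));                     // false
```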

I’m using numbers as my example here, though the same approach applies
to letters or any other character class. For instance, in JavaScript
\w maps to [a-zA-Z_0-9], which is obviously not all the possible
letters out there.
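Here is the same problem with \w, as a rough sketch. The widened class is not a full letter expansion; it is a hand-picked fragment that only adds the accented Latin-1 letters, and the names asciiWord and latin1Word are mine.

```javascript
// \w only covers ASCII letters, digits, and underscore.
const asciiWord = /^\w+$/;

// Hand-picked fragment: ASCII word characters plus the Latin-1 letters.
// The gaps at U+00D7 (multiplication sign) and U+00F7 (division sign)
// are intentional, because those two are symbols rather than letters.
const latin1Word = /^[a-zA-Z_0-9\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff]+$/;

console.log(asciiWord.test("caf\u00e9"));  // false ("café")
console.log(latin1Word.test("caf\u00e9")); // true
console.log(latin1Word.test("3\u00d74"));  // false ("3×4")
```

Gaps like those are why the expansion has to be written as a list of ranges instead of one big span.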

You could even take this a step further and pre-calculate all of the
Unicode character blocks at application start time and cache the common
character class expansions for use in regex translation on the server
side.
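A rough sketch of what that translation step might look like follows. It is written in JavaScript only to keep the illustration self-contained; on the server it would live in whatever language you use there. The names expansionCache and translatePattern are hypothetical, the cached string is truncated, and the bracket tracking is simplistic (it assumes "[" and "]" inside a class are escaped).

```javascript
// Hypothetical cache: shorthand escape -> pre-calculated range string.
// The \d entry is truncated to two ranges for illustration.
const expansionCache = new Map([
  ["\\d", "\\u0030-\\u0039\\u0660-\\u0669"],
]);

// Rewrites cached shorthand escapes in a pattern with their expansions.
function translatePattern(pattern, cache) {
  let out = "";
  let inClass = false; // are we currently inside [...] ?
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern[i];
    if (ch === "\\" && i + 1 < pattern.length) {
      // Consume the backslash and the character it escapes as one unit.
      const escape = ch + pattern[i + 1];
      i++;
      const expansion = cache.get(escape);
      if (expansion === undefined) {
        out += escape;                  // not cached: copy through unchanged
      } else if (inClass) {
        out += expansion;               // already inside [...]: splice ranges in
      } else {
        out += "[" + expansion + "]";   // standalone: wrap in its own class
      }
      continue;
    }
    if (ch === "[" && !inClass) {
      inClass = true;
    } else if (ch === "]" && inClass) {
      inClass = false;
    }
    out += ch;
  }
  return out;
}

console.log(translatePattern("^\\d+$", expansionCache));
console.log(translatePattern("^[\\d.]+$", expansionCache));
```

An escaped backslash followed by a d is left alone, since the pair is consumed together before the d is ever looked at.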

Updated 9/9/2005 for boundary condition logic error and again on
9/11/2005 to fix accidental omission of the last block (thanks cougio);
modified the method to be a standalone static for easier cut and paste
into applications; added comments for readability.