Regular Expression Inconsistencies With Unicode

A casual stroll through the world of Unicode and regular expressions. Photo by Presidio of Monterey

Character classes in regular expressions are an extremely useful and widespread feature, but there are some relatively recent changes that you might not know of.

The issue stems from how different programming languages, locales, and character encodings treat predefined character classes. Take, for example, the expression \w which was introduced in Perl around the year 1990 (along with \d and \s and their inverted sets \W, \D, and \S).

The \w shorthand is a character class that matches “word characters” as the C language understands them: [a-zA-Z0-9_]. At least, that simple fact was true when ASCII was the main player in the character encoding scene. With the standardization of Unicode and UTF-8, the meaning of \w has become foggier.

Beginning with Perl 5.12, released in 2010, character classes are handled differently. Documentation on the topic is found in perlrecharclass. The rules aren’t as simple as in some languages, but they can be summarized as follows:

\w will match Unicode characters with the “Word” property (equivalent to \p{Word}), unless the /a (ASCII) flag is enabled, in which case it will be equivalent to the original [a-zA-Z0-9_].

However, you should know that for code points below 256, these rules can change depending on whether Unicode or locale rules are on, so if you’re unsure, consult the perlre and perlrecharclass documentation.

Keep in mind that these same questions of what the character classes include can apply to every predefined character class in whatever language you’re using, so remember to check language-specific implementations for other character class shorthands, such as \s and \d, not just \w.
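To make that concrete, here is a minimal sketch in Python (chosen only for illustration; the sample strings are made up, and other languages may behave differently). It shows that \d and \s have the same Unicode-versus-ASCII split as \w:

```python
import re

# Sample text: ASCII 7, ARABIC-INDIC DIGIT THREE, DEVANAGARI DIGIT THREE.
digits = "7\u0663\u0969"

# On str patterns, \d matches any Unicode decimal digit by default.
unicode_digits = re.findall(r"\d", digits)

# With re.ASCII, \d is limited to [0-9].
ascii_digits = re.findall(r"\d", digits, flags=re.ASCII)

# NO-BREAK SPACE (U+00A0) counts as whitespace for \s unless re.ASCII is set.
nbsp = "\u00a0"
unicode_space = re.fullmatch(r"\s", nbsp) is not None
ascii_space = re.fullmatch(r"\s", nbsp, flags=re.ASCII) is not None

print(unicode_digits, ascii_digits, unicode_space, ascii_space)
```

If input validation relies on \d to mean "0 through 9", the default behavior here may accept more than intended.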

Every language seems to do regular expressions a little bit differently, so here’s a short, incomplete guide for several other languages we use frequently.

Python also treats \w differently depending on the pattern type and flags. Let’s take a look at the Python docs:

\w

For Unicode (str) patterns:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).

For 8-bit (bytes) patterns:

Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.
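The difference between the two pattern types quoted above can be sketched as follows. The sample string is an assumption chosen for illustration, and the bytes are assumed to be UTF-8 encoded:

```python
import re

# "café_1" written with a precomposed é (U+00E9).
text = "caf\u00e9_1"

# str pattern: \w follows Unicode rules, so the accented letter is included.
str_words = re.findall(r"\w+", text)

# bytes pattern: \w is ASCII-only, so the two UTF-8 bytes that encode é
# (0xC3 0xA9) are not word characters and split the match in two.
bytes_words = re.findall(rb"\w+", text.encode("utf-8"))

print(str_words)
print(bytes_words)
```

The same text therefore yields one "word" as a str and two as bytes, which is worth remembering when a program matches against raw file or network data.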

So \w includes “most characters that can be part of a word in any language, as well as numbers and the underscore”. An exact list of those characters is difficult to pin down, so it would be best to use the re.ASCII flag as suggested whenever you’re unsure whether you want letters from other languages matched.
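A minimal sketch of that advice, using a made-up sample string, compares the default behavior, the re.ASCII flag, and the explicit character class the docs mention:

```python
import re

# "naïve café 123" with precomposed ï (U+00EF) and é (U+00E9).
text = "na\u00efve caf\u00e9 123"

# Default for str patterns: Unicode word characters are matched.
default_words = re.findall(r"\w+", text)

# re.ASCII restricts \w (and \d, \s, \b) to their ASCII definitions
# for the whole pattern.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# An explicit class limits only this one part of the pattern.
explicit_words = re.findall(r"[a-zA-Z0-9_]+", text)

print(default_words)
print(ascii_words)
print(explicit_words)
```

With the flag or the explicit class, the accented letters act as word boundaries, so "naïve" is split into "na" and "ve". Whether that is what you want depends on the data you expect.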

The excellent article "💩".length === 2 by Jonathan New goes into detail about why this is, and explores various solutions. It also addresses some legacy inconsistencies, like how the old HEAVY BLACK HEART character and other older Unicode symbols might be represented differently.
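Assuming the article's title refers to string length being counted in UTF-16 code units, the arithmetic behind it can be sketched in Python. Python itself counts code points, so the UTF-16 count is computed by encoding. The variable names are only illustrative:

```python
# PILE OF POO (U+1F4A9) lies outside the Basic Multilingual Plane.
poo = "\U0001F4A9"

# Python str length counts code points.
code_points = len(poo)

# UTF-16 needs a surrogate pair (two 16-bit units) for this character.
utf16_units = len(poo.encode("utf-16-le")) // 2

# UTF-8 needs four bytes for it.
utf8_bytes = len(poo.encode("utf-8"))

# HEAVY BLACK HEART (U+2764) followed by VARIATION SELECTOR-16 (U+FE0F):
# displayed as one symbol, but stored as two code points.
heart = "\u2764\ufe0f"
heart_code_points = len(heart)

print(code_points, utf16_units, utf8_bytes, heart_code_points)
```

The same visible character can therefore have a "length" of 1, 2, or 4 depending on the unit being counted, which in turn affects what a single regex atom such as . consumes.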

PHP

PHP’s documentation explains that \w matches letters, digits, and the underscore as defined by your locale. It’s not totally clear about how Unicode is treated, but PHP uses the PCRE (Perl Compatible Regular Expressions) library, which supports a /u flag that can be used to enable Unicode matching in character classes.

.NET

The .NET Quick Reference has a comprehensive guide to character classes. For word characters, it defines a specific group of Unicode categories including letters, modifiers, and connectors from many languages, but also points out that setting the ECMAScript Matching Behavior option will limit \w to [a-zA-Z_0-9], among other things. Microsoft’s documentation is clear and comprehensive with great examples, so I recommend referring to it frequently.

Go

Go follows the regular expression syntax used by Google’s RE2 engine, which makes it easy to choose whether Unicode characters are matched: \w is ASCII-only ([0-9A-Za-z_]), while Unicode classes such as \p{L} (letters) have to be requested explicitly.

grep

Implementations of grep vary widely across platforms and versions. On my personal computer with GNU grep 3.1, \w doesn't work at all with default settings, matches only ASCII characters with the -P (PCRE) option, and matches Unicode characters with -E.

Again, implementations vary a lot, so double check on your system before doing anything important.

Other links

As great as Unicode and regular expressions are, their implementations differ widely between languages and tools, and that introduces far more unexpected behavior than I can write about in this post. Whenever you combine Unicode and regular expressions, check the documentation for your language or tool to confirm that everything will work as expected.

Of course, this topic has already been discussed and written about at great length. Here are some links worth checking out:

ftfy for Python - ftfy is a Python library that takes corrupt Unicode text and attempts to fix it as best it can. I haven’t yet had a chance to use it, but the examples are compelling and it’s definitely worth knowing about.