Properly “internationalized” regular expressions in R

We should pay special attention to writing truly portable code that behaves the same way under different locales and character encodings. Currently, R has two regex engines: ERE (extended regular expressions, via TRE) and PRE (Perl-compatible regular expressions, via PCRE). Surprisingly, they may give different results depending on the operating system and the native character encoding in use!

UPDATE@2013-07-10: check out our stringi package to get rid of such problems forever!

PCRE often outperforms ERE and has a more powerful syntax. Moreover, it was built into R with Unicode support. As UTF-8 can represent virtually all printable characters used around the world, it is a good idea to always use PRE on normalized character vectors, i.e. vectors converted from the native encoding to UTF-8 with enc2utf8() and then, after regexing, converted back with enc2native().
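A minimal sketch of this round trip in base R is given below. The helper name gsub_utf8 and the sample string are assumptions made for illustration; only gsub(), enc2utf8() and enc2native() come from R itself.

```r
# Hypothetical helper (not part of base R): run a Perl-compatible
# substitution on UTF-8 normalized input, then convert the result
# back to the native encoding.
gsub_utf8 <- function(pattern, replacement, x) {
  out <- gsub(enc2utf8(pattern), enc2utf8(replacement), enc2utf8(x),
              perl = TRUE)
  enc2native(out)
}

# "zażółć gęślą jaźń", written with \u escapes so this file stays ASCII
x <- "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"

# Replace every lowercase letter (Unicode property Ll) with "x"
masked <- gsub_utf8("\\p{Ll}", "x", x)
print(masked)  # "xxxxxx xxxxx xxxx"
```

Keep in mind that the final enc2native() step can only be faithful for characters the native encoding is able to represent; anything else may come back in an escaped form.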

Here’s an example of matching some character classes in three different locales. The string in which matches were sought consisted of all ASCII characters (codes 1–127) and Polish letters (ą, ę, ł, ś, ż, and so on).

Locale

Pattern

pl_PL.UTF-8 (GNU/Linux)

pl_PL.iso-8859-2 (GNU/Linux)

Polish_Poland.1250 (Windows)

ERE-Native

[[:alpha:]]

AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż

AB...Zab...zĆĘŃÓćęńó

[[:digit:]]

0123456789

0123456789ął

[[:lower:]]

ab...ząćęłńóśźż

ab...zćęńó

[[:upper:]]

AB...ZĄĆĘŁŃÓŚŹŻ

AB...ZĆĘŃÓ

[[:punct:]]

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ĄŁŚŹŻąłśźż

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ĄŁŻąłż

[A-Z]

AB...Z

[a-z]

ab...z

PCRE-Native

[[:alpha:]]

AB...Zab...z

AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż

[[:digit:]]

0123456789

[[:lower:]]

ab...z

ab...ząćęłńóśźż

[[:upper:]]

AB...Z

AB...ZĄĆĘŁŃÓŚŹŻ

[[:punct:]]

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

[A-Z]

AB...Z

[a-z]

ab...z

ERE-UTF-8 normalized

[[:alpha:]]

AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż

AB...Zab...zÓó

[[:digit:]]

0123456789

[[:lower:]]

ab...ząćęłńóśźż

ab...zó

[[:upper:]]

AB...ZĄĆĘŁŃÓŚŹŻ

AB...ZÓ

[[:punct:]]

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

[A-Z]

AB...Z

[a-z]

ab...z

PCRE-UTF-8 normalized

\p{L}

AB...Zab...zĄĆĘŁŃÓŚŹŻąćęłńóśźż

\p{N}

0123456789

\p{Ll}

ab...ząćęłńóśźż

\p{Lu}

AB...ZĄĆĘŁŃÓŚŹŻ

\p{P}

!"#%&'()*,-./:;?@[\]_{}

\p{S}

$+<=>^`|~

\p{S}|\p{P}

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

[A-Z]

AB...Z

[a-z]

ab...z

We see that PCRE applied after a “normalization” with enc2utf8(), combined with Unicode property classes such as \p{L}, gives correct results in all the locales.
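The rows of the last table block can be checked with a short script along these lines. This is only a sketch: the helper match_class is a hypothetical name, and the order of characters in the test string (ASCII first, then the Polish letters) is an assumption about how the original string was built.

```r
# Polish letters (uppercase, then lowercase), written with \u escapes
polish <- paste0(
  "\u0104\u0106\u0118\u0141\u0143\u00d3\u015a\u0179\u017b",
  "\u0105\u0107\u0119\u0142\u0144\u00f3\u015b\u017a\u017c"
)
# All ASCII characters with codes 1-127
ascii <- rawToChar(as.raw(1:127))
text <- enc2utf8(paste0(ascii, polish))

# Hypothetical helper: collect every match of `pattern` in `text`
# and glue the matches into one string, like a cell of the table
match_class <- function(pattern, text) {
  m <- gregexpr(enc2utf8(pattern), text, perl = TRUE)
  paste(regmatches(text, m)[[1]], collapse = "")
}

upper  <- match_class("\\p{Lu}", text)  # A-Z plus 9 Polish capitals
digits <- match_class("\\p{N}", text)   # "0123456789"
print(nchar(upper))  # 35
print(digits)
```

Running the same script under each locale of interest and comparing the output is one way to confirm that the results do not depend on the native encoding.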

An example:

gregexpr(enc2utf8(pattern), enc2utf8(text), perl=TRUE)
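To make the call above concrete, here is a small sketch with sample values for text and pattern (both chosen only for illustration). It shows that, on UTF-8 input, the match positions are reported in characters rather than in bytes.

```r
# "gęślą jaźń": two words made mostly of non-ASCII letters
text <- "g\u0119\u015bl\u0105 ja\u017a\u0144"
pattern <- "\\p{L}+"  # one or more Unicode letters

m <- gregexpr(enc2utf8(pattern), enc2utf8(text), perl = TRUE)[[1]]

starts  <- as.integer(m)              # start of each match, in characters
lengths <- attr(m, "match.length")    # length of each match, in characters
print(starts)   # 1 7
print(lengths)  # 5 4
```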

With the stringr package you may use e.g.:

str_extract_all(enc2utf8(text), perl(enc2utf8(pattern)))

Note that regexec() (and str_match_all() from stringr) currently do not support PRE. However, you may use gregexpr() instead.
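One possible way to recover capture groups from gregexpr() is sketched below. It relies on the "capture.start" and "capture.length" attributes that base R attaches to the result when perl = TRUE; the sample string and pattern are assumptions made for illustration.

```r
# "kąt=90, bok=12": two key=value pairs, the first key is non-ASCII
x <- enc2utf8("k\u0105t=90, bok=12")
pattern <- "(\\p{L}+)=(\\p{N}+)"  # group 1: letters, group 2: digits

m <- gregexpr(pattern, x, perl = TRUE)[[1]]

# Matrices with one row per match and one column per capture group
st  <- attr(m, "capture.start")
len <- attr(m, "capture.length")

# Cut the groups out of the string; keep the row = match layout
groups <- matrix(substring(x, st, st + len - 1L), nrow = nrow(st))
print(groups)
```

Each row of groups then holds the captured pieces of one match, which is roughly the information regexec() would return for a single match.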