Friday, 19 December 2008

Perl-style regular expressions treat 12 punctuation characters as metacharacters outside character classes. These characters need to be escaped with a backslash if you want to include them as literal characters in your regex:

.^$|*+?()[{\

Inside character classes, these flavors treat a different set of 4 punctuation characters as metacharacters. Only those 4 need to be escaped to be included literally in character classes:

]^-\

The POSIX ERE flavor, which Perl derives from, has strict rules about escaping characters with a backslash. Outside character classes, only metacharacters may be escaped. Escaping anything else is an error. Inside character classes, POSIX ERE treats the backslash as a literal, so you can’t use it to escape anything. Clever placement, as in []^-], is then your only option for including the other 3 metacharacters as literals.
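To make the placement trick concrete, here is a minimal sketch using Python's re module. That choice is an assumption made purely for illustration: Python's re is a Perl-style flavor rather than POSIX ERE, but it also reads a ] placed first and a - placed last in a character class as literals.

```python
import re

# Placement only: ] comes first, ^ is not first, - comes last.
# No backslashes are needed for any of the three.
by_placement = re.compile(r"[]^-]")

# The same class with each character escaped, which Perl-style
# flavors accept but POSIX ERE does not.
by_escaping = re.compile(r"[\]\^\-]")

sample = "a]b^c-d"
placement_hits = by_placement.findall(sample)
escaping_hits = by_escaping.findall(sample)
print(placement_hits, escaping_hits)
```

Both patterns should find the same three characters, which is why the escaped form buys nothing in flavors that understand placement.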

Perl is more flexible. It allows all punctuation characters to be escaped. The only error is escaping a letter that doesn’t form a token with a special meaning. E.g. \b is a word boundary, while \J is an error.
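The same kind of rule can be observed in other Perl-style flavors. The sketch below uses Python's re module as a stand-in (an assumption for illustration, not a claim about Perl itself); the helper name compiles is hypothetical.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# \b is a recognized token (word boundary), so this compiles.
word_boundary_ok = compiles(r"\bcat\b")

# \J has no special meaning; Python's re rejects unknown letter escapes.
unknown_letter_ok = compiles(r"\J")

# Escaped punctuation is tolerated, even though none of it is needed here.
escaped_punct_ok = compiles(r"\&\;\<\>")

print(word_boundary_ok, unknown_letter_ok, escaped_punct_ok)
```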

Thus, in Perl, the regular expressions &[lg]t; and \&[lg]t\; are equivalent, and a lot of developers use the second variant in their code. It seems a lot of people like to escape punctuation “just in case”. Don’t needlessly escape literal characters. It’s a bad habit with several bad effects.
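A quick way to convince yourself that the backslashes add nothing is to run both variants against the same text. This is a minimal sketch in Python's re module, chosen here as an assumed stand-in for a Perl-style flavor; the sample text is made up.

```python
import re

text = "a &lt; b &gt; c &amp; d"

# Neither & nor ; is a metacharacter, so no escaping is required.
plain = re.findall(r"&[lg]t;", text)

# The "just in case" variant matches exactly the same substrings.
escaped = re.findall(r"\&[lg]t\;", text)

print(plain, escaped)
```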

It makes you look like a newbie when you don’t know which characters really need to be escaped, and which don’t.

A few extra backslashes can quickly grow into a forest of backslashes. Those two regular expressions, when included as strings in Java source code, become "&[lg]t;" and "\\&[lg]t\\;". Imagine a long regex with lots of backslashes, some required, some superfluous, but all doubled up. The regular expression syntax is difficult to read as it is. Don’t make it even more complicated.
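The doubling is not unique to Java. As a rough sketch, ordinary (non-raw) Python string literals behave the same way, which makes the cost of each superfluous backslash easy to see. Python is used here only as an assumed illustration language.

```python
# The regex without needless escapes needs no backslashes at all.
plain = "&[lg]t;"

# The needlessly escaped regex, written as an ordinary string literal:
# every regex backslash must be doubled, just as in Java source code.
escaped_doubled = "\\&[lg]t\\;"

# A raw string avoids the doubling, but the regex itself is still noisier.
escaped_raw = r"\&[lg]t\;"

same_regex = escaped_doubled == escaped_raw
print(len(plain), len(escaped_raw), same_regex)
```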

But most importantly: you, or other people copying your regex, may run into flavor-specific issues when escaping certain literals. A regex that works fine in Perl may fail in .NET or on the command line with egrep (which uses POSIX ERE).

As I’ve mentioned, Perl doesn’t allow literal letters to be escaped. The .NET developers got this wrong when imitating the Perl syntax, and don’t allow any word characters to be escaped. Word characters include the underscore. \_ matches a literal underscore in Perl, but causes an error in .NET.

In many regex flavors, including the GNU implementation of POSIX ERE used in GNU egrep and many other open source projects, \< and \> aren’t needlessly escaped angle brackets. They’re word boundaries, matching the start of a word or the end of a word. GNU can extend POSIX this way, because \< and \> were illegal in the POSIX standard, and thus could never occur in a regex. Perl’s flexibility means that these word boundaries can never be supported by Perl using the same syntax, because it would break too many regular expressions created by developers who don’t know that angle brackets aren’t metacharacters. This creates a practical problem for developers who use both Perl and GNU utilities: Perl will happily take the GNU ERE regex \<word\>, but it won’t work as intended. GNU egrep will match word as a whole word only, while Perl looks for <word>.
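The mismatch is easy to reproduce in any flavor that treats escaped punctuation as a literal. The sketch below uses Python's re module, which (like Perl, as described above) reads \< and \> as plain angle brackets; Python is an assumption chosen for illustration, and the sample strings are made up.

```python
import re

# Written with GNU-style word boundaries in mind.
gnu_style = re.compile(r"\<word\>")

# In this flavor the backslashes are ignored, so the pattern
# looks for the literal text "<word>" instead of a whole word.
in_plain_text = gnu_style.search("a word here")
in_bracketed_text = gnu_style.search("a <word> here")

# The flavor's own word boundary token gives the intended behavior.
whole_word = re.compile(r"\bword\b")
finds_word = whole_word.search("a word here") is not None
finds_in_swords = whole_word.search("swords") is not None

print(in_plain_text, in_bracketed_text, finds_word, finds_in_swords)
```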

Because of this, RegexBuddy flags \< and \> as an error on the Create tab when you’re using a flavor that doesn’t support these as word boundaries. It does this to make sure you’re not accidentally trying to use these as word boundaries with flavors that don’t support them. To make the error go away, either remove the needless backslash if you meant to match a literal angle bracket, or double-click the error on the Create tab to replace the escaped angle bracket with a lookaround combination that emulates the word boundary.
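One common way to emulate start-of-word and end-of-word boundaries with lookaround is sketched below in Python's re module. This is an assumption about what such an emulation can look like, not necessarily the exact replacement RegexBuddy inserts.

```python
import re

# Start of word: no word character before, a word character after.
START_OF_WORD = r"(?<!\w)(?=\w)"

# End of word: a word character before, no word character after.
END_OF_WORD = r"(?<=\w)(?!\w)"

whole_word = re.compile(START_OF_WORD + "word" + END_OF_WORD)

sample = "word sword words word."
starts = [m.start() for m in whole_word.finditer(sample)]
print(starts)
```

Only the standalone occurrences should match; the "word" inside "sword" and "words" is rejected by the lookbehind and lookahead respectively.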

On the Test tab, RegexBuddy does not complain about escaped angle brackets if the selected regex flavor doesn’t complain about them either. So you can ignore RegexBuddy’s warning and see the same results with escaped angle brackets in RegexBuddy as with the actual regex engine that you’ve asked RegexBuddy to emulate.
