Tuesday, July 18, 2017

Upcoming RegExp Features

Regular expressions, or RegExps, are an important part of the JavaScript language. When used properly, they can greatly simplify string processing.

ES2015 introduced many new features to the JavaScript language, including significant improvements to the regular expression syntax with the Unicode (/u) and sticky (/y) flags. But development has not stopped since then — in tight collaboration with other members at TC39 (the ECMAScript standards body), the V8 team has proposed and co-designed several new features to make RegExps even more powerful.

These features are currently being proposed for inclusion in the JavaScript specification. Even though the proposals have not been fully accepted, they are already at Stage 3 in the TC39 process. We have implemented these features behind flags (see below) in order to be able to provide timely design and implementation feedback before the specification is finalized.

In this blog post we want to give you a preview of this exciting future. If you'd like to follow along with the upcoming examples, enable experimental JavaScript features at chrome://flags/#enable-javascript-harmony.

Named Captures

Regular expressions can contain so-called captures (or groups), which can capture a portion of the matched text. So far, developers could only refer to these captures by their numeric index, which is determined by the position of the capture within the pattern.

But regular expressions are already notoriously difficult to read, write, and maintain, and numeric references can add further complications. For instance, in longer patterns it can be tricky to determine the index of a particular capture:

/(?:(.)(.(?<=[^(])(.)))/ // Index of the last capture?

And even worse, changes to a pattern can potentially shift the indices of all existing captures:

/(a)(b)(c)\3\2\1/ // A few simple numbered backreferences.
/(.)(a)(b)(c)\4\3\2/ // All need to be updated.

Named captures are an upcoming feature that helps mitigate these issues by allowing developers to assign names to captures. The syntax is similar to Perl, Java, .Net, and Ruby:

Unicode Property Escapes

Regular expression syntax has always included shorthands for certain character classes. \d represents digits and is really just [0-9]; \w is short for word characters, or [A-Za-z0-9_].

With Unicode awareness introduced in ES2015, there are suddenly many more characters that could be considered numbers, for example the circled digit one: ①; or considered word characters, for example the Chinese character for snow: 雪.

Neither of these can be matched with \d or \w. Changing the meaning of these shorthands would break existing regular expression patterns.

Instead, new character classes are being introduced. Note that they are only available for Unicode-aware RegExps denoted by the /u flag.

Lookbehind Assertions

Lookahead assertions have been part of JavaScript’s regular expression syntax from the start. Their counterpart, lookbehind assertions, are finally being introduced. Some of you may remember that this has been part of V8 for quite some time already. We even use lookbehind asserts under the hood to implement the Unicode flag specified in ES2015.

The name already describes its meaning pretty well. It offers a way to restrict a pattern to only match if preceded by the pattern in the lookbehind group. It comes in both matching and non-matching flavors:

Acknowledgements

This blog post wouldn’t be complete without mentioning some of the people that have worked hard to make this happen: especially language champions Mathias Bynens, Dan Ehrenberg, Claude Pache, Brian Terlson, Thomas Wood, Gorkem Yakin, and Irregexp guru Erik Corry; but also everyone else who has contributed to the language specification and V8’s implementation of these features.