Regular Expressions in Arc

Seeing as regular expressions are so useful in my day to day programming in both Perl and Ruby, I think it would be a useful thing to have them built into Arcueid as well, but as I am doing this I wonder at how exactly it should be done.

Obviously, matching a regex should be just as easy as applying the regex, e.g. (/(a?)a?(a?aaa)/ "aaaaaaaaaaaaaaa") should be sufficient, but then what is the value of such an expression? I suppose it should of course be nil if the match fails, but then what happens if the match succeeds?

There are several ways we could go about it:

Just return an integer with the position in the string at which the match began, and then implicitly bind $1 to be the first captured sub-string, $1 to be the second, and so forth. These variables are essentially global in scope. It is how Perl and Ruby do it though.

Make a regex match return a list with the position at which the match began and any sub-string captures (backreferences) that were created. So you can do something like: (let (pos sa sb sc) (/(a+)(b+)(c+)/ str) ... ). I do think this would be cleaner, and it would be possible I think to write a macro where one could do: (=~ /(a+)(b+)(c+)/ str ... $1 ... $2) where $1, $2, and $3 would be bound as in Perl.

And then there is the matter of a regular expression library. I’m currently trying to adapt the Plan 9 regular expression library but have found that it doesn’t support a lot of the stuff I’ve come to expect from regular expressions: in particular, it doesn’t even support something as simple as non-capturing groups, i.e. (?: ...).

There are several other libraries I’ve been looking at:

Oniguruma. Suffers from two issues: first, it’s bigger than the rest of Arcueid at the moment. Second, it does recursive backtracking and can take exponential time with some regexes. Try doing /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaa/ =~ "aaaaaaaaaaaaaaaaaaaaaaaaaaaaa" in Perl or Ruby that uses the same algorithm. That simple match takes over a minute on my 2.13 GHz Core i3 machine! The same match is almost instantaneous using the Plan 9 regex library.

libpcre. Suffers from the same exponential time backtracking problems as Oniguruma. Is also three-clause BSD, so I can’t use it without changing my license as well.

TRE. Seems to be better, but I’ll need to pare it down to something more minimal.

Other suggestions on a small (<5000 SLOC), efficient, Unicode-capable regex library are welcome. Text processing is something web apps do a lot of, and one of the basic tools for doing that is regular expressions. This, I think deserves to be baked into the language runtime even more than support for big numbers (although that's something that might become baked in sometime soon too). Arcueid may even get support for taint mode à la Perl after this.

One Response to “Regular Expressions in Arc”

> Rex is a regular expression library which implements most of the POSIX 1003.2 specification. It is very mature and has been in heavy production use by thousands of users, on many platforms, since 2002.