WL#353: Better REGEXP package

We should try to replace the current REGEXP package (Henry Spencer's) with a
newer one because the current one is extremely slow when you search after
multiple strings:
text REGEXP 'tom|david|monty'
Sergei: Not only because it's slow, but in 4.1 the main reason is - because it
only works with simple charsets, no multi-byte, no complex (german) collation rules.
The challenge is to find a compatible but faster REGEXP library that we can
modify to work with our character set handling...
Examples of multi-byte aware regex libraries: pcre, tre.
References:
- BUG#30241 Regular expression problems

Possible candidates :
From Matt (pcre) :
In my experience, many of the people who really love regexp are also fairly
heavy users of perl. So my first thought would be:
http://www.pcre.org/
Perhaps that's a bit too big/heavy for our purposes though?
Looks like it uses the standard BSD license, which AFAIK is good for us.
Attached is the LICENCE (sic) file from the pcre-8.33.tar.gz file.
They also note how you can avoid the GPL by building against libedit in the
README file:
"
. It is possible to compile pcretest so that it links with the libreadline
or libedit libraries, by specifying, respectively,
--enable-pcretest-libreadline or --enable-pcretest-libedit
If this is done, when pcretest's input is from a terminal, it reads it using
the readline() function. This provides line-editing and history facilities.
Note that libreadline is GPL-licenced, so if you distribute a binary of
pcretest linked in this way, there may be licensing issues. These can be
avoided by linking with libedit (which has a BSD licence) instead.
"
More good news, it built very quickly and w/o a hitch on my OS X 10.8 laptop:
noob:pcre-8.33 matt$ make test
make check-am
make
make all-am
make[3]: Nothing to be done for `all-am'.
make check-TESTS
PASS: pcrecpp_unittest
PASS: pcre_scanner_unittest
PASS: pcre_stringpiece_unittest
PASS: RunTest
PASS: RunGrepTest
make all-am
make[5]: Nothing to be done for `all-am'.
============================================================================
Testsuite summary for PCRE 8.33
============================================================================
# TOTAL: 5
# PASS: 5
# SKIP: 0
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0
============================================================================
So we know it's not terribly linux-centric source, and it will hopefully not
require much work on our part to get it to build on all supported platforms.
Windows builds may require cygwin though...

Objective:
To replace the current regex library with a new one as the current one
is slow and doesnt have multi-byte support.
Make the regex engine pluggable, so that any library can be added as a plugin.
Multi-byte aware REGEX libraries:
PCRE --> BSD license. Widely Used. Used by Apache.
TRE --> 2-clause BSD-like license. Supports Approximate matching.
RE2 --> BSD License. Uses automata theory instead of backtracking.
very fast for large inputs. Useful in multithreaded environments
where
thread stacks cannot grow arbitrarily large.
developed and used by Google.
Oniguruma --> BSD license. Comparable speeds to RE2 and PCRE. Looks promising.
ICU Regex --> ICU License. Inferior speeds compared to most of the libraries
above.
Boost:Regex-> Boost License. Pro: Supports different locales. Con: Has no native
UTF-8 support.Must be used over ICU Regex library to enable UTF
support.
Conclusions from above.
I have spoken to Tor regarding Boost:Regex. We cannot use it ATM, probably
early next year.
TRU, ICU Regex is slow.
RE2, PCRE seem really good candidates.
For general comparison among all the regex libraries, these are a few good
links.
http://lh3lh3.users.sourceforge.net/reb.shtml
http://softwareramblings.com/2008/08/c-regular-expression-performance.html
http://www.boost.org/doc/libs/1_41_0/libs/regex/doc/gcc-performance.html
https://github.com/ellzey/libevhtp/issues/31
A good comparison (in terms of speed) of PCRE and RE2 is detailed here.
http://swtch.com/~rsc/regexp/regexp3.html
At this point, I will be narrowing down my research to only PCRE and RE2.
Some notes on PCRE and RE2:
Charsets:
-> RE2 supports only UTF-8 and Latin-1.Collating elements and collation classes
are not supported.
-> PCRE supports UTF 8,16,32.
-> PCRE has complete UNICODE property support, while RE2 has some support.
Since the structures used are different from what we current use (CHARSET_INFO),
mapping of our collation rules for a particular collation (like case
insensitivity, may be multi-line etc) needs to be done so that the new library
understands it.
A few approaches that can be taken:
1)Convert everything to UTF-8 or a standard wide character set, send it to the
new RE library and convert back the result. Slow and not-so-good approach.
2)Check out the rules in specific collations and supply them as arguments.For
ex: utf_general_ci is case-insensitive. So we can pass in PCRE_CASEINSENSITIVE
and so on.
But method (2) may not be applicable to all collation rules. So we have to
shift back to method (1), in those cases. This may end up deteriorating
performance at times.
Combination of the above 2 approaches is a pretty good and doable approach.
But there is a better approach with some limitations. And thats method (3).
3)Change PCRE/RE2 so that it complies with our charset.But how feasible is
this?Changing any of these libraries following a similar approach as Bar had
taken to change original Henry Spencers Regex to suit our needs, is possibly the
best option, but it will require lots of code changes (and time). Also the new
libraries are much bigger than the current one and a drawback of this would be
that integrating future versions will become a pain.
Continuing with the comparison between PCRE and RE2,
Development Status and Stability:
Both have stable versions released and further development and enhancement is
going on.
Multithreading, global variables and locks:
PCRE uses no global variables except, the call to functions pcre_free,
pcre_malloc and of the kind.
The compiled form of a regular expression is not altered during matching, so
the same compiled pattern
can safely be used by several threads at once.
If the just-in-time (JIT) optimization feature is being used, it needs separate
memory stack areas for each thread. Here pthreads are used.
(I dont think we will be using the JIT feature so I didn't read up about that)
RE2 only uses global mutexs and call to mutex related functions.
Each of the 2 has their own malloc free realloc functions.I guess they can be
tweaked to suit our purposes.
Both the libraries are thread safe.
Quoting from one of the links above: RE2's DFA is more thread friendly.
In both the cases, matching can run simultaneosuly in different threads with the
compiled
regular expressions.Only during compilation some locking is involved in RE2,
most of ehich is
around memeory allocation and deallocation.
Both the libraries do not use setjmp/longjmp() as they slow down things.But PCRE
uses it when NO_RECURSE is it.
Memory Usage:
PCRE can use a large recursive stack and have exponential runtime on certain
patterns. RE2 uses a fixed stack and guarantees that run-time increases linearly
with the size of the input. The maximum memory allocated with RE2 can be
configured if you have good knowledge of the workings of its code.
Google's RE2 has a slightly smaller set of features than PCRE, but has very
predictable run-time and a maximum memory allotment. Making it suitable for use
in server applications which require precise boundaries on memory usage and
computational time. PCRE on the other hand, has almost all of the features that
a regular expression library can have. But PCRE's run-time and memory usage is
not predictable and can grow unbounded.
The compiled form of an RE2 is bigger than that of a PCRE object, a few
kilobytes vs a few hundred bytes for a typical small regular expression. RE2
does more analysis of the regexp during compilation and stores a richer form
than PCRE. Thus compilation time of RE2 is 3-4x more than PCRE.
Error code:
Both of them have their own error codes which can be tweaked to conform to
mysql error code standards.
Signal and exception Handling:
As per my reading up, nothing much is done on these lines in the libraries.We
can
handle it on the server side, just as we are doing now.
Pros and Cons:
-> RE2 doesnt support back references.
-> It is slow compared to PCRE for small strings. Real fast for long strings.
-> It is completely automata based.Does everything in linear time. PCRE may end
up
taking exponential time.
-> In case of RE2, no need to worry about stack overflow, contrary to PCRE where
a
minimum amount of stack space is always required (unless NO_RECURSE is set,
which may slow down things).
Small tests performed by me.
I compiled PCRE with UTF support (--enable-utf) and RE2.
The BUG#30241 seems to be solved.
Also I wrote a small API which calls either of the 2 libraries based on command
line arguments.
Some BenchMark test Results (performed by me):
BenchMark test Results:
--------------------------------------------------------------------------------
----------------
| Time taken for 1000000 runs
(in seconds)
| | my_regex |
PCRE
--------------------------------------------------| | |
RE: '^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)' | 6.95 |
3.03
String: 'http://www.linux.com/' | |
--------------------------------------------------------------------------------
----------------
RE: 'usd [+-]?[0-9]+.[0-9][0-9]' | 2.03 |
0.46
String: 'same same same' | |
--------------------------------------------------------------------------------
----------------
RE: '\b(\w+)(\s+\1)+\b' | 0.17 |
23.06
String: 'http://www.thelinuxshow.com/main.php3' | |
--------------------------------------------------------------------------------
----------------
PCRE is real slow for the third regex under all strings.RE2 is about 10.06
seconds.
To the best of my knowledge, I have done everything correctly, but unknowing
mistakes may exist.
ONIGURUMA:
I didnt research much upon this unless suddenly it looked quite promising. USed
by TextMate, SublimeText, PHP.
The characteristics of this library is that different character encoding
for every regular expression object can be specified.
The code base and files are similar to what we have.May be this is modifiable?
This is the ONIG character class.
http://pastebin.no.oracle.com/index.php/view/90543385
This too is a multi-byte library, solves the BUG#30241 and multithreading
support can be enabled.In that case no global variables are used.
It has almost the same set of features than PCRE and uses backtracking as well.
Error code, again easily mappable. Memory allocation adn deallocation is similar
to what we have currently, and can be tweaked.
There is not much you can find in google about Onig, compared to PCRE.
There are loads of documents/ test/ comparisons about PCRE but not so much for
oNIG.
Making things pluggable:
What can be done here is, isolate the current library regex/ code and place it
in the plugins folder.
Make a generic interface on the server side, with one or may be two generic
function calls.
This generic function call will speak to the required library (based on the .so
file included).
Now, for every new library we add as a plugin, a 'bridge' / interface has to be
written which
will do the required communication from server to the regex engine.
Now this bridge, can either convert everything to UTF-8 (or a wide character
set) by calling
the sql string service functions, or amy be something else.
If we follow method (3) as mentioned above, every library has to be tweaked to
support our CHARSET_INFO
structure.Then this bridge concept is not required.
Conclusins:
In my opinion, since we will make it pluggable somehow (by somehow I mean,
either charset conversion or writing an interface),
let us first change the library to ONIG or PCRE or may be both and then add
support to the other.We will keep the one we have also.
I dont think RE2 will be great choice to us.The major advantages of RE2 being
-> Regex which have '|'
-> Linear time in worst case and no stack overflow.
We can do a slight trade off on point number (1) and check for stack overflow as
we are doing.
RE2 has cons also as mentioned above.
TODO:
Make calls to regexp_match multithreaded.