Introduction

Regular expressions are a well-established way of describing string patterns. The following regular expression defines a floating point number with a (possibly empty) integer part, a non-empty fractional part and an optional exponent:

[0-9]* \.[0-9]+ ([Ee](\+|-)?[0-9]+)?

The rules for interpreting and constructing such regular expressions are explained below. A regular expression parser takes a regular expression and a source string as arguments and returns the source position of the first match. Regular expression parsers either interpret the search pattern at runtime or compile the regular expression into an efficient internal form (known as a deterministic finite automaton). The regular expression parser described here belongs to the second category. Besides being quite fast, it also supports dictionaries of regular expressions. With the definitions $Int = [0-9], $Frac = \.[0-9]+ and $Exp = ([Ee](\+|-)?[0-9]+), the above regular expression for a floating point number can be abbreviated to $Int* $Frac $Exp?.
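As a rough illustration of "returns the source position of the first match", the following sketch runs the floating point pattern from above through std::regex. This is an assumption made for the sake of a runnable example: std::regex is not the parser described in this article, the function name findFirstFloat is made up, and the readability spaces are removed from the pattern because std::regex treats spaces literally.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns the position of the first floating point
// number in `text`, or std::string::npos if there is none.
// std::regex (ECMAScript grammar) stands in for the article's parser here.
std::size_t findFirstFloat(const std::string& text) {
    // Same pattern as in the introduction, written without spaces.
    static const std::regex pattern(R"([0-9]*\.[0-9]+([Ee](\+|-)?[0-9]+)?)");
    std::smatch match;
    if (!std::regex_search(text, match, pattern)) {
        return std::string::npos;
    }
    return static_cast<std::size_t>(match.position(0));
}
```

Calling findFirstFloat("x = 3.14e-2;") yields the index of the digit 3, while a string without a decimal point yields std::string::npos.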

Interface

I separated algorithmic from interface issues. The files RexAlgorithm.h and RexAlgorithm.cpp implement the regular expression parser using only standard C++ (relying on the STL), whereas the files RexInterface.h and RexInterface.cpp contain the interfaces for the end user. Currently there is only one interface, implemented in the class REXI_Search. Interfaces for replace functionality and for programming language scanners are planned for future releases.

Performance issues

A call to the member function REXI_Search::SetRegexp(strRegExp) involves quite a lot of computing. The regular expression strRegExp is analyzed and, after several steps, transformed into a compiled form. Because of this preprocessing work, which an interpreting regular expression parser does not need, this parser shows its efficiency only when you apply it to large input strings or when you search for the same regular expression again and again. A typical application which profits from the preprocessing is a utility which searches all files in a directory.
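The usage pattern this implies (compile once, search many times) can be sketched as follows. The sketch uses std::regex as a stand-in, since only SetRegexp is named above and the rest of the REXI_Search interface is not shown here. The function name countMatchingLines is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: counts how many lines contain at least one match.
// The pattern is compiled once, outside the loop, and reused for every line,
// so the preprocessing cost is paid a single time.
std::size_t countMatchingLines(const std::string& patternText,
                               const std::vector<std::string>& lines) {
    const std::regex pattern(patternText);  // preprocessing, done once
    std::size_t count = 0;
    for (const std::string& line : lines) {
        if (std::regex_search(line, pattern)) {  // search, done per line
            ++count;
        }
    }
    return count;
}
```

Constructing the pattern object inside the loop would give the same result, but would repeat the preprocessing for every line.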

Limitations

Currently Unicode is not supported. There is no fundamental reason for this limitation and I think that a later release will correct this. I just have not yet found an efficient representation of a compiled regular expression which supports Unicode.

Constructing regular expressions

Regular expressions can be built from characters and special symbols. There are some similarities between regular expressions and arithmetic expressions. The most basic elements of arithmetic expressions are numbers and expressions enclosed in parens ( ). The most basic elements of regular expressions are characters, regular expressions enclosed in parens ( ) and character sets. On the next higher level, arithmetic expressions have '*' and '/' operators, whereas regular expressions have operators indicating the multiplicity of the preceding element.

Most basic elements of regular expressions

Individual characters, e.g. "h" is a regular expression. In the string "this home" it first matches the 'h' in 'this' and, on a further search, the beginning of 'home'. For non-printable characters, one has to use either the notation \xhh, where h means a hexadecimal digit, or one of the escape sequences \n \r \t \v known from "C". Because the characters * + ? . | [ ] ( ) - $ ^ have a special meaning in regular expressions, escape sequences must also be used to specify these characters literally: \* \+ \? \. \| \[ \] \( \) \- \$ \^ . Furthermore, use '\ ' to indicate a space, because this implementation skips spaces in order to support a more readable style.

Character sets enclosed in square brackets [ ], e.g. "[A-Za-z_$]" matches any alphabetic character, the underscore and the dollar sign (the dash (-) indicates a range), so it matches "B", "b", "_", "$" and so on. A ^ immediately following the [ of a character set means 'form the inverse character set', e.g. "[^0-9A-Za-z]" matches non-alphanumeric characters.

Expressions enclosed in round parens ( ). Any regular expression can be used on the lowest level by enclosing it in round brackets.

The dot . means 'match any character'.

An identifier prefixed by a $. It refers to an already defined regular expression, e.g. "$Ident" stands for a previously defined, user-defined regular expression. Think of it as a regular expression enclosed in round parens, which has a name.

Operators indicating the multiplicity of the preceding element

Any of the above five basic regular expressions can be followed by one of the special characters * + ? \i

* meaning repetition (possibly zero times); e.g. "[0-9]*" not only matches "8" but also "87576" and even the empty string "".

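+ meaning at least one occurrence; e.g. "[0-9]+" matches "8", "9185278", but not the empty string.

? meaning at most one occurrence, i.e. the preceding element is optional; e.g. "[0-9]?" matches "8" and the empty string.

\i meaning that the preceding element is matched in a case insensitive way.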

Catenation of regular expressions

The regular expressions described above can be catenated to form longer regular expressions. E.g. "[_A-Za-z][_A-Za-z0-9]*" is a regular expression which matches any identifier of the programming language "C", namely the first character must be alphabetic or an underscore and the following characters must be alphanumeric or an underscore. "[0-9]*\.[0-9]+" describes a floating point number with an arbitrary number of digits before the decimal point and at least one digit following the decimal point. (The decimal point must be preceded by a backslash, otherwise the dot would mean 'accept any character at this place'.) "(Hallo (,\ how\ are\ you\?)?)\i" matches "Hallo" as well as "Hallo, how are you?" in a case insensitive way. (The spaces inside the greeting are escaped because unescaped spaces are skipped.)
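For readers who want to try the catenation examples, here is a minimal sketch using std::regex instead of the parser described in this article. The function names are invented. The case insensitive example uses the std::regex::icase flag in place of the \i suffix, and the spaces are written literally because std::regex does not skip them.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` as a whole is a "C" identifier.
bool isCIdentifier(const std::string& s) {
    static const std::regex ident("[_A-Za-z][_A-Za-z0-9]*");
    return std::regex_match(s, ident);
}

// Hypothetical helper: true if `s` as a whole is the greeting from the text,
// compared without regard to case (std::regex::icase replaces the \i suffix).
bool isGreeting(const std::string& s) {
    static const std::regex greeting(R"(Hallo(, how are you\?)?)",
                                     std::regex::icase);
    return std::regex_match(s, greeting);
}
```

With these definitions, isCIdentifier accepts "_myHiddenVariable" and rejects "9lives", and isGreeting accepts both "Hallo" and "HALLO, HOW ARE YOU?".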

Alternative regular expressions

Finally - on the top level - regular expressions can be separated by the | character. The two regular expressions on the left and right side of the | are alternatives, meaning that either the left expression or the right expression should match the source text. E.g. "[0-9]+ | [A-Za-z_][A-Za-z_0-9]*" matches either an integer or a "C"-identifier.

A complex example

The programming language "C" defines a floating point constant as having the following parts: an integer part, a decimal point, a fraction, an exponential part beginning with e or E followed by an optional sign and digits, and an optional type suffix formed by one of the characters f, F, l, L. Either the integer part or the fractional part can be absent (but not both). Either the decimal point or the exponential part can be absent (but not both).

The corresponding regular expression is quite complex, but it can be simplified by using the following definitions:

$Int = "[0-9]+"
$Frac= "\.[0-9]+"
$Exp = "([Ee](\+|-)?[0-9]+)"

So we get the following expression for a floating point constant:

($Int? $Frac $Exp?|$Int \. $Exp?|$Int $Exp)[fFlL]?
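To check the expression against a few constants, the dictionary mechanism can be imitated by plain text substitution. The sketch below is an assumption-laden stand-in, not the article's implementation: expand and isCFloatConstant are made-up names, $Int is taken to be "[0-9]+", each $Name is replaced by its definition in parens, unescaped spaces are dropped, and the result is handed to std::regex.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: expands every $Name to "(definition)" and drops
// unescaped spaces. Throws std::out_of_range for an undefined name.
std::string expand(const std::string& expr,
                   const std::map<std::string, std::string>& defs) {
    std::string out;
    for (std::size_t i = 0; i < expr.size(); ++i) {
        const char c = expr[i];
        if (c == '\\' && i + 1 < expr.size()) {
            const char next = expr[++i];
            if (next == ' ') {
                out += ' ';      // "\ " stands for a literal space
            } else {
                out += '\\';     // other escapes are copied through unchanged
                out += next;
            }
        } else if (c == ' ') {
            // Unescaped spaces are only there for readability: skip them.
        } else if (c == '$') {
            std::size_t j = i + 1;
            while (j < expr.size() &&
                   std::isalnum(static_cast<unsigned char>(expr[j]))) {
                ++j;
            }
            const std::string name = expr.substr(i + 1, j - i - 1);
            if (name.empty()) {
                out += c;        // a lone '$' is copied through unchanged
            } else {
                out += '(' + expand(defs.at(name), defs) + ')';
            }
            i = j - 1;
        } else {
            out += c;
        }
    }
    return out;
}

// Hypothetical helper: true if `s` as a whole is a "C" floating point constant.
bool isCFloatConstant(const std::string& s) {
    static const std::map<std::string, std::string> defs = {
        {"Int", "[0-9]+"},
        {"Frac", R"(\.[0-9]+)"},
        {"Exp", R"(([Ee](\+|-)?[0-9]+))"},
    };
    static const std::regex pattern(expand(
        R"(($Int? $Frac $Exp?|$Int \. $Exp?|$Int $Exp)[fFlL]?)", defs));
    return std::regex_match(s, pattern);
}
```

Under these assumptions "1.5e10f", ".5", "1." and "1e5" are accepted, while "1" and "." are rejected, which agrees with the rule that the decimal point and the exponent cannot both be absent.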

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Example: Let's say that you have a rule named "$r01". If you search for the pattern "$r01 __FILE__", you will get an error, because the parser looks for the rule "$r01 " (including the trailing space), which does not exist.

Hi, am I right that you are interested in the algorithm, not the code?

If so, I can give you a detailed description in the next few days (the exact description is taken from a book on theoretical computer science). The basic theory behind the code is the transformation from a nondeterministic to a deterministic finite machine. A nondeterministic machine is easy to understand and implement, because it is nothing more than a set of nodes (in a graph) from which you go, at each node, to a neighbouring node reachable by reading the next character. A nondeterministic machine, as the name implies, does not give you a unique path: normally, at each step (reading the next character), you can decide to follow one of many possible branches. But a wrong decision can force you to backtrack, meaning that you must undo a decision and go back to a previous state. In the end this means that nondeterministic machines are inefficient (but many implementations use them). The good news is that the transformation is not difficult if you use sets (I used std::set). Furthermore, this implementation gains much of its speed by using a very simple and fast, but also space consuming, representation. More about the algorithm in the next few days. Regards Martin
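The transformation mentioned above is commonly done with the textbook subset construction, which can be sketched with std::set as follows. This is a generic illustration and not the code from RexAlgorithm.cpp. The types Nfa and Dfa, the functions determinize, accepts and exampleNfa, and the restriction to machines without empty moves are all assumptions made for the example.

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>

// Nondeterministic machine (no empty moves): one character may lead from a
// state to several successor states.
struct Nfa {
    int start = 0;
    std::set<int> accepting;
    std::map<std::pair<int, char>, std::set<int>> moves;
};

// Deterministic machine: at most one successor per (state, character).
// State 0 is the start state.
struct Dfa {
    std::set<int> accepting;
    std::map<std::pair<int, char>, int> moves;
};

// Subset construction: every DFA state stands for a set of NFA states.
Dfa determinize(const Nfa& nfa) {
    Dfa dfa;
    std::map<std::set<int>, int> ids;  // set of NFA states -> DFA state number
    std::queue<std::set<int>> pending;
    const std::set<int> startSet = {nfa.start};
    ids[startSet] = 0;
    pending.push(startSet);

    while (!pending.empty()) {
        const std::set<int> current = pending.front();
        pending.pop();
        const int currentId = ids[current];

        // A DFA state accepts if any of its NFA states accepts.
        for (int s : current) {
            if (nfa.accepting.count(s)) dfa.accepting.insert(currentId);
        }

        // Per character, collect the union of all successors.
        std::map<char, std::set<int>> successors;
        for (const auto& move : nfa.moves) {
            if (current.count(move.first.first)) {
                successors[move.first.second].insert(move.second.begin(),
                                                     move.second.end());
            }
        }
        for (const auto& entry : successors) {
            auto inserted =
                ids.emplace(entry.second, static_cast<int>(ids.size()));
            if (inserted.second) pending.push(entry.second);  // new DFA state
            dfa.moves[{currentId, entry.first}] = inserted.first->second;
        }
    }
    return dfa;
}

// Runs the DFA over the whole input: one table lookup per character and
// no backtracking.
bool accepts(const Dfa& dfa, const std::string& input) {
    int state = 0;
    for (char c : input) {
        const auto it = dfa.moves.find({state, c});
        if (it == dfa.moves.end()) return false;
        state = it->second;
    }
    return dfa.accepting.count(state) > 0;
}

// Example NFA for the language (a|b)*ab.
Nfa exampleNfa() {
    Nfa nfa;
    nfa.accepting = {2};
    nfa.moves[{0, 'a'}] = {0, 1};
    nfa.moves[{0, 'b'}] = {0};
    nfa.moves[{1, 'b'}] = {2};
    return nfa;
}
```

For the example machine, accepts(determinize(exampleNfa()), "aab") is true and the same call with "aba" is false. The std::map transition table is chosen here for brevity. A flat array indexed by state and character would be closer to the "fast but space consuming" representation described above.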

I just want to tell you about a problem I met on SUN Solaris with gcc 2.95. Spaces are skipped with the "isspace" function, which seems to also skip some other characters like "Ã" (\195). Maybe you should only compare to the ' ' character...
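A minimal sketch of the suggested fix follows, assuming the space skipping happens in a separate pass over the pattern (how the original code does it is not shown here, and the function name stripPatternSpaces is made up). One possible explanation for the observed behaviour is that plain char is signed on that platform, so the byte 195 becomes a negative value, and passing a negative value other than EOF to isspace is undefined. Comparing against ' ' directly avoids the issue. Casting to unsigned char before calling isspace would be the other common remedy.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: removes unescaped blanks from a pattern by comparing
// only against ' ', so bytes above 127 are never handed to isspace.
// An escaped space ("\ ") is kept as it is.
std::string stripPatternSpaces(const std::string& pattern) {
    std::string out;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (pattern[i] == '\\' && i + 1 < pattern.size()) {
            out += pattern[i];    // keep the backslash ...
            out += pattern[++i];  // ... and the escaped character
        } else if (pattern[i] != ' ') {
            out += pattern[i];
        }
    }
    return out;
}
```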

Regular expression parsers either interpret the search pattern at runtime or they compile the regular expression into an efficient internal form (known as deterministic finite automaton). The regular expression parser described here belongs to the second category

Does this mean I can change the string I've fed the regex and assume the expression will not have to be re-evaluated?

I have a buffer which is updated every second or so and I must strip HTML tags each time.

Cheers!

"An expert is someone who has made all the mistakes in his or her field" - Niels Bohr

I want to find strings like A123A or B123B, but not A123B, using the expression \([A-Z]\)[1-9]+\1 which uses so-called backreferences (works in TextPad, a real cool editor). I wonder if REXI_Search is able to process expressions like that?
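For comparison, this is how the same search looks with std::regex, whose default ECMAScript grammar supports backreferences (the group is written with plain parens there). It says nothing about what REXI_Search accepts. As general background, a backreference cannot be expressed by a deterministic finite automaton alone. The function name is invented for the example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` contains an upper case letter, one or
// more digits 1-9, and then the same letter again.
bool hasSameLetterAround(const std::string& text) {
    // \1 refers back to whatever the first group ([A-Z]) matched.
    static const std::regex pattern(R"(([A-Z])[1-9]+\1)");
    return std::regex_search(text, pattern);
}
```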

Hi Markus, there are different solutions to your matching problem. REXI_Search always searches for the longest possible match, not just for the first one found. In most cases this is the behaviour wanted, e.g. if you use a regular expression parser for scanning the tokens of a programming language, then this is just what you expect. To search for a C-like identifier you can use the expression [_A-Za-z][_A-Za-z0-9]* to match identifiers like totalTaxValue, nofDigits and _myHiddenVariable.

'*' is greedy by default. What you want is '.*?', which is the non-greedy version of '*'. That is, it matches as few characters as possible. Unfortunately, this RegEx processor does not appear to support non-greedy matching.
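The difference can be seen with std::regex, which is used here only as a stand-in that supports both forms. The third pattern in the usage note shows a common workaround that needs no non-greedy operator: an inverse character set, which this article's syntax also offers. The function name firstMatch is made up.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text of the first match of `patternText`
// in `text`, or an empty string if nothing matches.
std::string firstMatch(const std::string& text, const std::string& patternText) {
    const std::regex pattern(patternText);
    std::smatch m;
    return std::regex_search(text, m, pattern) ? m.str(0) : std::string();
}
```

Applied to "<b>bold</b>", the greedy pattern "<.*>" returns the whole string, the non-greedy "<.*?>" returns "<b>", and "<[^>]*>" also returns "<b>" without relying on non-greedy matching.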

I didn't compile it under VC++5, only under VC++6. Perhaps VC++5 has problems with templates, which are heavily used by this project and which are known to be badly supported by older C++ compilers. Regards Martin Holzherr

Is there any way to do it with this parser? Most RegExp parsers I've run into (like those in Perl, VI, grep, MSVS, etc.) let you match the end of a line with "$" and the start of a line with "^": (^hi) matches all lines that start with "hi".

-c

Ah, but a programmer's reach should exceed his grasp, or what are late nights for?

I tried your suggestion about putting a \n at the start of the line and matching ^ as \n, but this won't work because you strip whitespace from the regexp when you parse it: the \n gets stripped off.


This implementation does not appear to support anchors. The ^ symbol is used inside the brackets to exclude characters from a match. The $ symbol is used to refer to a predefined expression that can then be included in later expressions.

Has anyone gotten this code to work in a UNIX environment? Linux, Solaris, etc? I've been able to compile and run the code on Linux and Solaris, but it never seems to find any matches (the exact same code works fine on Windows).

Regex++, found at http://ourworld.compuserve.com/homepages/John_Maddock/regexpp.htm, is a C++ regex library with a liberal licence, supporting Unicode. For some users, it can be an alternative class. The author can take inspiration from it for the Unicode support of his own class.

For those mentioning PCRE, the correct URL is http://www.pcre.org/. Henry Spencer's renowned regex code can be found here: http://people.delphi.com/gjc/siod_regex.html (follow the links).

If you are doing extensive regular expression work, I suggest you get the RegExpp++ or boost library. If you cannot afford the size, get PCRE. Creating a C++ wrapper for PCRE gets you back to a large size, since you will need at least 3 classes: a main regular expression class, a matches class and an exception handling class to map the errors to C++ exceptions.

Your questions ( a) Is there a way to find out which alternative has been found? b) Which string is matched when there is a short and a long match possible?) can be answered positively, although there are some limitations.

To b): This regular expression parser always finds the longest string which matches the given regular expression.

To a): A regular expression consisting only of '$'-identifiers separated by '|' has an answer code, which allows the identification of the alternative found.

Unfortunately, one has to add 4 more source lines to the delivered 'REXI_Search' class to support this.

Furthermore, the member 'int REXI_Search::m_nIdAnswer;' has to be made public, or you have to provide a 'GetAnswer' member. Most important, use the REXI_Base member function

REXI_DefErr AddRegDefinition(string strName, string strRegExp, int nIdAnswer /* >0 */);

which allows a third parameter, the answer code.

Finally, let me clearly state the advantage of this regular expression parser compared to other freely available tools. It is fast, fast, fast. I'll bet you will not find another such tool which comes close to this one in terms of raw speed. But I must admit, the function REXI_Search::SetRegexp(<strexp>) is time consuming. Therefore, once again, use it if you want to apply it to huge files or a lot of files.
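The idea of an answer code, a positive number that tells which alternative was found, can be imitated with std::regex by giving each alternative its own capture group. This is only an analogy for readers without the REXI sources at hand. It does not use AddRegDefinition or m_nIdAnswer, and the function name matchedAlternative and the numbering are assumptions.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns 1 if the first match in `text` is an integer,
// 2 if it is a "C" identifier, and 0 if nothing matches.
int matchedAlternative(const std::string& text) {
    // Each alternative sits in its own capture group.
    static const std::regex pattern("([0-9]+)|([A-Za-z_][A-Za-z_0-9]*)");
    std::smatch m;
    if (!std::regex_search(text, m, pattern)) return 0;
    if (m[1].matched) return 1;  // "answer code" of the integer alternative
    if (m[2].matched) return 2;  // "answer code" of the identifier alternative
    return 0;
}
```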