Introduction

Regular expressions are a well-established way of describing string patterns. The following regular expression defines a floating point number with a (possibly empty) integer part, a non-empty fractional part and an optional exponent:

[0-9]* \.[0-9]+ ([Ee](\+|-)?[0-9]+)?

The rules for interpreting and constructing such regular expressions are explained below. A regular expression parser takes a regular expression and a source string as arguments and returns the source position of the first match. Regular expression parsers either interpret the search pattern at runtime or compile the regular expression into an efficient internal form (known as a deterministic finite automaton). The regular expression parser described here belongs to the second category. Besides being quite fast, it also supports dictionaries of regular expressions. With the definitions $Int= [0-9], $Frac= \.[0-9]+ and $Exp= ([Ee](\+|-)?[0-9]+), the above regular expression for a floating point number can be abbreviated to $Int* $Frac $Exp?.
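The dictionary idea can be illustrated with plain text substitution. The following is only a sketch under stated assumptions: ExpandDefs and StripPatternSpaces are hypothetical helpers that are not part of RexAlgorithm or RexInterface, and std::regex merely stands in as a matcher so that the expanded pattern can be tried out.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: replaces every $Name by "(" + definition + ")".
// Definitions may refer to other definitions, but must not be cyclic.
std::string ExpandDefs(const std::string& pattern,
                       const std::map<std::string, std::string>& defs) {
    std::string out;
    for (std::size_t i = 0; i < pattern.size();) {
        if (pattern[i] == '\\' && i + 1 < pattern.size()) {
            out += pattern[i];          // keep escape sequences such as \$ intact
            out += pattern[i + 1];
            i += 2;
        } else if (pattern[i] == '$') {
            std::size_t j = i + 1;      // scan the identifier after the $
            while (j < pattern.size() &&
                   (std::isalnum(static_cast<unsigned char>(pattern[j])) ||
                    pattern[j] == '_'))
                ++j;
            const auto it = defs.find(pattern.substr(i + 1, j - i - 1));
            if (it == defs.end())
                out += pattern.substr(i, j - i);   // unknown name: leave untouched
            else
                out += "(" + ExpandDefs(it->second, defs) + ")";
            i = j;
        } else {
            out += pattern[i++];
        }
    }
    return out;
}

// Hypothetical helper: drops unescaped spaces and turns "\ " into a plain
// space, so that the result can be handed to std::regex (ECMAScript syntax).
std::string StripPatternSpaces(const std::string& pattern) {
    std::string out;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (pattern[i] == '\\' && i + 1 < pattern.size()) {
            if (pattern[i + 1] == ' ') {
                out += ' ';
            } else {
                out += pattern[i];
                out += pattern[i + 1];
            }
            ++i;
        } else if (pattern[i] != ' ') {
            out += pattern[i];
        }
    }
    return out;
}
```

With the definitions Int, Frac and Exp from above, ExpandDefs("$Int* $Frac $Exp?", defs) yields ([0-9])* (\.[0-9]+) (([Ee](\+|-)?[0-9]+))?, and after StripPatternSpaces the result can be compiled with std::regex and matched against strings such as "3.14e-2".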

Interface

I separated the algorithmic issues from the interface issues. The files RexAlgorithm.h and RexAlgorithm.cpp implement the regular expression parser using only standard C++ (relying on STL), whereas the files RexInterface.h and RexInterface.cpp contain the interfaces for the end user. Currently there is only one interface, implemented in the class REXI_Search. Interfaces for replace functionality and for programming language scanners are planned for future releases.

Performance issues

A call to the member function REXI_Search::SetRegexp(strRegExp) involves quite a lot of computing. The regular expression strRegExp is analyzed and, after several steps, transformed into a compiled form. Because of this preprocessing work, which an interpreting regular expression parser does not need, this parser shows its efficiency only when you apply it to large input strings or search repeatedly for the same regular expression. A typical application that profits from the preprocessing is a utility that searches all files in a directory.
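The "compile once, search many times" usage pattern can be sketched as follows. This is an illustration only: std::regex stands in for the compiled form (its internals differ from the parser described here), and CountMatchingLines is a hypothetical helper name.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Counts the lines that contain at least one match of `pattern`.
// The pattern is compiled exactly once and then reused for every line.
std::size_t CountMatchingLines(const std::vector<std::string>& lines,
                               const std::string& pattern) {
    const std::regex compiled(pattern);            // expensive step, done once
    std::size_t hits = 0;
    for (const std::string& line : lines) {
        if (std::regex_search(line, compiled))     // cheaper step, repeated
            ++hits;
    }
    return hits;
}
```

A directory search utility would call such a function once per file, or keep the compiled object alive across files, so that the compilation cost is spread over all the input.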

Limitations

Currently Unicode is not supported. There is no fundamental reason for this limitation and I think that a later release will correct this. I simply have not yet found an efficient representation of a compiled regular expression which supports Unicode.

Constructing regular expressions

Regular expressions can be built from characters and special symbols. There are some similarities between regular expressions and arithmetic expressions. The most basic elements of arithmetic expressions are numbers and expressions enclosed in parens ( ). The most basic elements of regular expressions are characters, regular expressions enclosed in parens ( ) and character sets. On the next higher level, arithmetic expressions have '*' and '/' operators, whereas regular expressions have operators indicating the multiplicity of the preceding element.

Most basic elements of regular expressions

Individual characters. e.g. "h" is a regular expression. In the string "this home" it matches the beginning of 'home'. For non printable characters, one has to use either the notation \xhh where h means a hexadecimal digit or one of the escape sequences \n \r \t \v known from "C". Because the characters * + ? . | [ ] ( ) - $ ^ have a special meaning in regular expressions, escape sequences must also be used to specify these characters literally: \* \+ \? \. \| \[ \] \( \) \- \$ \^ . Furthermore, use '\ ' to indicate a space, because this implementation skips spaces in order to support a more readable style.

Character sets enclosed in square brackets [ ]. e.g. "[A-Za-z_$]" matches any alphabetic character, the underscore and the dollar sign, i.e. "B", "b", "_", "$" and so on (the dash (-) indicates a range). A ^ immediately following the [ of a character set means 'form the inverse character set'. e.g. "[^0-9A-Za-z]" matches non-alphanumeric characters.

Expressions enclosed in round parens ( ). Any regular expression can be used on the lowest level by enclosing it in round brackets.

The dot . It means 'match any character'.

An identifier prefixed by a $. It refers to an already defined regular expression. e.g. "$Ident" stands for a user defined regular expression previously defined. Think of it as a regular expression enclosed in round parens, which has a name.
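The character set notation described above (ranges and the leading ^ for inversion) can be sketched with a 256-entry bit set, one bit per single-byte character. This representation is an assumption chosen for illustration, not necessarily the compiled form used by RexAlgorithm; CharSet, ParseCharSet and Contains are hypothetical names, and backslash escapes inside the brackets are not handled.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

using CharSet = std::bitset<256>;   // one bit per single-byte character

// Parses the text between '[' and ']' (brackets excluded), e.g. "^0-9A-Za-z".
// Simplification: backslash escapes are not handled in this sketch.
CharSet ParseCharSet(const std::string& body) {
    CharSet set;
    std::size_t i = 0;
    const bool invert = !body.empty() && body[0] == '^';
    if (invert) i = 1;
    while (i < body.size()) {
        const unsigned char lo = static_cast<unsigned char>(body[i]);
        if (i + 2 < body.size() && body[i + 1] == '-') {
            // Range such as A-Z: set every character from lo to hi.
            const unsigned char hi = static_cast<unsigned char>(body[i + 2]);
            for (unsigned c = lo; c <= hi; ++c) set.set(c);
            i += 3;
        } else {
            set.set(lo);            // single character
            i += 1;
        }
    }
    return invert ? ~set : set;     // ^ means 'form the inverse character set'
}

bool Contains(const CharSet& set, char c) {
    return set.test(static_cast<unsigned char>(c));
}
```

For example, ParseCharSet("A-Za-z_$") contains 'B', 'b', '_' and '$', while ParseCharSet("^0-9A-Za-z") contains '!' but not 'a' or '5'.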

Operators indicating the multiplicity of the preceding element

Any of the above five basic regular expressions can be followed by one of the special characters * + ? \i

* meaning repetition (possibly zero times); e.g. "[0-9]*" not only matches "8" but also "87576" and even the empty string "".

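+ meaning at least one occurrence; e.g. "[0-9]+" matches "8", "9185278", but not the empty string.

? meaning at most one occurrence (the preceding element is optional); e.g. "[0-9]?" matches "8" and also the empty string.

\i meaning that the preceding element is matched in a case insensitive way; e.g. "(hallo)\i" matches "hallo" as well as "Hallo".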

Catenation of regular expressions

The regular expressions described above can be catenated to form longer regular expressions. E.g. "[_A-Za-z][_A-Za-z0-9]*" is a regular expression which matches any identifier of the programming language "C": the first character must be alphabetic or an underscore and the following characters must be alphanumeric or an underscore. "[0-9]*\.[0-9]+" describes a floating point number with an arbitrary number of digits before the decimal point and at least one digit following the decimal point. (The decimal point must be preceded by a backslash, otherwise the dot would mean 'accept any character at this place'.) "(Hallo (,\ how\ are\ you\?)?)\i" matches "Hallo" as well as "Hallo, how are you?" in a case insensitive way (the spaces inside the greeting are escaped because unescaped spaces are skipped).

Alternative regular expressions

Finally - on the top level - regular expressions can be separated by the | character. The two regular expressions on the left and right side of the | are alternatives, meaning that either the left expression or the right expression should match the source text. E.g. "[0-9]+ | [A-Za-z_][A-Za-z_0-9]*" matches either an integer or a "C"-identifier.
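As a small illustration of alternatives, the integer-or-identifier expression above can be tried out with std::regex, which uses the same | operator. This is a sketch with std::regex as a stand-in for the parser described here; TokenKind and Classify are hypothetical names, and the two alternatives are wrapped in parentheses only so that the code can tell which one matched.

```cpp
#include <regex>
#include <string>

enum class TokenKind { Integer, Identifier, Neither };

// Decides which alternative of "[0-9]+ | [A-Za-z_][A-Za-z_0-9]*" matches the
// whole text. Group 1 is the integer alternative, group 2 the identifier.
TokenKind Classify(const std::string& text) {
    static const std::regex pattern("([0-9]+)|([A-Za-z_][A-Za-z_0-9]*)");
    std::smatch m;
    if (!std::regex_match(text, m, pattern)) return TokenKind::Neither;
    return m[1].matched ? TokenKind::Integer : TokenKind::Identifier;
}
```

Classify("42") reports an integer, Classify("_tmp1") an identifier, and Classify("4x") neither, because no single alternative covers the whole string.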

A complex example

The programming language "C" defines a floating point constant as follows: it consists of an integer part, a decimal point, a fraction, an exponential part beginning with e or E followed by an optional sign and digits, and an optional type suffix formed by one of the characters f, F, l, L. Either the integer part or the fractional part can be absent (but not both). Either the decimal point or the exponential part can be absent (but not both).

The corresponding regular expression is quite complex, but it can be simplified by using the following definitions:

$Int = "[0-9]+"
$Frac= "\.[0-9]+"
$Exp = "([Ee](\+|-)?[0-9]+)"

So we get the following expression for a floating point constant:

($Int? $Frac $Exp?|$Int \. $Exp?|$Int $Exp)[fFlL]?
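The expression above can be checked against a few sample constants. The sketch below rebuilds it by string concatenation and uses std::regex as a stand-in matcher (spaces are left out because std::regex does not skip them); CFloatPattern and IsCFloatConstant are hypothetical helper names, not part of the REXI interface.

```cpp
#include <regex>
#include <string>

// Builds ($Int? $Frac $Exp?|$Int \. $Exp?|$Int $Exp)[fFlL]? by textual
// substitution of the three definitions, without the readability spaces.
std::string CFloatPattern() {
    const std::string Int  = "([0-9]+)";
    const std::string Frac = "(\\.[0-9]+)";
    const std::string Exp  = "([Ee](\\+|-)?[0-9]+)";
    return "(" + Int + "?" + Frac + Exp + "?|" +
           Int + "\\." + Exp + "?|" +
           Int + Exp + ")[fFlL]?";
}

// True if the whole text is a floating point constant according to the
// expression above.
bool IsCFloatConstant(const std::string& text) {
    static const std::regex pattern(CFloatPattern());   // compiled once
    return std::regex_match(text, pattern);
}
```

Under these assumptions "1.5", ".5e-3f", "1.", "1.e5" and "1e5L" are accepted, whereas "1", "." and "e5" are rejected, which agrees with the two "but not both" rules stated above.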

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Comments and Discussions

#1
I am trying to run this demo in VS2005, but I am getting a compile-time error.
Would it be possible for you to upgrade your source to VS2005? I would very much appreciate it.

#2
When using a named regex, do you actually replace the name in the pattern string with its definition? For example:
err= rexs.AddRegDef("$Def1","A|B");
err= rexs.AddRegDef("$Def2","C|D");
rexs.SetRegexp("$Def1|$Def2");
In this case, do you actually reconstruct the string "$Def1|$Def2" as "A|B|C|D"? Or how do you do this?

Answer to your questions:
#1) I will not update it to VS2005. The main reason is that there is a known error when concatenating regexps (internally). It has been too long since I last looked into the code, so a fix would take me too much time.
#2) Yes, "$Def1|$Def2" is equivalent to "(A|B)|(C|D)".
#3) It should match the input, but the decision which alternative to pick is based on set union and other set operations (quite complex).
===========
Bottom line: Efficient regular expression interpretation is based on complex set operations.
There is no simple, direct algorithmic approach to interpreting regular expressions which is also efficient. The only way to interpret a regular expression efficiently is the transformation from an NFA (nondeterministic finite automaton) to a DFA (deterministic finite automaton), as is done in this project (but with at least one serious error which has not been corrected).
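To give an idea of the set operations involved, here is a minimal sketch of the textbook subset construction, in which every DFA state stands for a set of NFA states. It is not the code of this project; all names (Nfa, Dfa, Determinize, Accepts, MakeSampleNfa) are hypothetical, and the sample automaton for (a|b)*ab is built by hand instead of being derived from a pattern string.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

const char kEpsilon = 0;                 // marks a move that consumes no input
using StateSet = std::set<int>;

struct Nfa {
    std::vector<std::vector<std::pair<char, int>>> edges;  // edges[s] = (symbol, target)
    int start = 0;
    StateSet accepting;
};

struct Dfa {                             // state 0 is the start state
    std::vector<std::map<char, int>> next;   // next[state][symbol] = target
    std::vector<bool> accepting;
};

// All NFA states reachable from `states` through epsilon moves only.
StateSet EpsilonClosure(const Nfa& nfa, StateSet states) {
    std::vector<int> work(states.begin(), states.end());
    while (!work.empty()) {
        const int s = work.back();
        work.pop_back();
        for (const auto& e : nfa.edges[s])
            if (e.first == kEpsilon && states.insert(e.second).second)
                work.push_back(e.second);
    }
    return states;
}

Dfa Determinize(const Nfa& nfa) {
    Dfa dfa;
    std::map<StateSet, int> ids;         // set of NFA states -> DFA state number
    std::queue<StateSet> pending;
    auto intern = [&](const StateSet& set) {
        const auto res = ids.emplace(set, static_cast<int>(ids.size()));
        if (res.second) {                // first time this set is seen
            bool acc = false;
            for (int s : set) acc = acc || nfa.accepting.count(s) > 0;
            dfa.next.emplace_back();
            dfa.accepting.push_back(acc);
            pending.push(set);
        }
        return res.first->second;
    };
    intern(EpsilonClosure(nfa, StateSet{nfa.start}));
    while (!pending.empty()) {
        const StateSet current = pending.front();
        pending.pop();
        std::map<char, StateSet> moves;  // union of targets per input symbol
        for (int s : current)
            for (const auto& e : nfa.edges[s])
                if (e.first != kEpsilon) moves[e.first].insert(e.second);
        for (const auto& m : moves) {
            const int to = intern(EpsilonClosure(nfa, m.second));
            dfa.next[ids.at(current)][m.first] = to;
        }
    }
    return dfa;
}

// One table lookup per input character; the DFA never backtracks.
bool Accepts(const Dfa& dfa, const std::string& input) {
    int state = 0;
    for (char c : input) {
        const auto it = dfa.next[state].find(c);
        if (it == dfa.next[state].end()) return false;
        state = it->second;
    }
    return dfa.accepting[state];
}

// Hand-built NFA for (a|b)*ab.
Nfa MakeSampleNfa() {
    Nfa nfa;
    nfa.edges.resize(4);
    nfa.edges[0].push_back({kEpsilon, 1});
    nfa.edges[1].push_back({'a', 1});    // loop on a and b ...
    nfa.edges[1].push_back({'b', 1});
    nfa.edges[1].push_back({'a', 2});    // ... or guess that the final "ab" starts
    nfa.edges[2].push_back({'b', 3});
    nfa.accepting.insert(3);
    return nfa;
}
```

Determinize(MakeSampleNfa()) produces four DFA states, and Accepts then decides inputs such as "ab" or "babab" with a single pass over the text.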

As to #2, so you actually reconstruct the pattern string and then use the reconstructed string to create the DFA, am I correct here?

Does this mean that when the following code is executed, it just stores the pattern as a key-value pair in memory for later use and does NOT actually construct the DFA right away? rexs.AddRegDef("$Def1","A|B");

Yes, I understand the fact that the text replacement will give me the same result.

But I am looking for a different solution than text replacement.

What about the idea of having a separate DFA model in memory for each (relatively) smaller pattern and being able to use that model from another pattern's model, thus creating a complex language recognizer? How does that work? Like EBNF or BNF.

For example (pseudo):

Rule1 = "A|B|C|D"
Rule2 = "<" Rule1+ ">"
Rule3 = Rule2 (Rule1)*

(...I think you got the idea)

Here, if I do text replacement, it could become a huge regular expression, essentially running into performance issues.

I would think that it should be possible to construct a DFA for each rule separately and use them during the match (somehow).

You wrote: if I do text replacement it could become a huge regular expression, essentially running into performance issues.
This is true for compilation performance. A more complex expression uses more compilation resources in order to translate it from an NFA to a DFA.
But runtime performance should still be very good because the generated DFA never backtracks.
But in any case, text replacement cannot support recursion and therefore cannot support grammar rules as in your example above. Furthermore, the theoretical model of a pure regexp excludes recursion.
Your proposal, which seems to support recursion (normal grammar rules),
would result in a parser framework where individual grammar rules would follow the regexp idea but where full recursion would be supported between rules. This is more or less what Perl 6 has realized. In Perl 6 it will be possible to write grammars as part of the language, and each grammar rule will allow the possibilities of a regular expression.
Furthermore, the same idea has been implemented in Boost as a library by Eric Niebler, called Xpressive.

Your idea above would result in a parser framework based on regular expressions.
As I wrote in my last answer, this is possible and has been done in Perl 6 and the Boost library Xpressive.
In my opinion a better approach to implement a parser framework which gives some regexp feeling are Parsing Expression Grammars.
The PEG framework supports lean, efficient and very powerful parsers, which can cope with any grammar. A PEG parser e.g. can parse C# or C++. The dominant parsing framework currently used is based on LALR grammars. LALR grammars can parse most of C# and C++ but not all. Writing a C# or C++ parser with an LALR parser requires break outs from the parser or modifications of the scanner in order to cope with grammar elements breaking the LALR framework.
A PEG parser on the other hand can peek forward (operators & and !) and is therefore very powerful. Any language grammar in use today can easily be covered by a PEG parser.
See also at (http://www.codeproject.com/KB/recipes/grammar_support_1.aspx)
or at (http://en.wikipedia.org/wiki/Parsing_expression_grammar)
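The lookahead mentioned above can be illustrated with a hand-written recognizer in PEG style. The grammar is a made-up example (Name <- !Keyword Ident, Keyword <- ("if" / "while") !IdentChar), and all function names are hypothetical; each function returns the position after the match or kFail.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

const std::size_t kFail = std::string::npos;

bool IsIdentChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Ident <- [A-Za-z_][A-Za-z_0-9]*
std::size_t MatchIdent(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return kFail;
    if (!std::isalpha(static_cast<unsigned char>(s[pos])) && s[pos] != '_')
        return kFail;
    ++pos;
    while (pos < s.size() && IsIdentChar(s[pos])) ++pos;
    return pos;
}

// Keyword <- ("if" / "while") !IdentChar
std::size_t MatchKeyword(const std::string& s, std::size_t pos) {
    static const std::string keywords[] = {"if", "while"};
    if (pos > s.size()) return kFail;
    for (const std::string& word : keywords) {
        if (s.compare(pos, word.size(), word) == 0) {
            const std::size_t end = pos + word.size();
            // !IdentChar: succeeds only if no identifier character follows;
            // the lookahead itself consumes no input.
            if (end >= s.size() || !IsIdentChar(s[end])) return end;
        }
    }
    return kFail;
}

// Name <- !Keyword Ident
std::size_t MatchName(const std::string& s, std::size_t pos) {
    if (MatchKeyword(s, pos) != kFail) return kFail;   // negative lookahead
    return MatchIdent(s, pos);
}
```

With this sketch "iffy" is accepted as a name, because the !IdentChar predicate stops "if" from being taken as a keyword, while "if" and "while(" are rejected.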