Introduction

I wanted to know how a regular expression parser works, so I searched online and found several good articles that describe how a regular expression finds a match. They are listed in the reference section of this article, and the parser presented here is based on that research. I will not go deep into the process and theory behind regular expressions, since the articles in the reference section cover them very well (the topic is huge and would take a book to explain thoroughly).

In this article, I will simply show an implementation of a simple regular expression parser (or Mini Regular Expression Parser). I will use the terms automata, NFA, DFA, minimum DFA, state, transition, and epsilon transition without defining them. If you do not understand these terms, I highly recommend reading some of the articles in the reference section before continuing.

So you ask, "why this one?" This implementation is done step-by-step, so it makes it easy for someone wanting to learn how regular expressions work. Other features:

Has a GUI that helps you understand the states and transitions

Use of ^ and $ tokens to specify match at the beginning and ending of the pattern respectively

A C# implementation, quite object-oriented

Easy to understand code with comments

Has a feature allowing you to control the greediness of the parser - allowing you to experience the different behavior of greediness.

Not limited to single-byte characters (0-255)

This implementation is more complete than most parsers I came across.

So I wanted to share the implementation with CodeProject users. BTW: This is my first submission to CodeProject, hope you like it.

Features

The table below shows the symbols the parser supports.

Symbol | Description | Example
--- | --- | ---
? | (Quantifier) Match the preceding character zero or one time. | "A?" finds zero or one A.
+ | (Quantifier) Match the preceding character one or more times. | "A+" finds one or more As.
* | (Quantifier) Match the preceding character zero or more times. | "A*" finds zero or more As.
_ (underscore) | Any single character. | "a_b" finds "abb", "aab", and "acb", but not "ab".
_* | Match any character zero or more times (wildcard). | "a_*p" finds "appleandp" in the string "appleandpotato".
[ ] | Any single character within the specified range ([a-f]) or set ([abcdef]). | "[C-P]arsen" finds "Carsen", "Larsen", "Karsen", and so on.
[^] | Any single character not within the specified range ([^a-f]) or set ([^abcdef]). | "de[^l]-*" finds strings beginning with "de" where the following letter is not l.

Some parsers use a dot (.) instead of an underscore (_) to denote any single character. If you want, you can change this in the source code by changing the value of the constant defined for it.

You can control the greediness of the parser by setting the RegEx.UseGreedy property. If you set this property to false, the pattern "a_*p" matches "ap" in the string "appleandpotato" rather than "appleandp".
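The difference between the two modes can be tried out with any regex engine that supports both. The following is only an illustration using Python's standard `re` module, not this parser: Python writes "any character" as `.` instead of `_`, and selects non-greedy matching per quantifier (`*?`) rather than through a property such as RegEx.UseGreedy.

```python
import re

text = "appleandpotato"

# Greedy: ".*" grabs as much as possible, so the match ends at the last "p".
greedy = re.search(r"a.*p", text).group(0)

# Non-greedy: ".*?" grabs as little as possible, so the match ends at the first "p".
lazy = re.search(r"a.*?p", text).group(0)

print(greedy)  # appleandp
print(lazy)    # ap
```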

The parser validates the input pattern string for correctness. If it encounters a syntax error, it reports the error accurately with detailed information (i.e., error position, length, and type of error).

Using the Code

The algorithm in this parser follows the lecture notes of Mr. Mike Clark. Unfortunately, he has taken the notes offline since I downloaded them, so I am making them available here.

I find these notes simple and very easy to understand, and I structured my code according to the steps they describe. If you find it difficult to understand what my code is doing (I hope you won't), please read the relevant note.

This parser is written in C# using Visual Studio 2005. Below is the partial class diagram with the key classes of the component.

The Set class is a simple representation of a mathematical set; since .NET does not come with a Set class, I had to write one. The Map class maps a key to one or more objects and is also not in .NET. The State class holds the data structure of the automaton. RegEx is the main class, which uses the other classes.

The RegExValidator class validates a pattern string using recursive descent parsing. Besides validating the pattern, it performs two other tasks: it inserts implicit tokens to make them explicit, and it expands character classes. For example:

"AB" -> "A.B" (inserting the concatenation operator (.))

"A.B" -> "A\.B" (inserting the escape)

"[A-C]" -> "(A|B|C)" (expanding the range)

"(AB" -> reports a mismatched parenthesis error

You will find a description of methods of the classes in the source code as comments.
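To make the preprocessing examples above concrete, here is a minimal sketch of each rewrite. It is written in Python for brevity and is not the RegExValidator code: all function names are hypothetical, each step is shown in isolation, `insert_concat` assumes its input contains no literal dots or escapes, and the parenthesis check is a plain stack rather than recursive descent.

```python
CONCAT = "."               # explicit concatenation operator, as in the examples above
OPERATORS = set("|*+?()")


def escape_dots(pattern: str) -> str:
    """Escape literal dots so they cannot be confused with CONCAT."""
    return pattern.replace(CONCAT, "\\" + CONCAT)


def insert_concat(pattern: str) -> str:
    """Insert CONCAT between adjacent operands (input must not contain dots)."""
    out = []
    for prev, cur in zip([None] + list(pattern), pattern):
        # An operand can end with a literal, ")" or a quantifier;
        # the next operand can start with a literal or "(".
        ends_operand = prev is not None and prev not in ("(", "|")
        starts_operand = cur == "(" or cur not in OPERATORS
        if ends_operand and starts_operand:
            out.append(CONCAT)
        out.append(cur)
    return "".join(out)


def expand_range(char_class: str) -> str:
    """Expand a simple class such as "[A-C]" or "[abc]" into an alternation."""
    body = char_class[1:-1]
    chars = []
    i = 0
    while i < len(body):
        # "x-y" is an inclusive range; anything else is a single member.
        if i + 2 < len(body) and body[i + 1] == "-":
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return "(" + "|".join(chars) + ")"


def first_unmatched_paren(pattern: str) -> int:
    """Return the index of the first unmatched parenthesis, or -1 if balanced."""
    open_positions = []
    for i, ch in enumerate(pattern):
        if ch == "(":
            open_positions.append(i)
        elif ch == ")":
            if not open_positions:
                return i
            open_positions.pop()
    return open_positions[0] if open_positions else -1


print(insert_concat("AB"))           # A.B
print(escape_dots("A.B"))            # A\.B
print(expand_range("[A-C]"))         # (A|B|C)
print(first_unmatched_paren("(AB"))  # 0
```

Making concatenation explicit means the later stages can treat it like any other binary operator, which is presumably why the validator does this before the NFA is built.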

Points of Interest

The NFA models for quantifiers *, + , and ? can be found in the articles I mentioned.

When I was implementing the parser, I had a lot of trouble with a couple of transitions:

_ (underscore): any single character

[^A]: complement of a character set

I did not find information regarding these transitions during my Googling.

After much trial and error, I came up with the NFA models that work fine. Using these models, you do not have to modify the original algorithm at all.

This "AnyChar" transition is handled in the RegEx.FindMatch method as a special case. If the current state does NOT have a transition over the current input symbol, the method checks whether the current state has a transition over the "AnyChar" symbol. If so, it uses that transition.

The complement of a character set uses an "AnyChar" transition and a "Dummy" transition. If the current state takes a transition that is forbidden (e.g., A in [^A]), it ends up in a non-accepting state whose only outgoing transition is the "Dummy" transition. A "Dummy" transition is NEVER used in the actual matching process, so the parser reaches a dead-end state, effectively resulting in a mismatch. If the current state does not have any transition over the current input symbol, it uses the "AnyChar" transition and ends up in an accepting state, effectively matching the correct string or substring.
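The two special transitions can be illustrated with a tiny hand-built automaton for [^A]. This is a sketch in Python, not the code from RegEx.FindMatch: the state numbers, the `ANY_CHAR` and `DUMMY` sentinels, and the function names are assumptions made for the example, and the target of the "Dummy" transition is arbitrary because it is never followed.

```python
ANY_CHAR = object()  # stands in for the "AnyChar" symbol
DUMMY = object()     # stands in for the "Dummy" symbol; never compared with input

# Automaton for [^A]:
#   state 0 --'A'------> state 1  (dead end: only a DUMMY transition, not accepting)
#   state 0 --ANY_CHAR-> state 2  (accepting)
TRANSITIONS = {
    0: {"A": 1, ANY_CHAR: 2},
    1: {DUMMY: 1},
    2: {},
}
ACCEPTING = {2}


def step(state, symbol):
    """Use the exact transition if there is one, otherwise fall back to ANY_CHAR."""
    edges = TRANSITIONS[state]
    if symbol in edges:
        return edges[symbol]
    return edges.get(ANY_CHAR)  # None means no move is possible


def matches(text):
    """Return True if the whole text is accepted by the automaton."""
    state = 0
    for symbol in text:
        state = step(state, symbol)
        if state is None:
            return False
    return state in ACCEPTING


print(matches("B"))  # True
print(matches("A"))  # False
```

Because the forbidden symbol has its own explicit transition, it is consumed before the "AnyChar" fallback is ever considered, which is what lets the complement work without changing the matching loop.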

The RegEx.FindMatch method also requires some discussion. Without the ^ (match at start), $ (match at end), _ (any single character), _* (wildcard: any number of any character), and greediness control features, this method would be much simpler.
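As a rough idea of what such a stripped-down method looks like, here is a minimal sketch of a match search over a DFA with none of those features. It is written in Python and is not the article's FindMatch: the function name, the dictionary-based transition table, and the hand-built example automaton are all assumptions, and patterns that can match the empty string are not handled.

```python
def find_match(transitions, accepting, text, start_state=0):
    """Return (index, length) of the first match in text, or None.

    transitions maps state -> {symbol: next_state}; accepting is a set of states.
    From each start position the longest accepted run wins.
    """
    for begin in range(len(text)):
        state = start_state
        best_end = None
        for pos in range(begin, len(text)):
            state = transitions[state].get(text[pos])
            if state is None:
                break  # no transition over this symbol: stop extending
            if state in accepting:
                best_end = pos + 1  # remember the longest match so far
        if best_end is not None:
            return begin, best_end - begin
    return None


# Hand-built DFA for the pattern "ab+".
AB_PLUS = {
    0: {"a": 1},
    1: {"b": 2},
    2: {"b": 2},
}

print(find_match(AB_PLUS, {2}, "xxabbbc"))  # (2, 4)
```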

Initially I resisted adding any extra logic to the FindMatch method and tried to support those features purely within the NFA model. That effort took me nowhere, and I ended up adding a little logic to this method. But now all these features work very well.

Final Word

This is a simple implementation. More features can be added to this demo code if you want. If you do add features, I would very much like to have a copy of that code, if that is OK with you. You can use the code any way you like under the CodeProject license (CPOL). I don't think it would be too difficult to port this demo code to C++.

History

August 1, 2008 -- Initial submission

February 24, 2009 -- Xawiersas found an error in the code (see below), so I fixed it and restructured the code to make it more readable

Comments and Discussions

To search with such a large number of patterns, you will just have a very long regular expression string.

If you are concerned about performance, then yes, it might take a reasonably long time. I don't know how long it would be, because I have never tried such a long regular expression. But once the "Minimum DFA" has been constructed, the search should be as efficient as it can be, because this is what a "Minimum DFA" gives us: the minimum number of states that can recognize the language. Further minimization of a "Minimum DFA" is not possible.

I was just curious: why would you need so many patterns? Also, if you try out such a long expression, please let me know how it performs.

I've tried to compile 1000, 2000, and 4000 patterns using automata union with a library called "dk.bricks". I waited nearly 440 min for 4000 patterns to compile. This library is written in J#. So I am trying to implement this feature using the library presented here. If you know this library and have used it, you can find a method there called "Automaton union(list automatonList)". So the question is: can you help me with editing this library, if you have some suggestions?

No, this is not what I meant. Please look through the dk.bricks library and pay attention to the Union method. In general it explains my problem, but the source code is in Java. So the library should contain that kind of method, but for me it's too difficult to implement the union of automata. It's not my source code. So if you have an opportunity to implement it, can you explain how to do it? There is one more example of a solution to this problem; the code is available here: http://www.codeproject.com/KB/recipes/rexsearch.aspx but it's written in MFC, and it's too difficult to port to C#, although that code fully implements my idea.

I looked through the examples of using NPEG, but I think we misunderstand each other. In my view this library is similar to all the source code I've seen. Do you have any links to implementations of the Aho-Corasick algorithm that work with regular expressions? So that I could give the automaton 5 regex patterns and, after one search, get the results.

I do not know who Aho-Corasick is, but if you are trying to find matches, nothing is going to be better than libraries that implement grammar rules (which have no limitations on what they can match).

((!'ab' .)* 'ab'+)*

which will yield all "ab" combinations in any text.
Set a breakpoint at: AstNode start = visitor.AST;

I provide 2 different versions...

Not sure what you're trying to achieve yet, besides implementing some algorithm (you haven't provided a use case). Since this provides a collection at visitor.AST, you could also use LINQ queries to join sets of matches from multiple visitors and query ....
Example: with LINQ you could join all n visitor matches and retrieve complete matches between char index 6 and 15.

Thank you for your thoughts.
Well, I would say the purposes of regular expressions and e/bnf are different. With a regular expression, we search for a specific pattern, while e/bnf is used to check syntax (or grammar). E/bnf is inherently more powerful than regular expressions (e.g., a regular expression cannot detect matching parentheses like "(()(()))", but e/bnf can).

Modern regular expressions come with unlimited back-reference capability, which makes them behave more like e/bnf. This kind of regular expression goes beyond the theoretical regular expression.

If you just want to search for a pattern in a large stream of input symbols, a regular expression would be a lot faster due to the reduced DFA.