Abstract

This article is intended for anyone who needs to solve complex parsing tasks that require more than one regular expression.

Problem background: Superposed regexes

Often it is impossible to write a single regex that parses exactly what you need from a text. Even when a single regex can be written,
it may take a grotesque form that is hard to write and even harder to maintain. To solve such a parsing task, you have to write several regexes
that are applied in turn, each one either to the text or to the results of the previous regex. This brings us to the issues listed below.

Those who have written code for such parsing know that debugging superposed regex constructions is a tedious pastime. Although there are plenty of regex debugging
tools, they are not very handy for superposed regexes, because they can debug only one regex at a time. That means you have to intercept the captures of the previous regex
in order to debug the regex that is applied next. Since debugging has to be performed on many matches to gain confidence in a regex, this becomes a real headache.
The same problem arises again whenever the parser has to be updated because new peculiarities were found in the input text, which happens often enough.

Another problem is that the parsing code becomes unreadable and intricate because of the many regexes and the superposed operations on their results.
The code around the regexes has to conform to the logic they dictate, so in most cases you cannot change the regexes without changing the code, and vice versa.

As a result, you lose time and end up with obscure code that is a maintenance nightmare and can never be reused, because it is designed to do one very specific
thing that is bound to change at any moment.
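To make the problem concrete, here is a minimal Python sketch of hand-written superposed regexes. It is only an illustration: the sample text, the two patterns, and the `parse` function are assumptions invented for this sketch, not taken from the article's example files. Note how the nested loop has to mirror the nesting of the regexes, so changing either one usually means changing the other.

```python
import re

# Hypothetical sample text; its layout is an assumption made for illustration.
TEXT = """\
Company: Acme Ltd
Staff: John Smith, phone 111-2222; Ann Lee, phone 333-4444

Company: Globex Inc
Staff: Bob Ray, phone 555-6666
"""

# Level 1: one match per company.
COMPANY_RE = re.compile(r"Company: (?P<name>.+)\nStaff: (?P<staff>.+)")
# Level 2: applied only to the "staff" capture of a level 1 match.
EMPLOYEE_RE = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+), phone (?P<phone>[\d-]+)"
)


def parse(text):
    """Return a list of (company name, [(employee name, phone), ...])."""
    companies = []
    for company in COMPANY_RE.finditer(text):
        # The second regex runs on a capture of the first one.
        employees = [
            (m.group("name"), m.group("phone"))
            for m in EMPLOYEE_RE.finditer(company.group("staff"))
        ]
        companies.append((company.group("name"), employees))
    return companies


result = parse(TEXT)
```

Even at two levels, debugging `EMPLOYEE_RE` in an external regex tool requires first extracting the `staff` captures by hand.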

This article presents a solution that eliminates these problems.

Terminology

First, a little bit of terminology.

In this article, we will use the term regex tree. It means 'tree of regexes' (that is, a construction of superposed regexes), and so differs
from the 'tree of regex components' that is explained in other writings.

A regex tree file is an XML file of a predefined format in which a regex tree is stored.

A parsed data tree is the tree-like structure returned by Cliver.Parser as the result of parsing with a regex tree.

The essence of the solution

The general objectives of the solution are:

Simplify building and debugging regex trees with a GUI utility. This is done by a tool called RegexTreeer, which provides the following advantages:

modification and debugging of regex trees are visual and easy;

using regexes in a tree-like manner simplifies the regex syntax and makes complex parsing accessible to regex novices;

the regex tree is stored in an XML file of a predefined format, so it can be reopened and updated with RegexTreeer at any later time.

Keep the regexes and the parsing process separate from the code where the parsed data is used. This is done with the Cliver.Parser component exposed
by RegexTreeer, which provides the following advantages:

parsing results are formed as a tree-like structure where data can be addressed by an obvious path;

regexes stored in a regex tree file do not obscure your code, which stays simple and independent of the regexes;

the same regex tree file can be used by many parsers/applications.

RegexTreeer

Developing a parser with RegexTreeer involves the following general steps:

building a regex tree with RegexTreeer;

storing the regex tree in a regex tree file;

embedding Cliver.Parser in your code;

getting the parsed data as a tree-like structure.

Generally, after you have built a regex tree with RegexTreeer, there are three ways to use the regexes, as shown in the diagram:

We'll consider how to use Cliver.Parser, since it is intended as the main way of using RegexTreeer.

Cliver.Parser

Cliver.Parser is a .NET library that parses text using regex trees built with RegexTreeer. It can be linked to your code by adding a reference
to RegexTreeer.exe in your project.

The general objective of Cliver.Parser is to keep the regexes and the parsing process separate from the code where the parsed data is used.
That means the code no longer depends on the regexes, so they can be changed without requiring changes in the code, and vice versa. That's why,
although RegexTreeer remains a helpful tool without Cliver.Parser, the latter is what makes the parsing solution complete.
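The decoupling idea can be illustrated with a small Python sketch. It is not Cliver.Parser and does not use the regex tree file format: the JSON strings, the `extract_employees` function, and both sample texts are assumptions made up for this illustration. The point is that the consuming code refers only to group names, so the pattern can be replaced without touching the code.

```python
import json
import re

# Stand-ins for an external pattern file. JSON is used only to keep the sketch
# short; it is NOT the RegexTreeer regex tree file format.
PATTERNS_V1 = (
    '{"employee": "(?P<EmployeeName>[A-Za-z ]+), phone (?P<EmployeePhone>[0-9-]+)"}'
)
PATTERNS_V2 = (
    '{"employee": "(?P<EmployeePhone>[0-9-]+) [(](?P<EmployeeName>[A-Za-z ]+)[)]"}'
)

# Two hypothetical input layouts carrying the same data.
TEXT_V1 = "John Smith, phone 111-2222; Ann Lee, phone 333-4444"
TEXT_V2 = "111-2222 (John Smith); 333-4444 (Ann Lee)"


def extract_employees(text, patterns_json):
    """Consumer code: it knows the group names, never the regex itself."""
    pattern = re.compile(json.loads(patterns_json)["employee"])
    return [
        (m.group("EmployeeName").strip(), m.group("EmployeePhone"))
        for m in pattern.finditer(text)
    ]


employees_v1 = extract_employees(TEXT_V1, PATTERNS_V1)
employees_v2 = extract_employees(TEXT_V2, PATTERNS_V2)
```

When the input layout changes from `TEXT_V1` to `TEXT_V2`, only the stored pattern changes; `extract_employees` stays the same because the group names are the contract between the patterns and the code.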

Example

To better understand how it works, let's consider the following example. From the text below, we want to parse the company names, the addresses, and all the information for each
staff member (name, phones, mobile, email, etc.) as separate fields.

In order to obtain structured data, we'll have to apply several superposed regexes (i.e., a regex tree) to the text.

Regex tree

The required regex tree can be built using RegexTreeer. We won't review the RegexTreeer interface here because it is simple enough; a brief look at the RegexTreeer Help
is enough to learn how to build regex trees. Let's assume the regex tree has already been created and saved in a regex tree file named Companies.rgx.

Here is the RegexTreeer screenshot with the regex tree that was built for our example:

The regex tree is shown in the TreeView control in RegexTreeer's window. The regex engine used is .NET's, so refer to MSDN for the regex syntax.
Note an important point: the regex groups whose captures are the end data are named. You'll see below that we use these group names to reference the parsed data in our code.

The regex tree for our example has the following structure (you can see the regexes in the screenshot, or find them in the regex tree file in the code attached to this article):

As we can see from the regex tree diagram, the parsed data will be a tree of named values. This observation directs us to the next section.

Parsed data tree

The shape of the regex tree suggests that it would be convenient to obtain the parsing results as a tree-like structure, so that we can
manage the parsed data in our code in a clear and natural manner. Thus, while iterating through the array of companies, we would get a company's
name like Company[i].CompanyName, or an employee's phone like Company[i].Employee[j].EmployeePhone[k].

Cliver.Parser does something like this: it returns the parsed data as a tree of GroupCapture objects that corresponds to the regex tree applied to the text.
Each string captured by a named group in the regex tree is represented by its own GroupCapture object. Each GroupCapture contains its captured string
and also keeps references to the GroupCaptures produced by the next-level (i.e., child) regexes.

To clarify this, let's consider the parsed data tree that results from parsing our example text. It is represented below in JSON form:

As you can see, any parsed value can be accessed by a name path, like a certain employee's
phone: gc["Company"][0]["Employee"][0]["EmployeePhone"][1].Value.

It looks much better than if regexes were within the code, doesn’t it?

Using unnamed groups

Note that a parsed data tree contains only the captures of named groups; captures of unnamed groups are not included in the parsed data tree,
even though they participate in the parsing process. That means that if you leave certain groups unnamed, the captures of the next regex that
is applied to the captures of those unnamed groups are collected into one array.

In our example, leaving the groups of regex #1.1 unnamed means that all captures of regex #1.1.1 will be placed into one array, without distinguishing which capture
of group $1 of regex #1.1 they were parsed from. (Of course, the same can be said about regex #1.1.2 too.) We can do so because the captures of regex #1.1 are not end data
used in the code and because, as expected, regex #1.1 has only one match within each company. Thus, by leaving its groups unnamed, we only made the reference path to the data shorter by one name.
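The effect of unnamed groups can be sketched as follows. This is a hedged Python illustration of the behavior described above, not Cliver.Parser code: the tuple-based tree, the `apply` and `build_tree` helpers, the patterns, and the sample text are all assumptions. Children of an unnamed group are keyed here by the group's index, and their captures are attached directly to the nearest named ancestor.

```python
import re


def apply(node, text, out):
    """Apply node = (pattern, children) to text, collecting captures in out."""
    pattern, children = node
    regex = re.compile(pattern, re.MULTILINE)
    index_to_name = {index: name for name, index in regex.groupindex.items()}
    for match in regex.finditer(text):
        for index in range(1, regex.groups + 1):
            captured = match.group(index)
            if captured is None:
                continue
            name = index_to_name.get(index)
            if name is None:
                # Unnamed group: it gets no entry of its own, so captures of
                # its child regexes land directly in the parent's entry.
                target, key = out, index
            else:
                target, key = {"value": captured}, name
                out.setdefault(name, []).append(target)
            for child in children.get(key, []):
                apply(child, captured, target)
    return out


def build_tree(staff_group, key):
    """Hypothetical three-level tree; only the middle group differs."""
    employee = (r"(?P<Employee>[A-Z][A-Za-z ]+ [0-9-]+)", {})
    staff = (r"Staff: " + staff_group, {key: [employee]})
    return (r"(?P<Company>Company: .+(?:\nStaff: .+)*)", {"Company": [staff]})


TEXT = """\
Company: Acme Ltd
Staff: John Smith 111-2222; Ann Lee 333-4444
Staff: Bob Ray 555-6666
"""

named = apply(build_tree(r"(?P<Staff>.+)", "Staff"), TEXT, {})
unnamed = apply(build_tree(r"(.+)", 1), TEXT, {})

# Named middle group: the path goes through "Staff".
per_line = [len(s["Employee"]) for s in named["Company"][0]["Staff"]]
# Unnamed middle group: all employees sit in one array under the company.
merged = [e["value"] for e in unnamed["Company"][0]["Employee"]]
```

With the named group, the employees are split across two `Staff` entries; with the unnamed group, the same three employees end up in a single `Employee` array and the path is one name shorter.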

Tip: use simple regexes

Regular expressions are flexible and powerful enough that, in many cases, one regex can be written instead of two or more. However, such an effort usually
results in an unreadable, uneditable regex that is hypersensitive to deviations in the parsed text.

So do not try to use complex regexes; instead, use a tree of regexes that are as simple as possible. This approach gives the data hierarchy a clear logic and
saves development time. In most cases, it also brings the highest performance.

Conclusion

This article is only an outline of using RegexTreeer + Cliver.Parser embedded in .NET code. If you want to use this technology, refer to
RegexTreeer's Help, where you can find more detailed information.

The RegexTreeer install package can be downloaded from here. It contains samples including the considered
example (find Companies.txt and Companies.rgx there).

For those who are interested, the sources of RegexTreeer and Cliver.Parser can be found here.

Issues

The RegexTreeer Help needs to be edited for better English.

Also, the RegexTreeer GUI still has no professional-style icons.

The following issue concerns RegexTreeer only when building regexes for HTML pages, and it has not been solved yet: coordinating the
selection between the HTML code and its view in the web browser in the same way as Visual Studio's HTML editor does.
As a result, the selection coordination that RegexTreeer performs for web pages containing div tags may be incorrect. Does anybody know how to implement it?