Introduction

What is a regular expression? In a nutshell, regular expressions provide a simple way to transform raw data into something usable. In the preface of Mastering Regular Expressions (O'Reilly & Associates), Jeffrey Friedl writes:

"There's a good reason that regular expressions are found in so many diverse applications: they are extremely powerful. At a low level, a regular expression describes a chunk of text. You might use it to verify a user's input, or perhaps to sift through large amounts of data. On a higher level, regular expressions allow you to master your data. Control it. Put it to work for you. To master regular expressions is to master your data."

You may not know this, but regular expressions are built into the Microsoft Visual Studio text search tool, which gives you a very powerful way to search for complex patterns in your code (or in any text file, for that matter). Here are a few links on the web to help you get started with regular expressions if you've never used them before.

Getting Started

Regular expressions, while seemingly difficult to learn, are one of the most powerful tools in a programmer’s arsenal, yet many programmers never take advantage of them. You can certainly write your own text parsers that will get the job done, but doing it that way takes more time, is far more error-prone, and is nowhere near as fun (IMHO).

Regex++ is a regular expression library available from http://www.boost.org. Boost provides free peer-reviewed portable C++ source libraries. Take a look at the website to learn more. We are only concerned with Regex++ for our purposes, but you may find many of their libraries useful. The original Regex++ author's website is http://ourworld.compuserve.com/homepages/John_Maddock/

Installing Regex++

Note: The following instructions will only work if you have Visual Studio 6 or 7 installed.

To install Regex++, complete the following steps (detailed instructions are also available in the Regex++ download itself):

Download Regex++ from the original author's website. This way you will get only the regex library and not the entire Boost library.

Unzip to the directory C:\Regex++ ( Type the path C:\Regex++ into the Extract to: field as in the image below )

Open a command prompt

Change directory to C:\Regex++\libs\regex\build. In this directory you will find several make files. The one you are interested in is vc6.mak.

In order to use environment settings from Visual Studio, you must run the batch file vcvars32.bat. This should be in your path, so you shouldn't have to specify a full path to it. Just type vcvars32.bat into your command prompt window.

Type:

nmake -fvc6.mak

It will take a little while to build.

Type:

nmake -fvc6.mak install

(installs the libs and dlls in the appropriate places)

Type:

nmake -fvc6.mak clean

(You may get some errors with this one. I did, but you can just delete the intermediate files manually, if need be)

Now that your library is built and in place, it is ready to use. The project that I've included above is intended to demonstrate how you can parse HTML simply. All you need to do now is open the project and ensure that the project settings point to the appropriate Regex++ lib and include directories. But first, a short discussion.

Note: To add the Regex++ library to your project select Project | Settings.... In the ensuing dialog, select the C/C++ tab. In the Category drop down list, select Preprocessor. In the Additional include directories: edit box enter C:\Regex++. Now select the Link tab. In the Category drop down list, select Input. In the Additional library path: edit box enter C:\Regex++.

Parsing HTML

HTML parsers are nothing new, and there is really no reason (that I can think of, at least) to write your own, since the wheel has already been invented. That being said, the example we are going to use does just that: it parses HTML, because doing so makes a good pedagogical example. Specifically, it parses form elements in an HTML document. This is a fairly complex task, but regular expressions make it simple. We want our parser to be generic enough to parse what amounts to key/value pairs in any given input field. For instance, in the HTML:

<input type="text" name="address" size=30 maxlength= "100">

we would like to supply just the key name (e.g. type, name, size, etc.) and have the regex return that key's corresponding value (e.g. text, address, 30, etc.). Notice that some values have quotes and some don't. Some use white space and others don't. These are things we're going to have to account for in our regular expression. We also have to account for the parameters appearing in a different order. For instance this:

<input type="text" name="address" size=30 maxlength= "100">

is the same as this:

<input name="address" type="text" maxlength="100" size="30">

In the sample application I build a single string from the HTML input file (we'll read the whole file into a CString variable). While this may cause problems with very large files, for our purposes we'll assume that the file is fairly small. We'll need the whole string in order to match across line barriers, but more on that later.
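To make the "whole file in one string" idea concrete, here is a minimal sketch using only the standard library. It is not the sample project's code: it uses std::string instead of CString, and readWholeFile is a hypothetical helper name chosen for illustration.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the sample project): returns the full
// contents of the file at `path`, line breaks included.
std::string readWholeFile(const std::string& path) {
    assert(!path.empty());  // precondition: caller supplies a file name
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::ostringstream buffer;
    buffer << in.rdbuf();  // copy the entire stream into the buffer
    return buffer.str();
}
```

Because the line breaks stay in the returned string, a pattern applied to it can match a tag that is spread over several lines.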

ParseFile Method

In the ParseFile method we:

Pass in the filename of the HTML file to parse (it must contain a <FORM> and input elements such as INPUT, SELECT, or TEXTAREA, or you won't see any output)

Read the whole file into a string

Create a Regular Expression object ( RegEx )

Call Grep on the file string for the pattern we want and place the matches we found into an STL vector

Iterate through each item that was placed into the vector

Call GetActualType() which creates another regex to acquire which type we found (e.g. INPUT, TEXTAREA, SELECT)

Call GetValue() passing the key (e.g. type, name, etc.)

Generate and print out a string with the values we've acquired

Note: The code snippets in this article contain regular expressions that use escape characters. Because these are C/C++ strings, the escape characters have to be escaped twice. That is, the regex whitespace escape (\s) will actually look like this: \\s. An escaped quotation mark would look like this: \\\". The first two characters (\\) produce the backslash and the last two (\") produce the quotation mark.
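The double escaping is easy to check for yourself. The sketch below is only an illustration: it uses std::regex from the C++11 standard library rather than Regex++, and the names kEscapedPattern, kRawPattern, and hasInputTag are made up for this example. It also shows a C++11 raw string literal, which avoids the doubling but is not available in Visual Studio 6 or 7.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Ordinary literal: every regex backslash is doubled for the C++ compiler.
const std::string kEscapedPattern = "<\\s*input\\s+";
// Raw string literal (C++11): the regex appears exactly as it would print.
const std::string kRawPattern = R"(<\s*input\s+)";

// Hypothetical helper: true when `text` contains the start of an <input tag.
bool hasInputTag(const std::string& text) {
    assert(kEscapedPattern == kRawPattern);  // both literals hold the same characters
    return std::regex_search(text, std::regex(kEscapedPattern));
}
```

Both constants hold the same twelve characters, so either spelling produces the same regular expression.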

The expr object gets constructed with a pattern. I will break down the pattern as follows:

(<\s*                              // Match an opening "<" followed by zero or
                                   // more white space characters
(textarea|input|select)\s+[^>]+>   // 1. Match textarea, input, or select
                                   // 2. Then match one or more white space characters
                                   // 3. Then match one or more characters that
                                   //    are not ">" until we reach the closing ">"
[^<>]*                             // Match zero or more characters that are not
                                   // "<" or ">"
(</(select|textarea)>)?)           // Match an end tag: "</", then select or textarea,
                                   // then ">". The question mark means that everything
                                   // inside the parentheses is optional (0 or 1 occurrences).

Note: In the description above, the escape characters are not escaped twice. This is how the actual regular expression would look if you printed it out.

Just as a reminder, the regex operators above mean:

Character | Description | Usage
* | Match zero or more of the previous expression. | "\s*" -- zero or more white space chars
+ | Match one or more of the previous expression. | "\s+" -- one or more white space chars
[^] | Negation set. | "[^<]" -- match any char that is not a less than "<" char. Can be a list of characters to negate (e.g. [^<>/] -- match anything not a less than, a greater than, or a forward slash)

The Grep method takes a reference to the vector created above it. After the Grep call, the vector will contain all matches found. Using Grep() as opposed to Search() (which is another useful method), will allow you to match across line barriers. This is important for a file you read in--especially HTML files that allow for a fairly loose format. For instance this:

<input type="text" name="name">

is the same as this:

<input
type="text"
name="name">

in any web browser. We need to account for this. If you are wondering about case-sensitivity, look at the instantiation of the RegEx object. The second parameter is a boolean. This indicates whether you would like it to be case-insensitive--which we do in the example code.
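As a rough sketch of the "collect every match into a vector" step, the function below applies the pattern broken down earlier to a whole document. It is an assumption-laden stand-in, not the sample project's code: it uses std::regex and std::sregex_iterator from the standard library instead of RegEx::Grep, the std::regex::icase flag plays the role of the case-insensitivity boolean, and findFormElements is a hypothetical name.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every form element tag found in `html`.
std::vector<std::string> findFormElements(const std::string& html) {
    // Same pattern as in the breakdown above, compiled case-insensitively.
    static const std::regex pattern(
        R"((<\s*(textarea|input|select)\s+[^>]+>[^<>]*(</(select|textarea)>)?))",
        std::regex::icase);

    std::vector<std::string> matches;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(html.begin(), html.end(), pattern); it != end; ++it) {
        assert(!it->empty());  // a dereferenceable iterator refers to a successful match
        matches.push_back(it->str(0));  // str(0) is the whole matched text
    }
    return matches;
}
```

Negated sets such as [^>] also match line breaks, which is why a tag that is split across several lines is still found as a single match.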

If you would like further information about the boost Regex++ library API, take a look at:

GetActualType Method

In the GetActualType method we extract the type of input field we're dealing with on the current line. Remember that in the ParseFile method we made sure that there was at least one input type of some sort, so this line is pretty much guaranteed to have one. Here is the method implementation:

Here we are saying: look for an opening angle bracket "<" and possibly some white space. Then look for either "input", "textarea", or "select". Then there may be some more white space. Notice the two sets of parentheses around input|textarea|select. The inner set of parens tells us that this is a set of possible values. The pipe (|) (a.k.a. "or") tells us that a match could contain any one of the three values. The outer parens capture what we did find into a special variable. So, if you ran this HTML code through our parser:

<input type= "text" name="email" size="20">

exp[1] would now contain the word "input". If your line had other parens for capturing a part of the match, they would be placed in exp[n] where n is the current set of parens counted left to right, outside to inside.
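The capture-group numbering can be tried out with a small sketch. This is my own approximation of the pattern described above, written with std::regex rather than the Regex++ RegEx class, and the function name merely echoes the article's method; it is not the sample project's implementation.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Sketch only: returns the element name found in `element`
// ("input", "textarea", or "select" as written in the HTML),
// or an empty string when none is found.
std::string getActualType(const std::string& element) {
    // Outer parens are group 1, inner parens are group 2
    // (counted left to right, outside to inside).
    static const std::regex pattern(R"(<\s*((input|textarea|select))\s*)",
                                    std::regex::icase);
    std::smatch match;
    if (!std::regex_search(element, match, pattern)) {
        return "";
    }
    assert(match.size() == 3);  // whole match plus two capture groups
    return match[1].str();
}
```

Here match[0] holds the entire matched text, while match[1] and match[2] both hold the element name because the two sets of parens wrap the same alternatives.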

GetValue Method

In the GetValue method we pass in a key to look for and a pointer to the variable we want to populate with the value.

This is our most complex pattern yet. First we look for some possible whitespace, an equals sign, and some more possible whitespace. Then we look for an opening quote followed by a question mark. The question mark means 0 or 1 of the previous expression, so if the HTML didn't include an opening quote, we are accounting for that. That is, if the line looked like either of the following (notice the quotation marks), it would still find a match:

<input type="text" name="email">
<input type=text name=email>

Next we're looking for any character(s) except a quotation mark ("), an opening angle bracket (<), or a closing angle bracket (>). This is our value. Notice that there are parens around this value because we want to capture that value into our special variable exp[n]. Finally, we look for an optional closing quotation mark.
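Here is a hedged sketch of the same idea using std::regex. Note that it is a variation on the pattern described above, not the sample project's code: it handles quoted and unquoted values as two separate alternatives so that an unquoted value (such as size=30) stops at the next white space. It also assumes that the key is a plain word containing no regex metacharacters.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Sketch only: looks up `key` in `element` and stores its value in `*value`.
// Returns false when the key is not present.
bool getValue(const std::string& element, const std::string& key,
              std::string* value) {
    assert(value != nullptr);  // precondition: caller supplies the output variable
    // key, optional white space, "=", optional white space, then either
    // a quoted value (group 1) or an unquoted value (group 2).
    const std::regex pattern(
        "\\b" + key + R"re(\s*=\s*(?:"([^"<>]*)"|([^\s"<>]+)))re",
        std::regex::icase);
    std::smatch match;
    if (!std::regex_search(element, match, pattern)) {
        return false;
    }
    *value = match[1].matched ? match[1].str() : match[2].str();
    return true;
}
```

With the element <input type="text" name="address" size=30 maxlength= "100">, asking for size yields 30 and asking for maxlength yields 100, whether or not the value is quoted.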

This is the end of our need for regular expressions. We now have the value we were looking for and can format it and output it in the list box. What you do with the values is up to you, but now you have all you need to parse HTML accurately and effectively. The example code may need some tweaking, but in general it gets the job done.

Running The Example

The example application I've included parses an HTML file that contains a form. For convenience, I've included an HTML form file in the project. The filename is contact_form.html and it can be found in the root directory of the project. When you run the application, simply click the "Browse..." button and select this file. Then click "Try It!"

Conclusion

While we could have built our parser using strtok or other tokenizers, these are not completely ideal for HTML since HTML can be so free form (e.g. a space here, quotes there, but not there, line wrap, etc.). Regular expressions are perfectly suited for just this sort of text parsing.

Regex++ is a very robust regular expression library that you will find very useful in your applications. Take a look at the example project and familiarize yourself with regular expression syntax. This will give you the ability to create powerful text parsers with minimal coding and will enable you to "master your data".

About the Author

Matt Long is the Director of Technology for Skye Road Systems, Inc. in Colorado Springs, Colorado. He provides software architecture consulting services to small businesses. To contact Matt ( perlmunger ) send an email to matt@skyeroadsystems.com.