.NET

Regular Expressions in C++

Source Code Accompanies This Article. Download It Now.

The author of the Boost regex library shows that is C++ as versatile for text processing as script-based languages like Awk and Perl.

You need to extract the three fields, look up the item in the specified table, and then format the specified field as a string. However, there are a couple of complications here: First, the three fields (table, item, and field) can occur in any order; and second, some of the fields may be omitted, in which case default values should be used. Listing Four is the complete code for such an application.

The key to understanding Listing Four is the regular expression at the start of the code. The expression uses a bounded repeat to match the three fields (table, item, and field). This allows up to two of those fields to be absent and acquire default values. Each time the expression repeats, one of the fields (regardless of which order they appear in) will be matched and then marked for future reference by its enclosing parentheses. Any absent fields will end up with their marked subexpression containing a Null string (so we will know that that field is absent).

The core of Listing Four is remarkably simple  the input HTML file is loaded into a string (a better choice would be a memory-mapped file, however, that would require platform-specific assumptions and I wanted to avoid such complications in an example program like this). Then, the loaded file is used as input to the algorithm regex_grep, which simply searches through the file for all matches of the regular expression. For each match, it fills in an instance of boost::match_results and passes that instance to either a callback function or a function object. The match_results class stores a set of iterators that indicate what matched each subexpression of the regular expression, so the callback function can easily extract the information needed to perform the database lookup. In Listing Four, I've simply created a dummy string from the data instead of performing an actual lookup  in real-world code, the lookup_datamerge_string procedure should be replaced by the actual database lookup code.

Of course, this isn't the only way of solving this particular problem. For example, server-side scripting via ASP or PHP pages is a popular choice. However, the method presented here does have the advantage of simplicity, and of completely separating web page design from programming  something that is important in many environments. There are also some simplifications in Listing Four; particularly the use of a global callback function, along with some global data. These can be eliminated completely by using a function object rather than a callback function as an argument to regex_grep. It is also possible to forward to a class member function using this technique (there are examples that do this in the the library documentation for regex_grep).

There's a lot more that this library can do, however, there isn't space to cover it all here. In this article, I've used std::string extensively to keep the examples as simple as possible, but underneath, the library is completely iterator based and will accept any bidirectional iterator as text input. There are also algorithms for Perl-like split operations that aren't covered here, along with narrow and wide character versions of the traditional POSIX C API functions.

Conclusion

This article shows some of the power that regular expressions in C++ can give you. This library does not seek to replace traditional regex tools such as lex. Rather, it provides a more convenient interface for rapid access to all kinds of pattern matching and text processing  something that has traditionally been limited to scripting languages. In addition, it provides a modern iterator-based implementation that allows it to work seamlessly with the C++ Standard Library, providing the versatility that C++ users have come to expect from modern libraries.

Acknowledgments

Thanks to Steve Cleary for his helpful comments while preparing this article, and also to all the library users who have contributed feedback to this project  without their help, this would never have come this far.

John is an independent programmer in England with an interest in generic programming. He can be contacted at john_maddock@compuserve.com.

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task.
However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Video

This month's Dr. Dobb's Journal

This month,
Dr. Dobb's Journal is devoted to mobile programming. We introduce you to Apple's new Swift programming language, discuss the perils of being the third-most-popular mobile platform, revisit SQLite on Android
, and much more!