Introduction

I think most people reach for Boost.Regex or PCRE when they want to use regular expressions in a C++ project. However, Microsoft has its own regular expression implementation as part of ATL Server, called CAtlRegExp. As a bonus, CAtlRegExp supports not only ASCII and Unicode, but also MBCS.

Supported Regular Expression Syntax

The following tables are copied from MSDN. Note that the syntax is not exactly the same as in Perl. For example, match groups (the parts of a match whose text you can retrieve) are written with {}, whereas Perl uses (), and there is no {n} quantifier (match exactly n times) as in Perl.

| Metacharacter | Meaning |
| --- | --- |
| `.` | Matches any single character. |
| `[ ]` | Indicates a character class. Matches any character inside the brackets (for example, `[abc]` matches "a", "b", and "c"). |
| `^` | If this metacharacter occurs at the start of a character class, it negates the character class. A negated character class matches any character except those inside the brackets (for example, `[^abc]` matches all characters except "a", "b", and "c"). If `^` is at the beginning of the regular expression, it matches the beginning of the input (for example, `^[abc]` will only match input that begins with "a", "b", or "c"). |
| `-` | In a character class, indicates a range of characters (for example, `[0-9]` matches any of the digits "0" through "9"). |
| `?` | Indicates that the preceding expression is optional: it matches once or not at all (for example, `[0-9][0-9]?` matches "2" and "12"). |
| `+` | Indicates that the preceding expression matches one or more times (for example, `[0-9]+` matches "1", "13", "666", and so on). |
| `*` | Indicates that the preceding expression matches zero or more times. |
| `??`, `+?`, `*?` | Non-greedy versions of `?`, `+`, and `*`. These match as little as possible, unlike the greedy versions, which match as much as possible. Example: given the input "<abc><def>", `<.*?>` matches "<abc>" while `<.*>` matches "<abc><def>". |
| `{ }` | Indicates a match group. The actual text in the input that matches the expression inside the braces can be retrieved through the CAtlREMatchContext object. |
| `\` | Escape character: interpret the next character literally (for example, `[0-9]+` matches one or more digits, but `[0-9]\+` matches a digit followed by a plus character). Also used for abbreviations (such as `\a` for any alphanumeric character; see table below). If `\` is followed by a number n, it matches the nth match group (starting from 0). Example: `<{.*?}>.*?</\0>` matches "<head>Contents</head>". Note that in C++ string literals, two backslashes must be used: `"\\+"`, `"\\a"`, `"<{.*?}>.*?</\\0>"`. |
| `$` | At the end of a regular expression, this character matches the end of the input. Example: `[0-9]$` matches a digit at the end of the input. |
| `!` | Negation operator: the expression following `!` does not match the input. Example: `a!b` matches "a" not followed by "b". |
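The greedy versus non-greedy row is easier to see in running code. CAtlRegExp itself needs atlrx.h, so the sketch below uses std::regex (ECMAScript syntax) purely as a stand-in; it is not CAtlRegExp, and FirstMatch is a hypothetical helper name. It only reproduces the `<.*?>` versus `<.*>` example from the table.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of `input` matched by
// `pattern`, or an empty string if nothing matches.
// NOTE: this uses std::regex as a stand-in engine, not CAtlRegExp.
std::string FirstMatch(const std::string& pattern, const std::string& input)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(input, m, re) ? m.str(0) : std::string();
}
```

With the input "<abc><def>", FirstMatch("<.*?>", ...) stops at the first closing bracket and returns "<abc>", while FirstMatch("<.*>", ...) runs on to the last one and returns "<abc><def>".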

CAtlRegExp can handle abbreviations, such as \d instead of [0-9]. The abbreviations are provided by the character traits class passed in the CharTraits parameter. The predefined character traits classes provide the following abbreviations:

| Abbreviation | Matches |
| --- | --- |
| `\a` | Any alphanumeric character: `([a-zA-Z0-9])` |
| `\b` | White space (blank): `([ \\t])` |
| `\c` | Any alphabetic character: `([a-zA-Z])` |
| `\d` | Any decimal digit: `([0-9])` |
| `\h` | Any hexadecimal digit: `([0-9a-fA-F])` |
| `\n` | Newline: `(\r\|(\r?\n))` |
| `\q` | A quoted string: `(\"[^\"]*\")\|(\'[^\']*\')` |
| `\w` | A simple word: `([a-zA-Z]+)` |
| `\z` | An integer: `([0-9]+)` |
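Because braces denote match groups and there is no {n} quantifier, a Perl-style pattern such as \d{4}-\d{2}-\d{2} has to be spelled out as \d\d\d\d-\d\d-\d\d before it is handed to CAtlRegExp. The following is a minimal sketch of that rewrite. ExpandCounts is a hypothetical helper, not part of ATL, and it assumes the input is Perl syntax in which braces only ever mean an exact count. It handles a count after a literal character, a backslash escape, or a [...] class, and does not attempt {n,m} ranges or counts after parenthesized groups.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ATL): rewrites Perl-style "atom{n}" as the
// atom repeated n times, so the result contains no count braces.
std::string ExpandCounts(const std::string& perl)
{
    std::string out;
    std::size_t i = 0;
    while (i < perl.size())
    {
        // Step 1: read one atom (escape, character class, or single character).
        const std::size_t start = i;
        if (perl[i] == '\\' && i + 1 < perl.size())
        {
            i += 2;                                    // escape such as \d
        }
        else if (perl[i] == '[')
        {
            ++i;
            if (i < perl.size() && perl[i] == '^') ++i;
            if (i < perl.size() && perl[i] == ']') ++i; // leading ']' is literal
            while (i < perl.size() && perl[i] != ']')
            {
                if (perl[i] == '\\' && i + 1 < perl.size()) ++i;
                ++i;
            }
            if (i < perl.size()) ++i;                  // consume closing ']'
        }
        else
        {
            ++i;                                       // ordinary character
        }
        const std::string atom = perl.substr(start, i - start);

        // Step 2: check whether the atom is followed by an exact count {n}.
        std::size_t j = i;
        std::size_t count = 0;
        bool hasCount = false;
        if (j < perl.size() && perl[j] == '{')
        {
            ++j;
            const std::size_t digitsStart = j;
            while (j < perl.size() && std::isdigit(static_cast<unsigned char>(perl[j])))
            {
                count = count * 10 + static_cast<std::size_t>(perl[j] - '0');
                ++j;
            }
            hasCount = j > digitsStart && j < perl.size() && perl[j] == '}';
        }

        // Step 3: emit the atom once, or n times if a count was found.
        if (hasCount)
        {
            for (std::size_t n = 0; n < count; ++n) out += atom;
            i = j + 1;                                 // skip past '}'
        }
        else
        {
            out += atom;
        }
    }
    return out;
}
```

For example, ExpandCounts("\\d{4}-\\d{2}-\\d{2}") returns the pattern \d\d\d\d-\d\d-\d\d. Do not run it over a pattern that already uses {} as CAtlRegExp match groups, since it cannot tell the two meanings of braces apart.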

Using the code

Although CAtlRegExp is part of the ATL Server classes, your project doesn't have to be an ATL project in order to use this class; adding #include "atlrx.h" is enough.

I have written a simple dialog-based program to test and demo CAtlRegExp. The core of the program is listed as follows:

Special note about MBCS

By default, CAtlRegExp uses CAtlRECharTraits, which is CAtlRECharTraitsA in non-Unicode builds. However, unless your text is strictly pure ASCII, you should use CAtlRECharTraitsMB; otherwise, you may encounter unexpected results with non-ASCII text. For example, the Chinese character 字 ("word") in Big5 encoding is the two-byte sequence 0xA6 0x72, whose second byte is the ASCII code for 'r', so a matcher that works byte by byte can mistake half of the character for the letter "r".
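The false match can be reproduced without any regular expression engine. The sketch below compares a byte-wise search with one that skips the trail byte of each double-byte character. It illustrates the problem only and does not show how CAtlRECharTraitsMB is implemented. The function names are hypothetical, and the lead-byte range 0x81 to 0xFE for Big5 (code page 950) is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>

// Assumption for this sketch: Big5 (code page 950) lead bytes are 0x81..0xFE.
inline bool IsBig5LeadByte(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

// Naive search: treats every byte as a separate character.
std::size_t FindByteWise(const std::string& text, char ch)
{
    return text.find(ch);
}

// Lead-byte-aware search for a single-byte character `ch`: the trail byte of
// every double-byte character is skipped, so it can never produce a hit.
std::size_t FindMbcsAware(const std::string& text, char ch)
{
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (IsBig5LeadByte(static_cast<unsigned char>(text[i])))
        {
            ++i;        // step over the trail byte as well
            continue;
        }
        if (text[i] == ch)
            return i;
    }
    return std::string::npos;
}
```

For the two-byte string "\xA6\x72", FindByteWise(text, 'r') reports a hit at offset 1, inside the Chinese character, while FindMbcsAware(text, 'r') returns std::string::npos.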

History

6th March 2006: Initial version uploaded.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Comments and Discussions

My interest in ATL was short-lived, as I noticed that Microsoft no longer includes the ATL Server library (except for a few data encoding/decoding classes) in VC++ 2008. Unfortunately, CAtlRegExp was not one of the few classes they kept.

Microsoft no longer maintains or ships ATL Server with VC++ and has released it as shared source on CodePlex.

I downloaded the code but found some unexpected bugs, as follows:
1. The regular expression "\d{4}-\d{2}-\d{2}" can't match the string "2006-12-30".
2. The regular expression "http://sports\.sina\.com\.cn/\w/\d{4}-\d{2}-\d{2}/\d+.shtml" can't match the string "http://sports.sina.com.cn/g/2007-01-04/05012672491.shtml".

I found out that the matches can only be retrieved if you use groups in your regular expression. For example, use {abc} instead of abc; otherwise the matches will not be displayed. By contrast, the boolean result of regex.Match() does not require groups.

I am afraid both of you are mistaken. Sam's simple approach really works well: not only group searches but also non-group searches work. Sam simply didn't display the non-group search results. His code only displays results if the regular expression is grouped.

A very annoying feature indeed. I am currently writing a regex find/replace dialog for end users with a very limited understanding of regular expressions. They are more likely to use simple patterns like the beginning or ending characters of a text, rather than entering complex group patterns.

This current behaviour makes it unfit for the application I am developing.

You are right. I can see this is how it works with CAtlRegExp. But on the one hand, this is not documented, and on the other hand, it is specific to CAtlRegExp. Other regex implementations, like POSIX, behave differently.

BTW, (123)|(abc) is then more appropriate than {123}|{abc} with CAtlRegExp.

CAtlRegExp treats the results as two groups, one of them being empty. We are looking for just one group, however, so this is not the way to make an OR decision, unless you are willing to accept one or more groups being empty.