C++ Regular Expressions with std::regex

The C++ standard library as defined in the C++11 standard provides support for regular expressions in the <regex> header. Prior to C++11, <regex> was part of the TR1 extension to the C++ standard library. When this website mentions std::regex, this refers to the Dinkumware implementation of the C++ standard library that is included with Visual C++ 2008 and later. It is also supported by C++Builder XE3 and later when targeting Win64. In Visual C++ 2008, the class lives in the std::tr1 namespace, so you refer to it as std::tr1::regex rather than std::regex.

C++Builder 10 and later support the Dinkumware implementation of std::regex when targeting Win32 if you disable the option to use the classic Borland compiler. When using the classic Borland compiler in C++Builder XE3 and later, you can use boost::regex instead of std::regex. While std::regex as defined in TR1 and C++11 defines pretty much the same operations and classes as boost::regex, there are a number of important differences in the actual regex flavor. Most importantly, the ECMAScript regex syntax in Boost adds a number of features borrowed from Perl that aren't part of the ECMAScript standard and that aren't implemented in the Dinkumware library.

Six Regular Expression Flavors

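Six different regular expression flavors or grammars are defined in std::regex_constants:

ECMAScript: the default grammar, similar to JavaScript

basic: POSIX basic regular expressions (BRE)

extended: POSIX extended regular expressions (ERE)

awk: the regular expression syntax of the POSIX awk utility

grep: the same as basic, except that a newline also acts as alternation between regexes

egrep: the same as extended, except that a newline also acts as alternation between regexes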

Most C++ references talk as if C++11 implements regular expressions as defined in the ECMA-262v3 and POSIX standards. But in reality the C++ implementation is very loosely based on these standards. The syntax is quite close. The only significant differences are that std::regex supports POSIX classes even in ECMAScript mode, and that it is a bit peculiar about which characters must be escaped (like curly braces and closing square brackets) and which must not be escaped (like letters).
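As a small illustration of POSIX classes being accepted in ECMAScript mode, here is a minimal sketch. The function name is_alpha_word() is hypothetical and only serves this example; the pattern is compiled with the default ECMAScript grammar.

```cpp
#include <regex>
#include <string>

// Hypothetical helper for illustration only.
// Returns true if the whole subject consists of one or more alphabetic
// characters. The POSIX class [[:alpha:]] is used inside a pattern that
// is compiled with the default (ECMAScript) grammar.
bool is_alpha_word(const std::string& subject) {
    static const std::regex re("[[:alpha:]]+");
    return std::regex_match(subject, re);
}
```

With this sketch, is_alpha_word("abc") should be true, while is_alpha_word("abc1") should be false because the digit is not matched by [[:alpha:]].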

But there are important differences in the actual behavior of this syntax. The caret and dollar always match at embedded line breaks in std::regex, while in JavaScript and POSIX this is an option. Backreferences to non-participating groups fail to match as in most regex flavors, while in JavaScript they find a zero-length match. In JavaScript, \d and \w are ASCII-only while \s matches all Unicode whitespace. This is odd, but all modern browsers follow the spec. In std::regex all the shorthands are ASCII-only when using strings of char. In Visual C++, but not in C++Builder, they support Unicode when using strings of wchar_t. The POSIX classes also match non-ASCII characters when using wchar_t in Visual C++, but do not consistently include all the Unicode characters that one would expect.

In practice, you'll mostly use the ECMAScript grammar. It's the default grammar and offers far more features than the other grammars. Whenever the tutorial on this website mentions std::regex without mentioning any grammars, what is written applies to the ECMAScript grammar and may or may not apply to any of the other grammars. You'll really only use the other grammars if you want to reuse existing regular expressions from old POSIX code or UNIX scripts.

Creating a Regular Expression Object

Before you can use a regular expression, you have to create an object of the template class std::basic_regex. You can easily do this with the std::regex instantiation of this template class if your subject is an array of char or an std::string object. Use the std::wregex instantiation if your subject is an array of wchar_t or an std::wstring object.

Pass your regex as a string as the first parameter to the constructor. If you want to use a regex flavor other than ECMAScript, pass the appropriate constant as a second parameter. You can "or" this constant with std::regex_constants::icase to make the regex case insensitive. You can also "or" it with std::regex_constants::nosubs to turn all capturing groups into non-capturing groups, which makes your regex more efficient if you only care about the overall regex match and don't want to extract text matched by any of the capturing groups.
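The constructor calls described above might look like the following sketch. The make_*() function names and the patterns are hypothetical and chosen only for illustration; the flag constants are the ones from std::regex_constants.

```cpp
#include <regex>
#include <string>

// Hypothetical helpers for illustration only.

// ECMAScript is the default grammar, so no second parameter is needed.
std::regex make_default_regex() {
    return std::regex("colou?r");
}

// "Or" the grammar constant with icase to make the regex case insensitive.
std::regex make_icase_regex() {
    return std::regex("colou?r",
                      std::regex_constants::ECMAScript |
                      std::regex_constants::icase);
}

// POSIX extended grammar; nosubs turns the capturing group into a
// non-capturing one because only the overall match is of interest.
std::regex make_posix_regex() {
    return std::regex("(ab|cd)+",
                      std::regex_constants::extended |
                      std::regex_constants::nosubs);
}

// Use std::wregex when the subject is made of wchar_t.
std::wregex make_wide_regex() {
    return std::wregex(L"colou?r");
}
```

The constructor throws std::regex_error if the pattern has a syntax error, so code that builds regexes from user input would normally wrap the construction in a try block.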

Finding a Regex Match

Call std::regex_search() with your subject string as the first parameter and the regex object as the second parameter to check whether your regex can match any part of the string. Call std::regex_match() with the same parameters if you want to check whether your regex can match the entire subject string. Since std::regex lacks anchors that exclusively match at the start and end of the string, you have to call regex_match() when using a regex to validate user input.
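The difference between the two calls can be sketched as follows. The function names contains_number() and is_number() are hypothetical; both use the same pattern and differ only in which std function they call.

```cpp
#include <regex>
#include <string>

// Hypothetical helpers for illustration only.

// True if any part of the subject consists of digits.
bool contains_number(const std::string& subject) {
    static const std::regex digits("[0-9]+");
    return std::regex_search(subject, digits);
}

// True only if the entire subject consists of digits, which is what
// you want when validating user input.
bool is_number(const std::string& subject) {
    static const std::regex digits("[0-9]+");
    return std::regex_match(subject, digits);
}
```

For the subject "abc 123 def", contains_number() should return true while is_number() should return false, because the regex matches part of the string but not all of it.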

Both regex_search() and regex_match() return just true or false. To get the part of the string matched by regex_search(), or to get the parts of the string matched by capturing groups when using either function, you need to pass an object of the template class std::match_results as the second parameter. The regex object then becomes the third parameter. Create this object using the default constructor of one of these four template instantiations:

std::cmatch when your subject is an array of char

std::smatch when your subject is an std::string object

std::wcmatch when your subject is an array of wchar_t

std::wsmatch when your subject is an std::wstring object

When the function call returns true, you can call the str(), position(), and length() member functions of the match_results object to get the text that was matched, its starting position relative to the subject string, and its length. Call these member functions without a parameter or with 0 as the parameter to get the overall regex match. Call them passing 1 or greater to get the match of a particular capturing group. The size() member function indicates the number of capturing groups plus one for the overall match. Thus you can pass a value up to size()-1 to the other three member functions.

Putting it all together, we can get the text matched by the first capturing group like this:
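A minimal sketch, assuming a subject of the form "Name: ..." and a hypothetical helper named first_group() that returns its findings in a hypothetical GroupMatch struct:

```cpp
#include <regex>
#include <string>

// Hypothetical result type for illustration only.
struct GroupMatch {
    bool found = false;
    std::string text;                           // text matched by group 1
    std::smatch::difference_type position = 0;  // start of group 1 in the subject
    std::smatch::difference_type length = 0;    // length of group 1
};

// Hypothetical helper: searches the subject and reports the first
// capturing group of the regex "Name: (.*)".
GroupMatch first_group(const std::string& subject) {
    GroupMatch result;
    try {
        std::regex re("Name: (.*)");
        std::smatch match;  // std::smatch because the subject is an std::string
        if (std::regex_search(subject, match, re) && match.size() > 1) {
            result.found = true;
            result.text = match.str(1);
            result.position = match.position(1);
            result.length = match.length(1);
        }
    } catch (const std::regex_error&) {
        // Syntax error in the regular expression; result.found stays false.
    }
    return result;
}
```

For the subject "Name: John Doe", this sketch should report the text "John Doe" at position 6 with length 8.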

Finding All Regex Matches

To find all regex matches in a string, you need to use an iterator. Construct an object of the template class std::regex_iterator using one of these four template instantiations:

std::cregex_iterator when your subject is an array of char

std::sregex_iterator when your subject is an std::string object

std::wcregex_iterator when your subject is an array of wchar_t

std::wsregex_iterator when your subject is an std::wstring object

Construct one object by calling the constructor with three parameters: a string iterator indicating the starting position of the search, a string iterator indicating the ending position of the search, and the regex object. If there are any matches to be found, the object will hold the first match when constructed. Construct another iterator object using the default constructor to get an end-of-sequence iterator. You can compare the first object to the second to determine whether there are any further matches. As long as the first object is not equal to the second, you can dereference the first object to get a match_results object.
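The iterator pattern described above can be sketched like this. The function name find_all() is hypothetical; it collects the overall match text of every match into a vector.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper for illustration only.
// Collects the text of every match of re in subject.
std::vector<std::string> find_all(const std::string& subject,
                                  const std::regex& re) {
    std::vector<std::string> matches;
    // Holds the first match, if any, as soon as it is constructed.
    std::sregex_iterator it(subject.begin(), subject.end(), re);
    // A default-constructed iterator is the end-of-sequence iterator.
    const std::sregex_iterator end;
    for (; it != end; ++it) {
        // Dereferencing yields a match_results object (std::smatch here).
        matches.push_back(it->str());
    }
    return matches;
}
```

Both the subject string and the regex object must outlive the iterator, which is why this sketch takes them by reference and finishes iterating before it returns.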

Replacing All Matches

To replace all matches in a string, call std::regex_replace() with your subject string as the first parameter, the regex object as the second parameter, and the string with the replacement text as the third parameter. The function returns a new string with the replacements applied.

The replacement string syntax is similar but not identical to that of JavaScript. The same replacement string syntax is used regardless of which regex syntax or grammar you are using. You can use $& or $0 to insert the whole regex match and $1 through $9 to insert the text matched by the first nine capturing groups. There is no way to insert the text matched by groups 10 or higher. $10 and higher are always replaced with nothing, and $9 and lower are replaced with nothing if there are fewer capturing groups in the regex than the requested number. $` (dollar backtick) is the part of the string to the left of the match, and $' (dollar quote) is the part of the string to the right of the match.
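As a sketch of the replacement syntax, the two hypothetical helpers below use $1 and $2 to swap the two sides of each key=value pair, and $& to wrap each whole match in square brackets. The function names and the pattern are assumptions made for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helpers for illustration only.

// Swaps the text around each '=' by reinserting the two capturing
// groups in reverse order.
std::string swap_pairs(const std::string& subject) {
    static const std::regex re("(\\w+)=(\\w+)");
    return std::regex_replace(subject, re, "$2=$1");
}

// Wraps every overall match in square brackets using $&.
std::string bracket_pairs(const std::string& subject) {
    static const std::regex re("(\\w+)=(\\w+)");
    return std::regex_replace(subject, re, "[$&]");
}
```

Text between matches is copied to the result unchanged, so swap_pairs("a=1, b=2") should yield "1=a, 2=b".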
