This section describes how to use xpressive to accomplish text manipulation
and parsing tasks. If you are looking for detailed information regarding specific
components in xpressive, check the Reference
section.

Introduction

What is xpressive?

xpressive is an object-oriented regular expression library. Regular expressions
(regexes) can be written as strings that are parsed dynamically at runtime
(dynamic regexes), or as expression templates that are parsed at compile-time
(static regexes). Dynamic regexes have the advantage that they can be accepted
from the user as input at runtime or read from an initialization file. Static
regexes have several advantages. Since they are C++ expressions instead of
strings, they can be syntax-checked at compile-time. Also, they can refer
to other regexes and to themselves, giving static regexes the power of context-free
grammars. Finally, since they are statically bound, the compiler can generate
faster code for static regexes.

xpressive's dual nature is unique and powerful. Static xpressive is a bit
like the Spirit Parser Framework.
Like Spirit, you can build
grammars with static regexes using expression templates. (Unlike Spirit,
xpressive does exhaustive backtracking, trying every possibility to find
a match for your pattern.) Dynamic xpressive is a bit like Boost.Regex.
In fact, xpressive's interface should be familiar to anyone who has used
Boost.Regex. xpressive's innovation
comes from allowing you to mix and match static and dynamic regexes in the
same program, and even in the same expression! You can embed a dynamic regex
in a static regex, and the dynamic regex will participate fully in the search,
back-tracking as needed to make the match succeed.

The first thing you'll notice about the code is that all the types in xpressive
live in the boost::xpressive namespace.

Note

Most of the rest of the examples in this document will leave off the using namespace boost::xpressive;
directive. Just pretend it's there.

Next, you'll notice the type of the regular expression object is sregex. If
you are familiar with Boost.Regex, this is different from what you are used
to. The "s" in "sregex" stands for "string", indicating that this regex can
be used to find patterns in std::string objects. I'll discuss this difference
and its implications in detail later.

Notice how the regex object is initialized:

sregex rex = sregex::compile( "(\\w+) (\\w+)!" );

To create a regular expression object from a string, you must call a factory
method such as basic_regex::compile().
This is another area in which xpressive differs from other object-oriented
regular expression libraries. Other libraries encourage you to think of a
regular expression as a kind of string on steroids. In xpressive, regular
expressions are not strings; they are little programs in a domain-specific
language. Strings are only one representation of that
language. Another representation is an expression template. For example,
the above line of code is equivalent to the following:

sregex rex = (s1= +_w) >> ' ' >> (s2= +_w) >> '!';

This describes the same regular expression, except it uses the domain-specific
embedded language defined by static xpressive.

As you can see, static regexes have a syntax that is noticeably different
from standard Perl syntax. That is because we are constrained by C++'s syntax.
The biggest difference is the use of >>
to mean "followed by". For instance, in Perl you can just put sub-expressions
next to each other:

abc

But in C++, there must be an operator separating sub-expressions:

a >> b >> c

In Perl, parentheses () have
special meaning. They group, but as a side-effect they also create back-references
like $1 and $2. In C++, there is no
way to overload parentheses to give them side-effects. To get the same effect,
we use the special s1, s2, etc. tokens. Assign to one to create
a back-reference (known as a sub-match in xpressive).

You'll also notice that the one-or-more repetition operator + has moved from
postfix to prefix position. That's because C++ doesn't have a postfix +
operator. So the Perl pattern \w+ is written +_w in static xpressive.

Installing xpressive

Getting xpressive

There are two ways to get xpressive. The first is by downloading xpressive.zip
at the Boost
File Vault in the "Strings - Text Processing" directory.
In addition to the source code and the Boost license, this archive contains
a copy of this documentation in PDF format.

The second way is through anonymous CVS via the boost project on SourceForge.net.
Just go to http://sf.net/projects/boost
and follow the instructions there for anonymous CVS access.

Building with xpressive

xpressive is a header-only template library, which means you don't need to
alter your build scripts or link to any separate lib file to use it. All
you need to do is #include <boost/xpressive/xpressive.hpp>.
If you are only using static regexes, you can improve compile times by only
including xpressive_static.hpp. Likewise,
you can include xpressive_dynamic.hpp if
you only plan on using dynamic regexes.

Requirements

xpressive depends on Boost. You can download the latest version of the Boost
libraries from http://boost.org. xpressive
requires Boost version 1.32 or higher.

Supported Compilers

Currently, Boost.Xpressive is known to work on the following compilers:

match_results<> contains the results of a regex_match() or regex_search()
operation. It acts like a vector of sub_match<> objects. A sub_match<>
object contains a marked sub-expression (also known as a back-reference
in Perl). It is basically just a pair of iterators representing the
begin and end of the marked sub-expression.

regex_match() checks to see if a string matches a regex. For regex_match()
to succeed, the whole string must match the regex, from beginning to end.
If you give regex_match() a match_results<>, it will write into it any
marked sub-expressions it finds.

regex_search() searches a string to find a sub-string that matches the
regex. It will try to find a match at every position in the string, starting
at the beginning, and stopping when it finds a match or when the string
is exhausted. As with regex_match(), if you give regex_search() a
match_results<>, it will write into it any marked sub-expressions it finds.

Given an input string, a regex, and a substitution string, regex_replace()
builds a new string by replacing those parts of the input string that
match the regex with the substitution string. The substitution string
can contain references to marked sub-expressions.

regex_token_iterator<> is like regex_iterator<>, except that dereferencing
a regex_token_iterator<> returns a string. By default, it will return the
whole sub-string that the regex matched, but it can be configured to return
any or all of the marked sub-expressions one at a time, or even the parts
of the string that didn't match the regex.

Now that you know a bit about the tools xpressive provides, you can pick
the right tool for you by answering the following two questions:

What iterator type will you use to traverse your data?

What do you want to do to your data?

Know Your Iterator Type

Most of the classes in xpressive are templates that are parameterized on
the iterator type. xpressive defines some common typedefs to make the job
of choosing the right types easier. You can use the table below to find the
right types based on the type of your iterator.

You should notice the systematic naming convention. Many of these types are
used together, so the naming convention helps you to use them consistently.
For instance, if you have a sregex,
you should also be using a smatch.

If you are not using one of those four iterator types, then you can use the
templates directly and specify your iterator type.

Know Your Task

Do you want to find a pattern once? Many times? Search and replace? xpressive
has tools for all that and more. Below is a quick reference:

Creating a Regex Object

When using xpressive, the first thing you'll do is create a basic_regex<>
object. This section goes over the nuts and bolts of building a regular expression
in the two dialects xpressive supports: static and dynamic.

Static Regexes

Overview

The feature that really sets xpressive apart from other C/C++ regular expression
libraries is the ability to author a regular expression using C++ expressions.
xpressive achieves this through operator overloading, using a technique
called expression templates to embed a mini-language
dedicated to pattern matching within C++. These "static regexes"
have many advantages over their string-based brethren. In particular, static
regexes:

are syntax-checked at compile-time; they will never fail at run-time
due to a syntax error.

can naturally refer to other C++ data and code, including other regexes,
making it possible to build grammars out of regular expressions and bind
user-defined actions that execute when parts of your regex match.

are statically bound for better inlining and optimization. Static regexes
require no state tables, virtual functions, byte-code or calls through
function pointers that cannot be resolved at compile time.

are not limited to searching for patterns in strings. You can declare
a static regex that finds patterns in an array of integers, for instance.

Since we compose static regexes using C++ expressions, we are constrained
by the rules for legal C++ expressions. Unfortunately, that means that
"classic" regular expression syntax cannot always be mapped cleanly
into C++. Rather, we map the regex constructs, picking
new syntax that is legal C++.

Construction and Assignment

You create a static regex by assigning one to an object of type basic_regex<>.
For instance, the following defines a regex that can be used to find patterns
in objects of type std::string:

sregex re = '$' >> +_d >> '.' >> _d >> _d;

Assignment works similarly.

Character and String Literals

In static regexes, character and string literals match themselves. For
instance, in the regex above, '$'
and '.' match the characters
'$' and '.'
respectively. Don't be confused by the fact that $ and
. are meta-characters in Perl. In xpressive, literals
always represent themselves.

When using literals in static regexes, you must take care that at least
one operand is not a literal. For instance, the following are not
valid regexes:

sregex re1 = 'a' >> 'b';         // ERROR!
sregex re2 = +'a';               // ERROR!

The two operands to the binary >>
operator are both literals, and the operand of the unary + operator is also a literal, so these statements
will call the native C++ binary right-shift and unary plus operators, respectively.
That's not what we want. To get operator overloading to kick in, at least
one operand must be a user-defined type. We can use xpressive's as_xpr()
helper function to "taint" an expression with regex-ness, forcing
operator overloading to find the correct operators. The two regexes above
should be written as:

sregex re1 = as_xpr('a') >> 'b'; // OK
sregex re2 = +as_xpr('a');       // OK

Sequencing and Alternation

As you've probably already noticed, sub-expressions in static regexes must
be separated by the sequencing operator, >>.
You can read this operator as "followed by".

// Match an 'a' followed by a digit
sregex re = 'a' >> _d;

Alternation works just as it does in Perl with the |
operator. You can read this operator as "or". For example:

// match a digit character or a word character one or more times
sregex re = +( _d | _w );

Grouping and Captures

In Perl, parentheses () have
special meaning. They group, but as a side-effect they also create back-references
like $1 and $2. In C++, parentheses
only group -- there is no way to give them side-effects. To get the same
effect, we use the special s1,
s2, etc. tokens. Assigning
to one creates a back-reference. You can then use the back-reference later
in your expression, like using \1 and \2
in Perl. For example, consider the following regex, which finds matching
HTML tags:

"<(\\w+)>.*?</\\1>"

In static xpressive, this would be:

'<' >> (s1= +_w) >> '>' >> -*_ >> "</" >> s1 >> '>'

Notice how you capture a back-reference by assigning to s1,
and then you use s1 later
in the pattern to find the matching end tag.

Tip

Grouping without capturing a back-reference

In xpressive, if you just want grouping without capturing a back-reference,
you can just use () without
s1. That is the equivalent
of Perl's (?:) non-capturing grouping construct.

Case-Insensitivity and Internationalization

Perl lets you make part of your regular expression case-insensitive by
using the (?i:) pattern modifier. xpressive also has
a case-insensitivity pattern modifier, called icase.
You can use it as follows:

sregex re = "this" >> icase( "that" );

In this regular expression, "this"
will be matched exactly, but "that"
will be matched irrespective of case.

Case-insensitive regular expressions raise the issue of internationalization:
how should case-insensitive character comparisons be evaluated? Also, many
character classes are locale-specific. Which characters are matched by
digit and which are matched
by alpha? The answer depends
on the std::locale object the regular expression
object is using. By default, all regular expression objects use the global
locale. You can override the default by using the imbue() pattern modifier, as follows:

This regular expression will evaluate alpha
and digit according to
my_locale. See the section
on Localization
and Regex Traits for more information about how to customize the
behavior of your regexes.

Static xpressive Syntax Cheat Sheet

The table below lists the familiar regex constructs and their equivalents
in static xpressive.

Perl syntax vs. Static xpressive syntax

Perl        Static xpressive                          Meaning
----        ----------------                          -------
.           _                                         any character (assuming Perl's /s modifier).
ab          a >> b                                    sequencing of a and b sub-expressions.
a|b         a | b                                     alternation of a and b sub-expressions.
(a)         (s1 = a)                                  group and capture a back-reference.
(?:a)       (a)                                       group and do not capture a back-reference.
\1          s1                                        a previously captured back-reference.
a*          *a                                        zero or more times, greedy.
a+          +a                                        one or more times, greedy.
a?          !a                                        zero or one time, greedy.
a{n,m}      repeat<n,m>(a)                            between n and m times, greedy.
a*?         -*a                                       zero or more times, non-greedy.
a+?         -+a                                       one or more times, non-greedy.
a??         -!a                                       zero or one time, non-greedy.
a{n,m}?     -repeat<n,m>(a)                           between n and m times, non-greedy.
^           bos                                       beginning of sequence assertion.
$           eos                                       end of sequence assertion.
\b          _b                                        word boundary assertion.
\B          ~_b                                       not word boundary assertion.
\n          _n                                        literal newline.
.           ~_n                                       any character except a literal newline (without Perl's /s modifier).
\r?\n|\r    _ln                                       logical newline.
[^\r\n]     ~_ln                                      any single character not a logical newline.
\w          _w                                        a word character, equivalent to set[alnum | '_'].
\W          ~_w                                       not a word character, equivalent to ~set[alnum | '_'].
\d          _d                                        a digit character.
\D          ~_d                                       not a digit character.
\s          _s                                        a space character.
\S          ~_s                                       not a space character.
[:alnum:]   alnum                                     an alpha-numeric character.
[:alpha:]   alpha                                     an alphabetic character.
[:blank:]   blank                                     a horizontal white-space character.
[:cntrl:]   cntrl                                     a control character.
[:digit:]   digit                                     a digit character.
[:graph:]   graph                                     a graphable character.
[:lower:]   lower                                     a lower-case character.
[:print:]   print                                     a printing character.
[:punct:]   punct                                     a punctuation character.
[:space:]   space                                     a white-space character.
[:upper:]   upper                                     an upper-case character.
[:xdigit:]  xdigit                                    a hexadecimal digit character.
[0-9]       range('0','9')                            characters in range '0' through '9'.
[abc]       as_xpr('a') | 'b' | 'c'                   characters 'a', 'b', or 'c'.
[abc]       (set='a','b','c')                         same as above.
[0-9abc]    set[range('0','9') | 'a' | 'b' | 'c']     characters 'a', 'b', 'c' or in range '0' through '9'.
[0-9abc]    set[range('0','9') | (set='a','b','c')]   same as above.
[^abc]      ~(set='a','b','c')                        not characters 'a', 'b', or 'c'.
(?i:stuff)  icase(stuff)                              match stuff disregarding case.
(?>stuff)   keep(stuff)                               independent sub-expression, match stuff and turn off backtracking.
(?=stuff)   before(stuff)                             positive look-ahead assertion, match if before stuff but don't include stuff in the match.
(?!stuff)   ~before(stuff)                            negative look-ahead assertion, match if not before stuff.
(?<=stuff)  after(stuff)                              positive look-behind assertion, match if after stuff but don't include stuff in the match. (stuff must be constant-width.)
(?<!stuff)  ~after(stuff)                             negative look-behind assertion, match if not after stuff. (stuff must be constant-width.)

Dynamic Regexes

Overview

Static regexes are dandy, but sometimes you need something a bit more ...
dynamic. Imagine you are developing a text editor with a regex search/replace
feature. You need to accept a regular expression from the end user as input
at run-time. There should be a way to parse a string into a regular expression.
That's what xpressive's dynamic regexes are for. They are built from the
same core components as their static counterparts, but they are late-bound
so you can specify them at run-time.

Construction and Assignment

There are two ways to create a dynamic regex: with the basic_regex::compile()
function or with the regex_compiler<>
class template. Use basic_regex::compile()
if you want the default locale, syntax and semantics. Use regex_compiler<>
if you need to specify a different locale, or if you need more control
over the regex syntax and semantics than the syntax_option_type
enumeration gives you. (Editor's note: in xpressive v1.0, regex_compiler<>
does not support customization of the dynamic regex syntax and semantics.
It will in v2.0.)

Dynamic xpressive Syntax

Since the dynamic syntax is not constrained by the rules for valid C++
expressions, we are free to use familiar syntax for dynamic regexes. For
this reason, the syntax used by xpressive for dynamic regexes follows the
lead set by John Maddock's proposal
to add regular expressions to the Standard Library. It is essentially the
syntax standardized by ECMAScript,
with minor changes in support of internationalization.

Since the syntax is documented exhaustively elsewhere, I will simply refer
you to the existing standards, rather than duplicate the specification
here.

Customizing Dynamic xpressive Syntax

xpressive v1.0 has limited support for the customization of dynamic regex
syntax. The only customization allowed is what can be specified via the
syntax_option_type
enumeration.

I have planned some future work in this area
for v2.0, however. xpressive's design allows for powerful mechanisms
to customize the dynamic regex syntax. First, since the concept of
"regex" is separated from the concept of "regex compiler",
it will be possible to offer multiple regex compilers, each of which
accepts a different syntax. Second, since xpressive allows you to
build grammars using static regexes, it should be possible to build
a dynamic regex parser out of static regexes! Then, new dynamic regex
grammars can be created by cloning an existing regex grammar and
modifying or disabling individual grammar rules to suit your needs.

Internationalization

As with static regexes, dynamic regexes support internationalization by
allowing you to specify a different std::locale.
To do this, you must use regex_compiler<>.
The regex_compiler<>
class has an imbue()
function. After you have imbued a regex_compiler<>
object with a custom std::locale,
all regex objects compiled by that regex_compiler<>
will use that locale. For example:

This regex will use my_locale
when evaluating the intrinsic character sets "\\w"
and "\\d".

Matching and Searching

Overview

Once you have created a regex object, you can use the regex_match()
and regex_search()
algorithms to find patterns in strings. This page covers the basics of regex
matching and searching. In all cases, if you are familiar with how regex_match()
and regex_search()
in the Boost.Regex library work, xpressive's
versions work the same way.

Seeing if a String Matches a Regex

The regex_match()
algorithm checks to see if a regex matches a given input.

Warning

The regex_match()
algorithm will only report success if the regex matches the whole
input, from beginning to end. If the regex matches only a part
of the input, regex_match()
will return false. If you want to search through the string looking for
sub-strings that the regex matches, use the regex_search()
algorithm.

The input can be a std::string, a C-style null-terminated string
or a pair of iterators. In all cases, the type of the iterator used to traverse
the input sequence must match the iterator type used to declare the regex
object. (You can use the table in the Quick
Start to find the correct regex type for your iterator.)

std::string str("hello");
sregex sre = bol >> +_w;

// match_not_bol means that "bol" should not match at [begin,begin)
if( regex_match( str.begin(), str.end(), sre, regex_constants::match_not_bol ) )
{
    // should never get here!!!
}

Click here
to see a complete example program that shows how to use regex_match().
And check the regex_match()
reference to see a complete list of the available overloads.

Searching for Matching Sub-Strings

Use regex_search()
when you want to know if an input sequence contains a sub-sequence that a
regex matches. regex_search()
will try to match the regex at the beginning of the input sequence and scan
forward in the sequence until it either finds a match or exhausts the sequence.

In all other regards, regex_search() behaves like regex_match() (see
above). In particular, it can operate on std::string,
C-style null-terminated strings or iterator ranges. The same care must be
taken to ensure that the iterator type of your regex matches the iterator
type of your input sequence. As with regex_match(),
you can optionally provide a match_results<>
struct to receive the results of the search, and a match_flag_type
bitmask to control how the match is evaluated.

Click here
to see a complete example program that shows how to use regex_search().
And check the regex_search()
reference to see a complete list of the available overloads.

Accessing Results

Overview

Sometimes, it is not enough to know simply whether a regex_match()
or regex_search()
was successful or not. If you pass an object of type match_results<>
to regex_match()
or regex_search(),
then after the algorithm has completed successfully the match_results<>
will contain extra information about which parts of the regex matched which
parts of the sequence. In Perl, these sub-sequences are called back-references,
and they are stored in the variables $1, $2,
etc. In xpressive, they are objects of type sub_match<>,
and they are stored in the match_results<>
structure, which acts as a vector of sub_match<>
objects.

match_results

So, you've passed a match_results<>
object to a regex algorithm, and the algorithm has succeeded. Now you want
to examine the results. Most of what you'll be doing with the match_results<>
object is indexing into it to access its internally stored sub_match<>
objects, but there are a few other things you can do with a match_results<>
object besides.

The table below shows how to access the information stored in a match_results<>
object named what.

match_results<> Accessors

Accessor

Effects

what.size()

Returns the number of sub-matches, which is always greater than zero after
a successful match because the full match is stored in the zero-th
sub-match.

what[n]

Returns the n-th sub-match.

what.length(n)

Returns the length of the n-th sub-match. Same as what[n].length().

what.position(n)

Returns the offset into the input sequence at which the n-th
sub-match begins.

what.str(n)

Returns a std::basic_string<> constructed from the n-th sub-match. Same as
what[n].str().

what.prefix()

Returns a sub_match<> object which represents the sub-sequence from the
beginning of the input sequence to the start of the full match.

what.suffix()

Returns a sub_match<> object which represents the sub-sequence from the end
of the full match to the end of the input sequence.

Since it inherits publicly from std::pair<>, sub_match<> has first and second
data members of type BidirectionalIterator. These are the beginning and end
of the sub-sequence this sub_match<> represents. sub_match<> also has a
Boolean matched data member, which is true if this sub_match<> participated
in the full match.

The following table shows how you might access the information stored in
a sub_match<>
object called sub.

sub_match<> Accessors

Accessor

Effects

sub.length()

Returns the length of the sub-match. Same as std::distance(sub.first, sub.second).

sub.str()

Returns a std::basic_string<> constructed from the sub-match. Same as
std::basic_string<char_type>(sub.first, sub.second).

sub.compare(str)

Performs a string comparison between the sub-match and str, where str can
be a std::basic_string<>, C-style null-terminated string, or another
sub-match. Same as sub.str().compare(str).

Results Invalidation

Results are stored as iterators into the input sequence. Anything which invalidates
the input sequence will invalidate the match results. For instance, if you
match a std::string object, the results are only valid
until your next call to a non-const member function of that std::string
object. After that, the results held by the match_results<>
object are invalid. Don't use them!

String Substitutions

Regular expressions are not only good for searching text; they're good at
manipulating it. And one of the most common text manipulation
tasks is search-and-replace. xpressive provides the regex_replace()
algorithm for searching and replacing.

regex_replace()

Performing search-and-replace using regex_replace()
is simple. All you need is an input sequence, a regex object, and a format
string. There are two versions of the regex_replace()
algorithm. The first accepts the input sequence as std::basic_string<> and returns the result in a new
std::basic_string<>.
The second accepts the input sequence as a pair of iterators, and writes
the result into an output iterator. Below are examples of each.

std::string input("This is his face");
sregex re = as_xpr("his");    // find all occurrences of "his" ...
std::string format("her");    // ... and replace them with "her"

// use the version of regex_replace() that operates on strings
std::string output = regex_replace( input, re, format );
std::cout << output << '\n';

// use the version of regex_replace() that operates on iterators
std::ostream_iterator< char > out_iter( std::cout );
regex_replace( out_iter, input.begin(), input.end(), re, format );

The above program prints out the following:

Ther is her face
Ther is her face

Notice that all the occurrences of "his"
have been replaced with "her".

Click here
to see a complete example program that shows how to use regex_replace().
And check the regex_replace()
reference to see a complete list of the available overloads.

The Format String

As with Perl, you can refer to sub-matches in the format string. The table
below shows the escape sequences xpressive recognizes in the format string.

Format Escape Sequences

Escape Sequence

Meaning

$1

the first sub-match

$2

the second sub-match (etc.)

$&

the full match

$`

the match prefix

$'

the match suffix

$$

a literal '$' character

Any other sequence beginning with '$'
simply represents itself. For example, if the format string were "$a" then "$a"
would be inserted into the output sequence.

Replace Options

The regex_replace()
algorithm takes an optional bitmask parameter to control the formatting.
The possible values of the bitmask are:

Format Flags

Flag

Meaning

format_first_only

Only replace the first match, not all of them.

format_no_copy

Don't copy the parts of the input sequence that didn't match the regex to
the output sequence.

format_literal

Treat the format string as a literal; that is, don't recognize any escape
sequences.

Overview

You initialize a regex_token_iterator<>
with an input sequence, a regex, and some optional configuration parameters.
The regex_token_iterator<>
will use regex_search()
to find the first place in the sequence that the regex matches. When dereferenced,
the regex_token_iterator<>
returns a token in the form of a std::basic_string<>. Which string it returns depends
on the configuration parameters. By default it returns a string corresponding
to the full match, but it could also return a string corresponding to a particular
marked sub-expression, or even the part of the sequence that didn't
match. When you increment the regex_token_iterator<>,
it will move to the next token. Which token is next depends on the configuration
parameters. It could simply be a different marked sub-expression in the current
match, or it could be part or all of the next match. Or it could be the part
that didn't match.

As you can see, regex_token_iterator<>
can do a lot. That makes it hard to describe, but some examples should make
it clear.

Example 1: Simple Tokenization

std::string input("This is his face");
sregex re = +_w;    // find a word

// iterate over all the words in the input
sregex_token_iterator begin( input.begin(), input.end(), re ), end;

// write all the words to std::cout
std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
std::copy( begin, end, out_iter );

This program displays the following:

This
is
his
face

Example 2: Simple Tokenization, Reloaded

This example also uses regex_token_iterator<>
to chop a sequence into a series of tokens consisting of words, but it uses
the regex as a delimiter. When we pass a -1 as the last parameter to the regex_token_iterator<>
constructor, it instructs the token iterator to consider as tokens those
parts of the input that didn't match the regex.

std::string input("This is his face");
sregex re = +_s;    // find white space

// iterate over all non-white space in the input. Note the -1 below:
sregex_token_iterator begin( input.begin(), input.end(), re, -1 ), end;

// write all the words to std::cout
std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
std::copy( begin, end, out_iter );

This program displays the following:

This
is
his
face

Example 3: Simple Tokenization, Revolutions

This example also uses regex_token_iterator<>
to chop a sequence containing a bunch of dates into a series of tokens consisting
of just the years. When we pass a positive integer N
as the last parameter to the regex_token_iterator<>
constructor, it instructs the token iterator to consider as tokens only the
N-th marked sub-expression of each
match.

std::string input("01/02/2003 blahblah 04/23/1999 blahblah 11/13/1981");
sregex re = sregex::compile("(\\d{2})/(\\d{2})/(\\d{4})");    // find a date

// iterate over all the years in the input. Note the 3 below, corresponding to the 3rd sub-expression:
sregex_token_iterator begin( input.begin(), input.end(), re, 3 ), end;

// write all the years to std::cout
std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
std::copy( begin, end, out_iter );

This program displays the following:

2003
1999
1981

Example 4: Not-So-Simple Tokenization

This example is like the previous one, except that instead of tokenizing
just the years, this program turns the days, months and years into tokens.
When we pass an array of integers {I,J,...}
as the last parameter to the regex_token_iterator<>
constructor, it instructs the token iterator to consider as tokens the I-th,
J-th, etc. marked sub-expression
of each match.

std::string input("01/02/2003 blahblah 04/23/1999 blahblah 11/13/1981");
sregex re = sregex::compile("(\\d{2})/(\\d{2})/(\\d{4})");    // find a date

// iterate over the days, months and years in the input
int const sub_matches[] = { 2, 1, 3 };    // day, month, year
sregex_token_iterator begin( input.begin(), input.end(), re, sub_matches ), end;

// write all the tokens to std::cout
std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
std::copy( begin, end, out_iter );

This program displays the following:

02
01
2003
23
04
1999
13
11
1981

The sub_matches array instructs
the regex_token_iterator<>
to first take the value of the 2nd sub-match, then the 1st sub-match, and
finally the 3rd. Incrementing the iterator again instructs it to use regex_search()
again to find the next match. At that point, the process repeats -- the token
iterator takes the value of the 2nd sub-match, then the 1st, et cetera.

Grammars and Nested Matches

Overview

One of the key benefits of representing regexes as C++ expressions is the
ability to easily refer to other C++ code and data from within the regex.
This enables programming idioms that are not possible with other regular
expression libraries. Of particular note is the ability for one regex to
refer to another regex, allowing you to build grammars out of regular expressions.
This section describes how to embed one regex in another by value and by
reference, how regex objects behave when they refer to other regexes, and
how to access the tree of results after a successful parse.

Embedding a Regex by Value

The basic_regex<>
object has value semantics. When a regex object appears on the right-hand
side in the definition of another regex, it is as if the regex were embedded
by value; that is, a copy of the nested regex is stored by the enclosing
regex. The inner regex is invoked by the outer regex during pattern matching.
The inner regex participates fully in the match, back-tracking as needed
to make the match succeed.

Consider a text editor that has a regex-find feature with a whole-word option.
You can implement this with xpressive as follows:

This line creates a new regex that embeds the old regex by value. Then, the
new regex is assigned back to the original regex. Since a copy of the old
regex was made on the right-hand side, this works as you might expect: the
new regex has the behavior of the old regex wrapped in begin- and end-word
assertions.

Note

Note that re = bow >> re >> eow does not define
a recursive regular expression, since regex objects embed by value by default.
The next section shows how to define a recursive regular expression by
embedding a regex by reference.

Embedding a Regex by Reference

If you want to be able to build recursive regular expressions and context-free
grammars, embedding a regex by value is not enough. You need to be able to
make your regular expressions self-referential. Most regular expression engines
don't give you that power, but xpressive does.

Tip

The theoretical computer scientists out there will correctly point out
that a self-referential regular expression is not "regular",
so in the strict sense, xpressive isn't really a regular
expression engine at all. But as Larry Wall once said, "the term
[regular expression]
has grown with the capabilities of our pattern matching engines, so I'm
not going to try to fight linguistic necessity here."

Consider the following code, which uses the by_ref() helper to define a recursive regular expression
that matches balanced, nested parentheses:

sregex parentheses;
parentheses                          // A balanced set of parentheses ...
    = '('                            // is an opening parenthesis ...
        >>                           // followed by ...
         *(                          // zero or more ...
            keep( +~(set='(',')') )  // of a bunch of things that are not parentheses ...
          |                          // or ...
            by_ref(parentheses)      // a balanced set of parentheses
          )                          //   (ooh, recursion!) ...
        >>                           // followed by ...
      ')'                            // a closing parenthesis
    ;

Matching balanced, nested tags is an important text processing task, and
it is one that "classic" regular expressions cannot do. The by_ref()
helper makes it possible. It allows one regex object to be embedded in another
by reference. Since the right-hand side holds parentheses by reference, assigning the
right-hand side back to parentheses
creates a cycle, which will execute recursively.

Building a Grammar

Once we allow self-reference in our regular expressions, the genie is out
of the bottle and all manner of fun things are possible. In particular, we
can now build grammars out of regular expressions. Let's have a look at the
text-book grammar example: the humble calculator.

The regex expression defined
above does something rather remarkable for a regular expression: it matches
mathematical expressions. For example, if the input string were "foo 9*(10+3) bar", this pattern
would match "9*(10+3)".
It only matches well-formed mathematical expressions, where the parentheses
are balanced and the infix operators have two arguments each. Don't try this
with just any regular expression engine!

Note

There is no way for a dynamic regex to refer to other regexes, so they
can only be used as terminals in a grammar. Use static regexes for non-terminal
grammar rules.

Let's take a closer look at this regular expression grammar. Notice that
it is cyclic: expression
is implemented in terms of term,
which is implemented in terms of factor,
which is implemented in terms of group,
which is implemented in terms of expression,
closing the loop. In general, the way to define a cyclic grammar is to forward-declare
the regex objects and embed by reference those regular expressions that have
not yet been initialized. In the above grammar, there is only one place where
we need to reference a regex object that has not yet been initialized: the
definition of group. In that
place, we use by_ref()
to embed expression by reference.
In all other places, it is sufficient to embed the other regex objects by
value, since they have already been initialized and their values will not
change.

Tip

Embed by value if possible

In general, prefer embedding regular expressions by value rather than by
reference. It involves one less indirection, making your patterns match
a little faster. Besides, value semantics are simpler and will make your
grammars easier to reason about. Don't worry about the expense of "copying"
a regex. Each regex object shares its implementation with all of its copies.

Cyclic Patterns, Copying and Memory Management, Oh My!

The calculator example above raises a number of very complicated memory-management
issues. The four regex objects all refer to one another, some directly
and some indirectly, some by value and some by reference. What if we were
to return one of them from a function and let the others go out of scope?
What becomes of the references? The answer is that the regex objects are
internally reference counted, such that they keep their referenced regex
objects alive as long as they need them. So passing a regex object by value
is never a problem, even if it refers to other regex objects that have gone
out of scope.

Those of you who have dealt with reference counting are probably familiar
with its Achilles Heel: cyclic references. If regex objects are reference
counted, what happens to cycles like the one created in the calculator example?
Are they leaked? The answer is no, they are not leaked. The basic_regex<>
object has some tricky reference tracking code that ensures that even cyclic
regex grammars are cleaned up when the last external reference goes away.
So don't worry about it. Create cyclic grammars, pass your regex objects
around and copy them all you want. It is fast and efficient and guaranteed
not to leak or result in dangling references.

Nested Regexes and Sub-Match Scoping

Nested regular expressions raise the issue of sub-match scoping. If both
the inner and outer regex write to and read from the same sub-match vector,
chaos would ensue. The inner regex would stomp on the sub-matches written
by the outer regex. For example, what does this do?

sregex inner = sregex::compile("(.)\\1");
sregex outer = (s1 = _) >> inner >> s1;

The author probably didn't intend for the inner regex to overwrite the sub-match
written by the outer regex. The problem is particularly acute when the inner
regex is accepted from the user as input. The author has no way of knowing
whether the inner regex will stomp the sub-match vector or not. This is clearly
not acceptable.

Instead, what actually happens is that each invocation of a nested regex
gets its own scope. Sub-matches belong to that scope. That is, each nested
regex invocation gets its own copy of the sub-match vector to play with,
so there is no way for an inner regex to stomp on the sub-matches of an outer
regex. So, for example, the regex outer
defined above would match "ABBA",
as it should.

Nested Results

If nested regexes have their own sub-matches, there should be a way to access
them after a successful match. In fact, there is. After a regex_match()
or regex_search(),
the match_results<>
struct behaves like the head of a tree of nested results. The match_results<>
class provides a nested_results() member function that returns an ordered
sequence of match_results<>
structures, representing the results of the nested regexes. The order of
the nested results is the same as the order in which the nested regex objects
matched.

Take as an example the regex for balanced, nested parentheses we saw earlier:

Filtering Nested Results

Sometimes a regex will have several nested regex objects, and you want to
know which result corresponds to which regex object. That's where basic_regex<>::regex_id()
and match_results<>::regex_id()
come in handy. When iterating over the nested results, you can compare the
regex id from the results to the id of the regex object you're interested
in.

To make this a bit easier, xpressive provides a predicate to make it simple
to iterate over just the results that correspond to a certain nested regex.
It is called regex_id_filter_predicate,
and it is intended to be used with Boost.Iterator.
You can use it as follows:

sregex name = +alpha;
sregex integer = +_d;
sregex re = *( *_s >> ( name | integer ) );

smatch what;
std::string str("marsha 123 jan 456 cindy 789");

if (regex_match(str, what, re))
{
    smatch::nested_results_type::const_iterator begin = what.nested_results().begin();
    smatch::nested_results_type::const_iterator end   = what.nested_results().end();

    // declare filter predicates to select just the names or the integers
    sregex_id_filter_predicate name_id(name.regex_id());
    sregex_id_filter_predicate integer_id(integer.regex_id());

    // iterate over only the results from the name regex
    std::for_each(
        boost::make_filter_iterator(name_id, begin, end),
        boost::make_filter_iterator(name_id, end, end),
        output_result);

    std::cout << '\n';

    // iterate over only the results from the integer regex
    std::for_each(
        boost::make_filter_iterator(integer_id, begin, end),
        boost::make_filter_iterator(integer_id, end, end),
        output_result);
}

where output_result is a
simple function that takes a smatch
and displays the full match. Notice how we use the regex_id_filter_predicate
together with basic_regex<>::regex_id() and boost::make_filter_iterator() from Boost.Iterator
to select only those results corresponding to a particular nested regex.
This program displays the following:

marsha
jan
cindy
123
456
789

Localization and Regex Traits

Overview

Matching a regular expression against a string often requires locale-dependent
information. For example, how are case-insensitive comparisons performed?
The locale-sensitive behavior is captured in a traits class. xpressive provides
three traits class templates: cpp_regex_traits<>, c_regex_traits<> and null_regex_traits<>. The first wraps a std::locale,
the second wraps the global C locale, and the third is a stub traits type
for use when searching non-character data. All traits templates conform to
the Regex
Traits Concept.

Setting the Default Regex Trait

By default, xpressive uses cpp_regex_traits<> for all patterns. This causes all
regex objects to use the global std::locale.
If you compile with BOOST_XPRESSIVE_USE_C_TRAITS
defined, then xpressive will use c_regex_traits<> by default.

Using Custom Traits with Dynamic Regexes

To create a dynamic regex that uses a custom traits object, you must use
regex_compiler<>.
The basic steps are shown in the following example:

For a static regex, the traits are supplied with the imbue()
pattern modifier, which must wrap the entire pattern. It is an error to imbue only part of a static regex. For
example:

// ERROR! Cannot imbue() only part of a regex
sregex error = _w >> imbue(loc)(_w);

Searching Non-Character Data With null_regex_traits

With xpressive static regexes, you are not limited to searching for patterns
in character sequences. You can search for patterns in raw bytes, integers,
or anything that conforms to the Char
Concept. The null_regex_traits<> makes it simple. It is a stub implementation
of the Regex
Traits Concept. It recognizes no character classes and performs no case-insensitive
mappings.

For example, with null_regex_traits<>, you can write a static regex to
find a pattern in a sequence of integers as follows:

// some integral data to search
int const data[] = { 0, 1, 2, 3, 4, 5, 6 };

// create a null_regex_traits<> object for searching integers ...
null_regex_traits<int> nul;

// imbue a regex object with the null_regex_traits ...
basic_regex<int const *> rex = imbue(nul)(1 >> +((set = 2, 3) | 4) >> 5);
match_results<int const *> what;

// search for the pattern in the array of integers ...
regex_search(data, data + 7, what, rex);

assert(what[0].matched);
assert(*what[0].first == 1);
assert(*what[0].second == 6);

Tips 'N Tricks

Squeeze the most performance out of xpressive with these tips and tricks.

Use Static Regexes

On average, static regexes execute about 10 to 15% faster than their dynamic
counterparts. It's worth familiarizing yourself with the static regex dialect.

Prefer Algorithms That Take A match_results<> Object

A match_results<> object caches memory between uses. If you are doing multiple searches,
you should prefer the regex algorithms that accept a match_results<>
object over the ones that don't, and you should reuse the same match_results<>
object each time. If you don't provide a match_results<>
object, a temporary one will be created for you and discarded when the algorithm
returns. Any memory cached in the object will be deallocated and will have
to be reallocated the next time.

Prefer Algorithms That Accept Iterator Ranges Over Null-Terminated Strings

xpressive provides overloads of the regex_match()
and regex_search()
algorithms that operate on C-style null-terminated strings. You should prefer
the overloads that take iterator ranges. When you pass a null-terminated
string to a regex algorithm, the end iterator is calculated immediately by
calling strlen. If you already
know the length of the string, you can avoid this overhead by calling the
regex algorithms with a [begin, end) pair.

Compile Patterns Once And Reuse Them

Compiling a regex (dynamic or static) is more expensive than executing a
match or search. If you have the option, prefer to compile a pattern into
a basic_regex<>
object once and reuse it rather than recreating it over and over.

Understand syntax_option_type::optimize

The optimize flag tells the
regex compiler to spend some extra time analyzing the pattern. It can cause
some patterns to execute faster, but it increases the time to compile the
pattern, and often increases the amount of memory consumed by the pattern.
If you plan to reuse your pattern, optimize
is usually a win. If you will only use the pattern once, don't use optimize.

Common Pitfalls

Keep the following tips in mind to avoid stepping in potholes with xpressive.

Create Grammars On A Single Thread

With static regexes, you can create grammars by nesting regexes inside one
another. When compiling the outer regex, both the outer and inner regex objects,
and all the regex objects to which they refer either directly or indirectly,
are modified. For this reason, it's dangerous for global regex objects to
participate in grammars. It's best to build regex grammars from a single
thread. Once built, the resulting regex grammar can be executed from multiple
threads without problems.

Beware Nested Quantifiers

This is a pitfall common to many regular expression engines. Some patterns
can cause exponentially bad performance. Often these patterns involve one
quantified term nested within another quantifier, such as "(a*)*", although in many cases,
the problem is harder to spot. Beware of patterns that have nested quantifiers.

Concepts

CharT requirements

If type BidiIterT is used
as a template argument to basic_regex<>,
then CharT is iterator_traits<BidiIterT>::value_type. Type CharT
must have a trivial default constructor, copy constructor, assignment operator,
and destructor. In addition, the following requirements must be met for objects
c of type CharT,
c1 and c2
of type CharT const,
and i of type int:

CharT Requirements

| Expression | Return type | Assertion / Note / Pre- / Post-condition |
|---|---|---|
| CharT c | CharT | Default constructor (must be trivial). |
| CharT c(c1) | CharT | Copy constructor (must be trivial). |
| c1 = c2 | CharT | Assignment operator (must be trivial). |
| c1 == c2 | bool | true if c1 has the same value as c2. |
| c1 != c2 | bool | true if c1 and c2 are not equal. |
| c1 < c2 | bool | true if the value of c1 is less than c2. |
| c1 > c2 | bool | true if the value of c1 is greater than c2. |
| c1 <= c2 | bool | true if c1 is less than or equal to c2. |
| c1 >= c2 | bool | true if c1 is greater than or equal to c2. |
| intmax_t i = c1 | int | CharT must be convertible to an integral type. |
| CharT c(i); | CharT | CharT must be constructible from an integral type. |

Traits Requirements

In the following table X
denotes a traits class defining types and functions for the character container
type CharT; u is an object of type X;
v is an object of type const X;
p is a value of type const CharT*; I1
and I2 are InputIterators;
c is a value of type const CharT;
s is an object of type X::string_type;
cs is an object of type
const X::string_type;
b is a value of type bool; i
is a value of type int; F1 and F2
are values of type const CharT*;
loc is an object of type
X::locale_type; and ch
is an object of type const char.

Traits Requirements

| Expression | Return type | Assertion / Note / Pre- / Post-condition |
|---|---|---|
| X::char_type | CharT | The character container type used in the implementation of class template basic_regex<>. |
| X::string_type | std::basic_string<CharT> or std::vector<CharT> | |
| X::locale_type | Implementation defined | A copy constructible type that represents the locale used by the traits class. |
| X::char_class_type | Implementation defined | A bitmask type representing a particular character classification. Multiple values of this type can be bitwise-or'ed together to obtain a new valid value. |
| X::hash(c) | unsigned char | Yields a value between 0 and UCHAR_MAX inclusive. |
| v.widen(ch) | CharT | Widens the specified char and returns the resulting CharT. |
| v.in_range(r1, r2, c) | bool | For any characters r1 and r2, returns true if r1 <= c && c <= r2. Requires that r1 <= r2. |
| v.in_range_nocase(r1, r2, c) | bool | For characters r1 and r2, returns true if there is some character d for which v.translate_nocase(d) == v.translate_nocase(c) and r1 <= d && d <= r2. Requires that r1 <= r2. |
| v.translate(c) | X::char_type | Returns a character such that for any character d that is to be considered equivalent to c then v.translate(c) == v.translate(d). |
| v.translate_nocase(c) | X::char_type | For all characters C that are to be considered equivalent to c when comparisons are to be performed without regard to case, then v.translate_nocase(c) == v.translate_nocase(C). |
| v.transform(F1, F2) | X::string_type | Returns a sort key for the character sequence designated by the iterator range [F1, F2) such that if the character sequence [G1, G2) sorts before the character sequence [H1, H2) then v.transform(G1, G2) < v.transform(H1, H2). |
| v.transform_primary(F1, F2) | X::string_type | Returns a sort key for the character sequence designated by the iterator range [F1, F2) such that if the character sequence [G1, G2) sorts before the character sequence [H1, H2) when character case is not considered then v.transform_primary(G1, G2) < v.transform_primary(H1, H2). |
| v.lookup_classname(F1, F2) | X::char_class_type | Converts the character sequence designated by the iterator range [F1, F2) into a bitmask type that can subsequently be passed to isctype. Values returned from lookup_classname can be safely bitwise or'ed together. Returns 0 if the character sequence is not the name of a character class recognized by X. The value returned shall be independent of the case of the characters in the sequence. |
| v.lookup_collatename(F1, F2) | X::string_type | Returns a sequence of characters that represents the collating element consisting of the character sequence designated by the iterator range [F1, F2). Returns an empty string if the character sequence is not a valid collating element. |
| v.isctype(c, v.lookup_classname(F1, F2)) | bool | Returns true if character c is a member of the character class designated by the iterator range [F1, F2), false otherwise. |
| v.value(c, i) | int | Returns the value represented by the digit c in base i if the character c is a valid digit in base i; otherwise returns -1. [Note: the value of i will only be 8, 10, or 16. -end note] |
| u.imbue(loc) | X::locale_type | Imbues u with the locale loc, returns the previous locale used by u. |
| v.getloc() | X::locale_type | Returns the current locale used by v. |

Acknowledgements

This section is adapted from the equivalent page in the Boost.Regex
documentation and from the proposal
to add regular expressions to the Standard Library.

Examples

Below you can find six complete sample programs.

See if a whole string matches a regex

This is the example from the Introduction. It is reproduced here for your
convenience.

See if a string contains a sub-string that matches a regex

Notice in this example how we use custom mark_tags
to make the pattern more readable. We can use the mark_tags
later to index into the match_results<>.

#include <iostream>
#include <boost/xpressive/xpressive.hpp>

using namespace boost::xpressive;

int main()
{
    char const *str = "I was born on 5/30/1973 at 7am.";

    // define some custom mark_tags with names more meaningful than s1, s2, etc.
    mark_tag day(1), month(2), year(3), delim(4);

    // this regex finds a date
    cregex date = (month = repeat<1,2>(_d))           // find the month ...
               >> (delim = (set = '/','-'))            // followed by a delimiter ...
               >> (day = repeat<1,2>(_d)) >> delim     // and a day followed by the same delimiter ...
               >> (year = repeat<1,2>(_d >> _d));      // and the year.

    cmatch what;

    if (regex_search(str, what, date))
    {
        std::cout << what[0]     << '\n'; // whole match
        std::cout << what[day]   << '\n'; // the day
        std::cout << what[month] << '\n'; // the month
        std::cout << what[year]  << '\n'; // the year
        std::cout << what[delim] << '\n'; // the delimiter
    }

    return 0;
}

Replace all sub-strings that match a regex

The following program finds dates in a string and marks them up with pseudo-HTML.

#include <iostream>
#include <boost/xpressive/xpressive.hpp>

using namespace boost::xpressive;

int main()
{
    std::string str("I was born on 5/30/1973 at 7am.");

    // essentially the same regex as in the previous example, but using a dynamic regex
    sregex date = sregex::compile("(\\d{1,2})([/-])(\\d{1,2})\\2((?:\\d{2}){1,2})");

    // As in Perl, $& is a reference to the sub-string that matched the regex
    std::string format("<date>$&</date>");

    str = regex_replace(str, date, format);
    std::cout << str << '\n';

    return 0;
}

Find all the sub-strings that match a regex and step through them one at
a time

The following program finds the words in a wide-character string. It uses
wsregex_iterator. Notice
that dereferencing a wsregex_iterator
yields a wsmatch object.

#include <iostream>
#include <boost/xpressive/xpressive.hpp>

using namespace boost::xpressive;

int main()
{
    std::wstring str(L"This is his face.");

    // find a whole word
    wsregex token = +alnum;

    wsregex_iterator cur(str.begin(), str.end(), token);
    wsregex_iterator end;

    for (; cur != end; ++cur)
    {
        wsmatch const &what = *cur;
        std::wcout << what[0] << L'\n';
    }

    return 0;
}

Split a string into tokens that each match a regex

The following program finds race times in a string and displays first the
minutes and then the seconds. It uses regex_token_iterator<>.

#include <iostream>
#include <boost/xpressive/xpressive.hpp>

using namespace boost::xpressive;

int main()
{
    std::string str("Eric: 4:40, Karl: 3:35, Francesca: 2:32");

    // find a race time
    sregex time = sregex::compile("(\\d):(\\d\\d)");

    // for each match, the token iterator should first take the value of
    // the first marked sub-expression followed by the value of the second
    // marked sub-expression
    int const subs[] = { 1, 2 };

    sregex_token_iterator cur(str.begin(), str.end(), time, subs);
    sregex_token_iterator end;

    for (; cur != end; ++cur)
    {
        std::cout << *cur << '\n';
    }

    return 0;
}

Split a string using a regex as a delimiter

The following program takes some text that has been marked up with HTML and
strips out the mark-up. It uses a regex that matches an HTML tag and a regex_token_iterator<>
that returns the parts of the string that do not match
the regex.

#include <iostream>
#include <boost/xpressive/xpressive.hpp>

using namespace boost::xpressive;

int main()
{
    std::string str("Now <bold>is the time <i>for all good men</i> to come to the aid of their</bold> country.");

    // find a HTML tag
    sregex html = '<' >> optional('/') >> +_w >> '>';

    // the -1 below directs the token iterator to display the parts of
    // the string that did NOT match the regular expression.
    sregex_token_iterator cur(str.begin(), str.end(), html, -1);
    sregex_token_iterator end;

    for (; cur != end; ++cur)
    {
        std::cout << '{' << *cur << '}';
    }
    std::cout << '\n';

    return 0;
}

This program outputs the following:

{Now }{is the time }{for all good men}{ to come to the aid of their}{ country.}