The Perl regular expression syntax is based on that used by the programming
language Perl. Perl regular expressions are the default behavior in Boost.Regex,
or you can pass the flag perl to the basic_regex constructor.

A section beginning ( and ending )
acts as a marked sub-expression. Whatever matched the sub-expression is split
out in a separate field by the matching algorithms. Marked sub-expressions
can also be repeated, or referred to by a back-reference.

A marked sub-expression is useful to lexically group part of a regular expression,
but has the side-effect of splitting out an extra field in the result. As
an alternative you can lexically group part of a regular expression without
generating a marked sub-expression by using (?: and ). For example, (?:ab)+
will repeat ab without splitting out any separate sub-expressions.
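The difference between the two kinds of grouping can be sketched by counting the fields in the match result. To stay within the standard library this sketch uses std::regex (default ECMAScript grammar) as a stand-in, on the assumption that Boost.Regex's Perl syntax treats ( ) and (?: ) the same way; captured_fields is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Number of marked sub-expressions reported for a successful full match:
// the result size minus one, because field 0 is the whole match.
// Returns -1 when the text does not match. (Hypothetical helper.)
int captured_fields(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_match(text, m, re)) return -1;
    return static_cast<int>(m.size()) - 1;
}

void grouping_demo() {
    // (ab)+ repeats "ab" and also splits out one extra field.
    assert(captured_fields("(ab)+", "ababab") == 1);
    // (?:ab)+ matches the same text without producing an extra field.
    assert(captured_fields("(?:ab)+", "ababab") == 0);
}
```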

The normal repeat operators are "greedy", that is to say they will
consume as much input as possible. There are non-greedy versions available
that will consume as little input as possible while still producing a match.

*? Matches the previous atom zero or more times, while
consuming as little input as possible.

+? Matches the previous atom one or more times, while
consuming as little input as possible.

?? Matches the previous atom zero or one times, while
consuming as little input as possible.

{n,}? Matches the previous atom n or more times, while
consuming as little input as possible.

{n,m}? Matches the previous atom between n and m times,
while consuming as little input as possible.
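A short sketch of the difference between greedy and non-greedy repeats follows. It uses std::regex (default ECMAScript grammar) as a standard-library stand-in, assuming Boost.Regex's Perl syntax gives the same results for these operators; first_match is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Text of the first match found by a search, or an empty string if the
// pattern does not match anywhere. (Hypothetical helper.)
std::string first_match(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}

void greedy_demo() {
    const std::string text = "<b>bold</b>";
    // Greedy: .+ consumes as much as possible, so the match runs to the last '>'.
    assert(first_match("<.+>", text) == "<b>bold</b>");
    // Non-greedy: .+? consumes as little as possible, so the match stops at the first '>'.
    assert(first_match("<.+?>", text) == "<b>");
}
```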

By default when a repeated pattern does not match then the engine will backtrack
until a match is found. However, this behaviour can sometimes be undesirable,
so there are also "possessive" repeats: these match as much as possible
and do not then allow backtracking if the rest of the expression fails to
match.

*+ Matches the previous atom zero or more times, while
giving nothing back.

++ Matches the previous atom one or more times, while
giving nothing back.

?+ Matches the previous atom zero or one times, while
giving nothing back.

{n,}+ Matches the previous atom n or more times, while
giving nothing back.

{n,m}+ Matches the previous atom between n and m times,
while giving nothing back.

For example [a-c] will match any single character in the
range 'a' to 'c'. By default, for Perl regular expressions, a character x
is within the range y to z if the code point of the character lies within
the code points of the endpoints of the range. Alternatively, if you set the
collate flag when constructing the regular expression, then ranges are locale
sensitive.

An expression of the form [[.col.]] matches the collating
element col. A collating element is any single character,
or any sequence of characters that collates as a single unit. Collating elements
may also be used as the end point of a range, for example: [[.ae.]-c]
matches the character sequence "ae", plus any single character
in the range "ae"-c, assuming that "ae" is treated as
a single collating element in the current locale.

As an extension, a collating element may also be specified via its symbolic name, for example [[.NUL.]] matches a NUL (\0) character.

An expression of the form [[=col=]] matches any character
or collating element whose primary sort key is the same as that for collating
element col; as with collating elements, the name col
may be a symbolic name.
A primary sort key is one that ignores case, accentuation, or locale-specific
tailorings; so for example [[=a=]] matches
any of the characters: a, À, Á, Â, Ã, Ä, Å, A, à, á, â, ã, ä and å. Unfortunately the implementation
of this is reliant on the platform's collation and localisation support;
this feature cannot be relied upon to work portably across all platforms,
or even all locales on one platform.

All the escape sequences that match a single character, or a single character
class are permitted within a character class definition. For example [\[\]] would match either of [ or ]
while [\W\d]
would match any character that is either a "digit", or
is not a "word" character.
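The two sets above can be checked one character at a time, as in the following sketch. It uses std::regex (default ECMAScript grammar) as a standard-library stand-in, assuming Boost.Regex's Perl syntax handles escapes inside a set the same way; in_set is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if the single character c is matched by the character set
// expression set_expr. (Hypothetical helper.)
bool in_set(const std::string& set_expr, char c) {
    return std::regex_match(std::string(1, c), std::regex(set_expr));
}

void set_escape_demo() {
    // Escaped brackets are ordinary members of the set.
    assert(in_set(R"([\[\]])", '['));
    assert(in_set(R"([\[\]])", ']'));
    assert(!in_set(R"([\[\]])", 'x'));
    // [\W\d] is the union of "not a word character" and "digit".
    assert(in_set(R"([\W\d])", '7'));   // a digit
    assert(in_set(R"([\W\d])", '!'));   // not a word character
    assert(!in_set(R"([\W\d])", 'a'));  // a word character that is not a digit
}
```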

Any escaped character x, if x is
the name of a character class, shall match any character that is a member
of that class, and any escaped character X, if x
is the name of a character class, shall match any character not in that class
(X being the upper case form of x).

The sequence \G matches only at the end of the last match
found, or at the start of the text being matched if no previous match was
found. This escape is useful if you're iterating over the matches contained
within a text, and you want each subsequent match to start where the last
one ended.

The escape sequence \Q begins a "quoted sequence":
all the subsequent characters are treated as literals, until either the end
of the regular expression or \E is found. For example the expression \Q\*+\Ea+
would match either of \*+a or \*+aaa.

\C Matches a single code point: in Boost.Regex this has
exactly the same effect as a "." operator.

\X Matches a combining character sequence: that is any non-combining character
followed by a sequence of zero or more combining characters.

\K Resets the start location of $0 to the current text
position: in other words everything to the left of \K is "kept back"
and does not form part of the regular expression match. $` is updated accordingly.

For example foo\Kbar matched against the text "foobar"
would return the match "bar" for $0 and "foo" for $`.
This can be used to simulate variable width lookbehind assertions.

A named sub-expression is created using (?<NAME>expression), which can
then be referred to by the name NAME. Alternatively
you can delimit the name using 'NAME' as in:

(?'NAME'expression)

These named subexpressions can be referred to in a backreference using either
\g{NAME} or \k<NAME> and can
also be referred to by name in a Perl
format string for search and replace operations, or in the match_results member functions.

(?imsx-imsx ... ) alters which of the Perl modifiers are
in effect within the pattern. Changes take effect from the point that the
block is first seen and extend to any enclosing ). Letters
before a '-' turn that Perl modifier on, letters afterward turn it off.

(?|pattern) resets the subexpression count at the start
of each "|" alternative within pattern.

The sub-expression count following this construct is that of whichever branch
had the largest number of sub-expressions. This construct is useful when
you want to capture one of a number of alternative matches in a single sub-expression
index.

For example, in (?|(a)|(b)(c))(d) the sub-expressions (a) and (b) both have
index 1, (c) has index 2, and (d) has index 3.

Lookahead is typically used to create the logical AND of two regular expressions,
for example if a password must contain a lower case letter, an upper case
letter, a punctuation symbol, and be at least 6 characters long, then the
expression (?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{6,} could be
used to validate the password.
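One way to write such a password check is sketched below: each lookahead asserts one requirement without consuming any input, and the final .{6,} enforces the length. The sketch uses std::regex (default ECMAScript grammar) as a standard-library stand-in, assuming Boost.Regex's Perl syntax treats lookahead and the [[:class:]] names the same way; is_valid_password is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if candidate contains a lower case letter, an upper case letter and
// a punctuation symbol, and is at least 6 characters long.
// (Hypothetical helper.)
bool is_valid_password(const std::string& candidate) {
    static const std::regex re(
        "(?=.*[[:lower:]])"   // somewhere ahead: a lower case letter
        "(?=.*[[:upper:]])"   // somewhere ahead: an upper case letter
        "(?=.*[[:punct:]])"   // somewhere ahead: a punctuation symbol
        ".{6,}");             // the text itself: six or more characters
    return std::regex_match(candidate, re);
}

void password_demo() {
    assert(is_valid_password("Secr3t!"));
    assert(!is_valid_password("secret!"));  // no upper case letter
    assert(!is_valid_password("Secret1"));  // no punctuation symbol
    assert(!is_valid_password("Ab!"));      // too short
}
```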

(?>pattern) pattern is matched
independently of the surrounding patterns; the expression will never backtrack
into pattern. Independent sub-expressions are typically
used to improve performance; only the best possible match for pattern will
be considered, and if this doesn't allow the expression as a whole to match then
no match is found at all.

(?(condition)yes-pattern|no-pattern) attempts to match
yes-pattern if the condition is
true, otherwise attempts to match no-pattern.

(?(condition)yes-pattern) attempts to match yes-pattern
if the condition is true, otherwise fails.

condition may be either: a forward lookahead assertion,
the index of a marked sub-expression (the condition becomes true if the sub-expression
has been matched), or an index of a recursion (the condition becomes true
if we are executing directly inside the specified recursion).

(?(R)yes-pattern|no-pattern) Executes yes-pattern
if we are executing inside a recursion, otherwise executes no-pattern.

(?(RN)yes-pattern|no-pattern) Executes
yes-pattern if we are executing inside a recursion
to sub-expression N, otherwise executes no-pattern.

(?(DEFINE)never-executed-pattern) Defines a block of
code that is never executed and matches no characters: this is usually
used to define one or more named sub-expressions which are referred to from
elsewhere in the pattern.

If you view the regular expression as a directed (possibly cyclic) graph,
then the best match found is the first match found by a depth-first-search
performed on that graph, while matching the input text.

Alternatively:

The best match found is the leftmost
match, with individual elements matched as follows:

Construct                               What gets matched

AtomA AtomB                             Locates the best match for AtomA that has a following match for AtomB.
Expression1 | Expression2               If Expression1 can be matched then returns that match, otherwise attempts to match Expression2.
S{N}                                    Matches S repeated exactly N times.
S{N,M}                                  Matches S repeated between N and M times, and as many times as possible.
S{N,M}?                                 Matches S repeated between N and M times, and as few times as possible.
S?, S*, S+                              The same as S{0,1}, S{0,UINT_MAX}, S{1,UINT_MAX} respectively.
S??, S*?, S+?                           The same as S{0,1}?, S{0,UINT_MAX}?, S{1,UINT_MAX}? respectively.
(?>S)                                   Matches the best match for S, and only that.
(?=S), (?<=S)                           Matches only the best match for S (this is only visible if there are capturing parentheses within S).
(?!S), (?<!S)                           Considers only whether a match for S exists or not.
(?(condition)yes-pattern | no-pattern)  If condition is true, then only yes-pattern is considered, otherwise only no-pattern is considered.
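A few of these rules can be observed directly, as in the sketch below. It uses std::regex (default ECMAScript grammar) as a standard-library stand-in, assuming Boost.Regex's Perl syntax chooses the same matches for these constructs; best_match is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Text of the match selected by a search, or an empty string if there is
// no match. (Hypothetical helper.)
std::string best_match(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}

void best_match_demo() {
    // Expression1 | Expression2: the first alternative that can match wins,
    // even when a later alternative would match more text.
    assert(best_match("a|ab", "ab") == "a");
    assert(best_match("ab|a", "ab") == "ab");
    // The leftmost match is preferred over a longer match further right.
    assert(best_match("b+|a", "abbb") == "a");
    // S{N,M} takes as many repeats as possible, S{N,M}? as few as possible.
    assert(best_match("a{1,3}", "aaaa") == "aaa");
    assert(best_match("a{1,3}?", "aaaa") == "a");
}
```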

There are a variety of flags that may be combined with the perl option
when constructing the regular expression. In particular, note that the newline_alt
option alters the syntax, while the collate, nosubs
and icase options modify how the case and locale sensitivity
are to be applied.