=encoding utf8
=head1 NAME
Synopsis_05 - Regexes and Rules
=head1 AUTHORS
Damian Conway and
Allison Randal
=head1 VERSION
Maintainer: Patrick Michaud and
Larry Wall
Date: 24 Jun 2002
Last Modified: 20 Feb 2008
Number: 5
Version: 73
This document summarizes Apocalypse 5, which is about the new regex
syntax. We now try to call them I<regex> rather than "regular
expressions" because they haven't been regular expressions for a
long time, and we think the popular term "regex" is in the process of
becoming a technical term with a precise meaning of: "something you do
pattern matching with, kinda like a regular expression". On the other
hand, one of the purposes of the redesign is to make portions of
our patterns more amenable to analysis under traditional regular
expression and parser semantics, and that involves making careful
distinctions between which parts of our patterns and grammars are
to be treated as declarative, and which parts as procedural.
In any case, when referring to recursive patterns within a grammar,
the terms I<rule> and I<token> are generally preferred over I<regex>.
=head1 New match result and capture variables
The underlying match result object is now available as the C<$/>
variable, which is implicitly lexically scoped. All user access to the
most recent match is through this variable, even when
it doesn't look like it. The individual capture variables (such as C<$0>,
C<$1>, etc.) are just elements of C<$/>.
By the way, unlike in Perl 5, the numbered capture variables now
start at C<$0> instead of C<$1>. See below.
=head1 Unchanged syntactic features
The following regex features use the same syntax as in Perl 5:
=over
=item *
Capturing: (...)
=item *
Repetition quantifiers: *, +, and ?
=item *
Alternatives: |
=item *
Backslash escape: \
=item *
Minimal matching suffix: ??, *?, +?
=back
While the syntax of C<|> does not change, the default semantics do
change slightly. We are attempting to concoct a pleasing mixture
of declarative and procedural matching so that we can have the
best of both. In short, you need not write your own tokener for
a grammar because Perl will write one for you. See the section
below on "Longest-token matching".
=head1 Simplified lexical parsing of patterns
Unlike traditional regular expressions, Perl 6 does not require
you to memorize an arbitrary list of metacharacters. Instead it
classifies characters by a simple rule. All glyphs (graphemes)
whose base characters are either the underscore (C<_>) or have
a Unicode classification beginning with 'L' (i.e. letters) or 'N'
(i.e. numbers) are always literal (i.e. self-matching) in regexes. They
must be escaped with a C<\> to make them metasyntactic (in which
case that single alphanumeric character is itself metasyntactic,
but any immediately following alphanumeric character is not).
All other glyphs--including whitespace--are exactly the opposite:
they are always considered metasyntactic (i.e. non-self-matching) and
must be escaped or quoted to make them literal. As is traditional,
they may be individually escaped with C<\>, but in Perl 6 they may
also be quoted as follows.
Sequences of one or more glyphs of either type (i.e. any glyphs at all)
may be made literal by placing them inside single quotes. (Double
quotes are also allowed, with the same interpolative semantics as
the current language in which the regex is lexically embedded.)
Quotes create a quantifiable atom, so while
moose*
quantifies only the 'e' and matches "mooseee", saying
'moose'*
quantifies the whole string and would match "moosemoose".
Here is a table that summarizes the distinctions:
                 Alphanumerics    Non-alphanumerics         Mixed
 Literal glyphs   a    1    _     \*  \$  \.   \\   \'      K\-9\!
 Metasyntax      \a   \1   \_      *   $   .    \    '      \K-\9!
 Quoted glyphs   'a'  '1'  '_'    '*' '$' '.' '\\' '\''     'K-9!'
In other words, identifier glyphs are literal (or metasyntactic when
escaped), non-identifier glyphs are metasyntactic (or literal when
escaped), and single quotes make everything inside them literal.
Note, however, that not all non-identifier glyphs are currently
meaningful as metasyntax in Perl 6 regexes (e.g. C<\1> C<\_> C<->
C<!>). It is more accurate to say that all unescaped non-identifier
glyphs are I<potential> metasyntax, and reserved for future use.
If you use such a sequence, a helpful compile-time error is issued
indicating that you either need to quote the sequence or define a new
operator to recognize it.
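The classification rule above is mechanical enough to sketch in a few
lines. The following Python fragment is only an illustration of the rule
as stated here, not part of any Perl 6 implementation; the function name
C<is_literal_glyph> is hypothetical, and taking the first code point of
the NFD decomposition is an assumed way of finding a glyph's base
character.

```python
import unicodedata


def is_literal_glyph(glyph: str) -> bool:
    """Return True if an unescaped, non-empty glyph would match itself.

    Assumption: the "base character" of a grapheme is the first code
    point of its canonical decomposition (NFD).
    """
    base = unicodedata.normalize("NFD", glyph)[0]
    # Underscore, letters (category L*) and numbers (category N*) are
    # literal; every other glyph, including whitespace, is metasyntax.
    return base == "_" or unicodedata.category(base)[0] in ("L", "N")
```

Under this sketch C<a>, C<1>, C<_> and an accented letter are literal,
while C<*>, C<$> and a space are metasyntactic and would need escaping or
quoting.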
=head1 Modifiers
=over
=item *
The extended syntax (C</x>) is no longer required...it's the default.
(In fact, it's pretty much mandatory--the only way to get back to
the old syntax is with the C<:Perl5>/C<:P5> modifier.)
=item *
There are no C</s> or C</m> modifiers (changes to the meta-characters
replace them - see below).
=item *
There is no C</e> evaluation modifier on substitutions; instead use:
s/pattern/{ doit() }/
or:
s[pattern] = doit()
Instead of C</ee> say:
s/pattern/{ eval doit() }/
or:
s[pattern] = eval doit()
=item *
Modifiers are now placed as adverbs at the I<start> of a match/substitution:
m:g:i/\s* (\w*) \s* ,?/;
Every modifier must start with its own colon. The delimiter must be
separated from the final modifier by whitespace if it would otherwise be taken
as an argument to the preceding modifier (which is true if and only if
the next character is a left parenthesis.)
=item *
The single-character modifiers also have longer versions:
:i :ignorecase
:b :basechar
:g :global
=item *
The C<:i> (or C<:ignorecase>) modifier causes case distinctions to be
ignored in its lexical scope, but not in its dynamic scope. That is,
subrules always use their own case settings.
The C<:ii> (or C<:samecase>) variant may be used on a substitution to change the
substituted string to the same case pattern as the matched string.
If the pattern is matched without the C<:sigspace> modifier, case
info is carried across on a character by character basis. If the
right string is longer than the left one, the case of the final
character is replicated. Titlecase is carried across if possible
regardless of whether the resulting letter is at the beginning of
a word or not; if there is no titlecase character available, the
corresponding uppercase character is used. (This policy can be
modified within a lexical scope by a language-dependent Unicode
declaration to substitute titlecase according to the orthographic
rules of the specified language.) Characters that carry no case
information leave their corresponding replacement character unchanged.
If the pattern is matched with C<:sigspace>, then a slightly smarter
algorithm is used which attempts to determine if there is a uniform
capitalization policy over each matched word, and applies the same
policy to each replacement word. If there doesn't seem to be a uniform
policy on the left, the policy for each word is carried over word by
word, with the last pattern word replicated if necessary. If a word
does not appear to have a recognizable policy, the replacement word
is translated character for character as in the non-sigspace case.
Recognized policies include:
lc()
uc()
ucfirst(lc())
lcfirst(uc())
capitalize()
In any case, only the officially matched string part of the pattern
match counts, so any sort of lookahead or contextual matching is not
included in the analysis.
=item *
The C<:b> (or C<:basechar>) modifier scopes exactly like C<:ignorecase>
except that it ignores accents instead of case. It is equivalent
to taking each grapheme (in both target and pattern), converting
both to NFD (maximally decomposed) and then comparing the two base
characters (Unicode non-mark characters) while ignoring any trailing
mark characters. The mark characters are ignored only for the purpose
of determining the truth of the assertion; the actual text matched
includes all ignored characters, including any that follow the final
base character.
The C<:bb> (or C<:samebase>) variant may be used on a substitution to change the
substituted string to the same accent pattern as the matched string.
Accent info is carried across on a character by character basis. If
the right string is longer than the left one, the remaining characters
are substituted without any modification. (Note that NFD/NFC distinctions
are usually immaterial, since Perl encapsulates that in grapheme mode.)
Under C<:sigspace> the preceding rules are applied word by word.
=item *
The C<:c> (or C<:continue>) modifier causes the pattern to continue
scanning from the specified position (defaulting to C<$/.to>):
m:c($p)/ pattern / # start scanning at position $p
Note that this does not automatically anchor the pattern to the starting
location. (Use C<:p> for that.) The pattern you supply to C<split>
has an implicit C<:c> modifier.
String positions are of type C<StrPos> and should generally be treated
as opaque.
=item *
The C<:p> (or C<:pos>) modifier causes the pattern to try to match only at
the specified string position:
m:pos($p)/ pattern / # match at position $p
If the argument is omitted, it defaults to C<$/.to>. (Unlike in
Perl 5, the string itself has no clue where its last match ended.)
All subrule matches are implicitly passed their starting position.
Likewise, the pattern you supply to a Perl macro's C<is parsed>
trait has an implicit C<:p> modifier.
Note that
m:c($p)/pattern/
is roughly equivalent to
m:p($p)/.*? <( pattern )> /
=item *
The new C<:s> (C<:sigspace>) modifier causes whitespace sequences
to be considered "significant"; they are replaced by a whitespace
matching rule, C<< <.ws> >>. That is,
m:s/ next cmd '=' <condition>/
is the same as:
m/ <.ws> next <.ws> cmd <.ws> '=' <.ws> <condition>/
which is effectively the same as:
m/ \s* next \s+ cmd \s* '=' \s* <condition>/
But in the case of
m:s{(a|\*) (b|\+)}
or equivalently,
m { (a|\*) <.ws> (b|\+) }
C<< <.ws> >> can't decide what to do until it sees the data.
It still does the right thing. If not, define your own C<< ws >>
and C<:sigspace> will use that.
In general you don't need to use C<:sigspace> within grammars because
the parser rules automatically handle whitespace policy for you.
In this context, whitespace often includes comments, depending on
how the grammar chooses to define its whitespace rule. Although the
default C<< <.ws> >> subrule recognizes no comment construct, any
grammar is free to override the rule. The C<< <.ws> >> rule is not
intended to mean the same thing everywhere.
It's also possible to pass an argument to C<:sigspace> specifying
a completely different subrule to apply. This can be any rule; it
doesn't have to match whitespace. When discussing this modifier, it is
important to distinguish the significant whitespace in the pattern from
the "whitespace" being matched, so we'll call the pattern's whitespace
I<sigspace>, and generally reserve I<whitespace> to indicate whatever
C<< <.ws> >> matches in the current grammar. The correspondence
between sigspace and whitespace is primarily metaphorical, which is
why the correspondence is both useful and (potentially) confusing.
The C<:sigspace> modifier is considered sufficiently important that
match variants are defined for them:
mm/match some words/ # same as m:sigspace
ss/match some words/replace those words/ # same as s:sigspace
=item *
New modifiers specify Unicode level:
m:bytes / .**2 / # match two bytes
m:codes / .**2 / # match two codepoints
m:graphs / .**2 / # match two language-independent graphemes
m:chars / .**2 / # match two characters at current max level
There are corresponding pragmas to default to these levels. Note that
the C<:chars> modifier is always redundant because dot always matches
characters at the highest level allowed in scope. This highest level
may be identical to one of the other three levels, or it may be more
specific than C<:graphs> when a particular language's character rules
are in use. Note that you may not specify language-dependent character
processing without specifying I<which> language you're depending on.
[Conjecture: the C<:chars> modifier could take an argument specifying
which language's rules to use for this match.]
=item *
The new C<:Perl5>/C<:P5> modifier allows Perl 5 regex syntax to be
used instead. (It does not go so far as to allow you to put your
modifiers at the end.) For instance,
m:P5/(?mi)^(?:[a-z]|\d){1,2}(?=\s)/
is equivalent to the Perl 6 syntax:
m/ :i ^^ [ <[a..z]> || \d ] ** 1..2 <?before \s> /
=item *
Any integer modifier specifies a count. What kind of count is
determined by the character that follows.
=item *
If followed by an C<x>, it means repetition. Use C<:x(4)> for the
general form. So
s:4x [ (<.ident>) '=' (\N+) $$] = "$0 => $1";
is the same as:
s:x(4) [ (<.ident>) '=' (\N+) $$] = "$0 => $1";
which is almost the same as:
s:c[ (<.ident>) '=' (\N+) $$] = "$0 => $1" for 1..4;
except that the string is unchanged unless all four matches are found.
However, ranges are allowed, so you can say C<:x(1..4)> to change anywhere
from one to four matches.
=item *
If the number is followed by an C<st>, C<nd>, C<rd>, or C<th>, it means
find the I<N>th occurrence. Use C<:nth(3)> for the general form. So
s:3rd/(\d+)/@data[$0]/;
is the same as
s:nth(3)/(\d+)/@data[$0]/;
which is the same as:
m/(\d+)/ && m:c/(\d+)/ && s:c/(\d+)/@data[$0]/;
Lists and junctions are allowed: C<:nth(1|2|3|5|8|13|21|34|55|89)>.
So are closures: C<:nth({.is_fibonacci})>
=item *
With the new C<:ov> (C<:overlap>) modifier, the current regex will
match at all possible character positions (including overlapping)
and return all matches in a list context, or a disjunction of matches
in a scalar context. The first match at any position is returned.
The matches are guaranteed to be returned in left-to-right order with
respect to the starting positions.
$str = "abracadabra";
if $str ~~ m:overlap/ a (.*) a / {
@substrings = @@(); # bracadabr cadabr dabr br
}
=item *
With the new C<:ex> (C<:exhaustive>) modifier, the current regex will
match every possible way (including overlapping) and return all matches
in a list context, or a disjunction of matches in a scalar context.
The matches are guaranteed to be returned in left-to-right order with
respect to the starting positions. The order within each starting
position is not guaranteed and may depend on the nature of both the
pattern and the matching engine. (Conjecture: or we could enforce
backtracking engine semantics. Or we could guarantee no order at all
unless the pattern starts with "::" or some such to suppress DFAish
solutions.)
$str = "abracadabra";
if $str ~~ m:exhaustive/ a (.*?) a / {
say "@()"; # br brac bracad bracadabr c cad cadabr d dabr br
}
Note that the C<~~> above can return as soon as the first match is found,
and the rest of the matches may be performed lazily by C<@()>.
=item *
The new C<:rw> modifier causes this regex to I<claim> the current
string for modification rather than assuming copy-on-write semantics.
All the captures in C<$/> become lvalues into the string, such
that if you modify, say, C<$1>, the original string is modified in
that location, and the positions of all the other fields modified
accordingly (whatever that means). In the absence of this modifier
(especially if it isn't implemented yet, or is never implemented),
all pieces of C<$/> are considered copy-on-write, if not read-only.
[Conjecture: this should really associate a pattern with a string variable,
not a (presumably immutable) string value.]
=item *
The new C<:keepall> modifier causes this regex and all invoked subrules
to remember everything, even if the rules themselves don't ask for
their subrules to be remembered. This is for forcing a grammar that
throws away whitespace and comments to keep them instead.
=item *
The new C<:ratchet> modifier causes this regex to not backtrack by default.
(Generally you do not use this modifier directly, since it's implied by
C<token> and C<rule> declarations.) The effect of this modifier is
to imply a C<:> after every construct that could backtrack, including
bare C<*>, C<+>, and C<?> quantifiers, as well as alternations.
(Note: for portions of patterns subject to longest-token analysis, a C<:>
is ignored in any case, since there will be no backtracking necessary.)
=item *
The new C<:panic> modifier causes this regex and all invoked subrules
to try to backtrack on any rules that would otherwise default to
not backtracking because they have C<:ratchet> set. Never panic
unless you're desperate and want the pattern matcher to do a lot of
unnecessary work. If you have an error in your grammar, it's almost
certainly a bad idea to fix it by backtracking.
=item *
The C<:i>, C<:s>, C<:Perl5>, and Unicode-level modifiers can be
placed inside the regex (and are lexically scoped):
m/:s alignment '=' [:i left|right|cent[er|re]] /
As with modifiers outside, only parentheses are recognized as valid
brackets for args to the adverb. In particular:
m/:foo[xxx]/    Parses as :foo [xxx]
m/:foo{xxx}/    Parses as :foo {xxx}
m/:foo<xxx>/    Parses as :foo <xxx>
=item *
User-defined modifiers will be possible:
m:fuzzy/pattern/;
=item *
User-defined modifiers can also take arguments, but only in parentheses:
m:fuzzy('bare')/pattern/;
=item *
To use parens for your delimiters you have to separate:
m:fuzzy (pattern);
or you'll end up with:
m:fuzzy(fuzzyargs); pattern ;
=back
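The difference between the C<:overlap> and C<:exhaustive> modifiers
described above can be modelled outside Perl 6. The Python sketch below
is an approximation under stated assumptions, not the Perl 6 matching
engine: the helper names C<overlap> and C<exhaustive> are hypothetical,
Python's C<re> stands in for the regex engine, and "every possible way"
is approximated as "every distinct start/end span", which is sufficient
for the C<abracadabra> examples.

```python
import re


def overlap(pattern, text):
    """First match at each starting position, left to right."""
    rx = re.compile(pattern, re.DOTALL)  # dot matches newline, as in Perl 6
    matches = []
    for start in range(len(text) + 1):
        m = rx.match(text, start)        # anchored at `start`
        if m:
            matches.append(m)
    return matches


def exhaustive(pattern, text):
    """One match for every (start, end) span the pattern can cover."""
    rx = re.compile(pattern, re.DOTALL)
    matches = []
    for start in range(len(text) + 1):
        for end in range(start, len(text) + 1):
            m = rx.fullmatch(text, start, end)  # must cover the span exactly
            if m:
                matches.append(m)
    return matches


text = "abracadabra"
overlap_caps = [m.group(1) for m in overlap(r"a(.*)a", text)]
exhaustive_caps = [m.group(1) for m in exhaustive(r"a(.*?)a", text)]
```

With these definitions C<overlap_caps> holds the four captures listed in
the C<:overlap> example and C<exhaustive_caps> holds the ten captures
listed in the C<:exhaustive> example. The ordering within one starting
position is an artifact of this sketch, since the text above leaves that
order unspecified.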
=head1 Changed metacharacters
=over
=item *
A dot C<.> now matches I<any> character including newline. (The C</s>
modifier is gone.)
=item *
C<^> and C<$> now always match the start/end of a string, like the old
C<\A> and C<\z>. (The C</m> modifier is gone.) On the right side of
an embedded C<~~> or C<!~~> operator they always match the start/end
of the indicated submatch because that submatch is logically being
treated as a separate string.
=item *
A C<$> no longer matches an optional preceding C<\n> so it's necessary
to say C<\n?$> if that's what you mean.
=item *
C<\n> now matches a logical (platform independent) newline not just C<\x0a>.
=item *
The C<\A>, C<\Z>, and C<\z> metacharacters are gone.
=back
=head1 New metacharacters
=over
=item *
Because C</x> is default:
=over
=item *
An unescaped C<#> now always introduces a comment. If followed
by an opening bracket character (and if not in the first column),
it introduces an embedded comment that terminates with the closing
bracket. Otherwise the comment terminates at the newline.
=item *
Whitespace is now always metasyntactic, i.e. used only for layout
and not matched literally (but see the C<:sigspace> modifier described above).
=back
=item *
C<^^> and C<$$> match line beginnings and endings. (The C</m>
modifier is gone.) They are both zero-width assertions. C<$$>
matches before any C<\n> (logical newline), and also at the end of
the string if the final character was I<not> a C<\n>. C<^^> always
matches the beginning of the string and after any C<\n> that is not
the final character in the string.
=item *
C<.> matches an I<anything>, while C<\N> matches an I<anything except newline>. (The C</s> modifier is gone.) In particular, C<\N> matches
neither carriage return nor line feed.
=item *
The new C<&> metacharacter separates conjunctive terms. The patterns
on either side must match with the same beginning and end point.
Note: if you don't want your two terms to end at the same point,
then you really want to use a lookahead instead.
As with the disjunctions C<|> and C<||>, conjunctions come in both
C<&> and C<&&> forms. The C<&> form is considered declarative rather than
procedural; it allows the compiler and/or the
run-time system to decide which parts to evaluate first, and it is
erroneous to assume either order happens consistently. The C<&&>
form guarantees left-to-right order, and backtracking makes the right
argument vary faster than the left. In other words, C<&&> and C<||> establish
sequence points. The left side may be backtracked into when backtracking
is allowed into the construct as a whole.
The C<&> operator is list associative like C<|>, but has slightly
tighter precedence. Likewise C<&&> has slightly tighter precedence
than C<||>. As with the normal junctional and short-circuit operators,
C<&> and C<|> are both tighter than C<&&> and C<||>.
=item *
The C<~~> and C<!~~> operators cause a submatch to be performed on
whatever was matched by the variable or atom on the left. String
anchors consider that submatch to be the entire string. So, for
instance, you can ask to match any identifier that does not contain
the word "moose":
<ident> !~~ 'moose'
In contrast
<ident> !~~ ^ 'moose' $
would allow any identifier (including any identifier containing
"moose" as a substring) as long as the identifier as a whole is not
equal to "moose". (Note the anchors, which attach the submatch to the
beginning and end of the identifier as if that were the entire match.)
When used as part of a longer match, for clarity it might be good to
use extra brackets:
[ <ident> !~~ ^ 'moose' $ ]
The precedence of C<~~> and C<!~~> fits in between the junctional and
sequential versions of the logical operators just as it does in normal
Perl expressions (see S03). Hence
<ident> !~~ 'moose' | 'squirrel'
parses as
<ident> !~~ [ 'moose' | 'squirrel' ]
while
<ident> !~~ 'moose' || 'squirrel'
parses as
[ <ident> !~~ 'moose' ] || 'squirrel'
=back
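The submatch behaviour of C<!~~> described above, where the text matched
by the left side is treated as a complete string for the right side, can
be illustrated with a small Python model. This is a sketch under
assumptions rather than Perl 6 semantics in full: the name
C<ident_without> is hypothetical, identifiers are approximated by the
Python pattern C<[^\W\d]\w*>, and backtracking into the identifier is
not modelled.

```python
import re

# Assumed stand-in for an identifier: a letter or underscore, then word chars.
IDENT = re.compile(r"[^\W\d]\w*")


def ident_without(text, pos, sub_pattern):
    """Match an identifier at `pos`, then reject it if `sub_pattern`
    matches anywhere inside the identifier taken as its own string."""
    m = IDENT.match(text, pos)
    if m is None:
        return None
    ident = m.group(0)
    # Anchors in sub_pattern refer to the identifier, not to `text`.
    if re.search(sub_pattern, ident):
        return None
    return ident
```

Here C<ident_without("moosehead", 0, "moose")> rejects the identifier
because it contains "moose", while the anchored form
C<ident_without("moosehead", 0, r"\Amoose\Z")> accepts it, since only an
identifier equal to "moose" as a whole is excluded.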
=head1 Bracket rationalization
=over
=item *
C<(...)> still delimits a capturing group. However the ordering of these
groups is hierarchical rather than linear. See L<Nested subpattern captures>.
=item *
C<[...]> is no longer a character class.
It now delimits a non-capturing group.
=item *
C<{...}> is no longer a repetition quantifier.
It now delimits an embedded closure. It is always considered
procedural rather than declarative; it establishes a sequence point
between what comes before and what comes after. (To avoid this
use the C<< <?{...}> >> assertion syntax instead.)
=item *
You can call Perl code as part of a regex match by using a closure.
Embedded code does not usually affect the match--it is only used
for side-effects:
/ (\S+) { print "string not blank\n"; $text = $0; }
\s+ { print "but does contain whitespace\n" }
/
An B<explicit> reduction using the C<make> function sets the I<result object>
for this match:
/ (\d) { make $0.sqrt } Remainder /;
This has the effect of capturing the square root of the numified string,
instead of the string. The C<Remainder> part is matched but is not returned
unless the first C<make> is later overridden by another C<make>.
These closures are invoked with a topic (C<$_>) of the current match
state (a C<Cursor> object). Within a closure, the instantaneous
position within the search is denoted by the C<.pos> method on
that object. As with all string positions, you must not treat it
as a number unless you are very careful about which units you are
dealing with.
The C<Cursor> object can also return the original item that we are
matching against; this is available from the C<.orig> method, named to
remind you that it probably came from the user's C<$_> variable.
(But that may well be off in some other scope when indirect rules
are called, so we mustn't rely on the user's lexical scope.)
The closure is also guaranteed to start with a C<$/> C<Match> object
representing the match so far. However, if the closure does its own
internal matching, its C<$/> variable will be rebound to the result
of I<that> match until the end of the embedded closure.
=item *
A closure can affect the match if it calls C<fail>:
/ (\d+) { $0 < 256 or fail } /
Since closures establish a sequence point, they are guaranteed to be
called at the canonical time even if the optimizer could prove that
something after them can't match. (Anything before is fair game,
however. In particular, a closure often serves as the terminator
of a longest-token pattern.)
=item *
The general repetition specifier is now C<**> for maximal matching,
with a corresponding C<**?> for minimal matching. (All such quantifier
modifiers now go directly after the C<**>.) Space is allowed on either
side of the complete quantifier. This space is considered significant
under C<:sigspace>, and will be distributed as a call to C<< <.ws> >> between
all the elements of the match but not on either end.
The next token will determine what kind of repetition is desired:
If the next thing is an integer, then it is parsed as either an exact
count or a range:
. ** 42 # match exactly 42 times
** 3..* # match 3 or more times
This form is considered declarational.
If you supply a closure, it should return either an C<Int> or a C<Range> object.
'x' ** {$m} # exact count returned from closure
** {$m..$n} # range returned from closure
/ value was (\d **? {1..6}) with ([ \w* ]**{$m..$n}) /
It is illegal to return a list, so this easy mistake fails:
/ [foo] ** {1,3} /
The closure form is always considered procedural, so the item it is
modifying is never considered part of the longest token.
If you supply any other atom (which may be quantified), it is
interpreted as a separator (such as an infix operator), and the
initial item is quantified by the number of times the separator is
seen between items:
<alt> ** '|'            # repetition controlled by presence of character
<addend> ** <addop>     # repetition controlled by presence of subrule
<item> ** [ \!?'==' ]   # repetition controlled by presence of operator
<file>**\h+             # repetition controlled by presence of whitespace
A successful match of such a quantifier always ends "in the middle",
that is, after the initial item but before the next separator.
Therefore
/ <ident> ** ',' /
can match
foo
foo,bar
foo,bar,baz
but never
foo,
foo,bar,
It is legal for the separator to be zero-width as long as the pattern on
the left progresses on each iteration:
. ** <?same>   # match sequence of identical characters
The separator never matches independently of the next item; if the
separator matches but the next item fails, it backtracks all the way
back through the separator. Likewise, this matching of the separator
does not count as "progress" under C<:ratchet> semantics unless the
next item succeeds.
When significant space is used under C<:sigspace> with the separator
form, it applies on both sides of the separator, so
mm/<element> ** ','/
mm/<element>** ','/
mm/<element> **','/
all allow whitespace around the separator like this:
/ <element>[<.ws>','<.ws><element>]* /
while
mm/<element>**','/
excludes all significant whitespace:
/ <element>[','<element>]* /
Of course, you can always match whitespace explicitly if necessary, so to
allow whitespace after the comma but not before, you can say:
/ <element>**[','\s*] /
=item *
C<< <...> >> are now extensible metasyntax delimiters or I<assertions>
(i.e. they replace Perl 5's crufty C<(?...)> syntax).
=back
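The "ends in the middle" behaviour of the separator form of C<**>
described above can be traced with a short Python model. It is a
simplified sketch, not the Perl 6 quantifier: the function name
C<match_separated> is hypothetical, item and separator are given as
Python regex strings, and backtracking into individual items is not
modelled.

```python
import re


def match_separated(item, sep, text, pos=0):
    """Match item (sep item)* at `pos`; return the end offset or None.

    A separator is only consumed when another item follows it, so a
    successful match always ends just after an item.
    """
    item_rx, sep_rx = re.compile(item), re.compile(sep)
    first = item_rx.match(text, pos)
    if first is None:
        return None
    end = first.end()
    while True:
        s = sep_rx.match(text, end)
        if s is None:
            break
        nxt = item_rx.match(text, s.end())
        if nxt is None:
            break              # give the separator back; stop before it
        if nxt.end() == end:
            break              # no progress: avoid looping on zero-width matches
        end = nxt.end()
    return end
```

For example, matching C<\w+> separated by C<,> against C<foo,bar,>
stops at offset 7, after C<bar> and before the trailing comma, which is
why the trailing-comma strings listed above are never matched in full.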
=head1 Variable (non-)interpolation
=over
=item *
In Perl 6 regexes, variables don't interpolate.
=item *
Instead they're passed I<raw> to the regex engine, which can then decide
how to handle them (more on that below).
=item *
The default way in which the engine handles a scalar is to match it
as a C<< '...' >> literal (i.e. it does not treat the interpolated string
as a subpattern). In other words, a Perl 6:
/ $var /
is like a Perl 5:
/ \Q$var\E /
However, if C<$var> contains a C<Regex> object, instead of attempting to
convert it to a string, it is called as a subrule, as if you said
C<< <$var> >>. (See assertions below.) This form does not capture,
and it fails if C<$var> is tainted.
However, a variable used as the left side of an alias or submatch
operator is not used for matching.
$x = <.ident>
$0 ~~ <.ident>
If you do want to match C<$0> again and then use that as the submatch,
you can force the match using double quotes:
"$0" ~~ <.ident>
On the other hand, it is non-sensical to alias to something that is
not a variable:
"$0" = <.ident>     # ERROR
$0 = <.ident>       # okay
$x = <.ident>       # okay, temporary capture
$<x> = <.ident>     # okay, persistent capture
<x=.ident>          # same thing
Variables declared in capture aliases are lexically scoped to the
rest of the regex. You should not confuse this use of C<=> with
either ordinary assignment or ordinary binding. You should read
the C<=> more like the pseudoassignment of a declarator than like
normal assignment. It's more like the ordinary C<:=> operator,
since at the level regexes work, strings are immutable, so captures
are really just precomputed substr values. Nevertheless, when you
eventually use the values independently, the substr may be copied,
and then it's more like it was an assignment originally.
Capture variables of the form C<< $<ident> >> may persist beyond
the lexical scope; if the match succeeds they are remembered in the
C<Match> object's hash, with a key corresponding to the variable name's
identifier. Likewise bound numeric variables persist as C<$0>, etc.
The capture performed by C<=> creates a new lexical variable if it does
not already exist in the current lexical scope. To capture to an outer
lexical variable you must supply an C<OUTER::> as part of the name,
or perform the assignment from within a closure.
$x = [...] # capture to our own lexical $x
$OUTER::x = [...] # capture to existing lexical $x
[...] -> $tmp { let $x = $tmp } # capture to existing lexical $x
Note however that C<let> (and C<temp>) are not guaranteed to be thread
safe on shared variables, so don't do that.
=item *
An interpolated array:
/ @cmds /
is matched as if it were an alternation of its elements. Ordinarily it
matches using junctive semantics:
/ [ @cmds[0] | @cmds[1] | @cmds[2] | ... ] /
However, if it is a direct member of a C<||> list, it uses sequential
matching semantics, even if it's the only member of the list. Conveniently,
you can put C<||> before the first member of an alternation, hence
/ || @cmds /
is equivalent to
/ [ @cmds[0] || @cmds[1] || @cmds[2] || ... ] /
Of course, you can also write
/ | @cmds /
to be clear that you mean junctive semantics.
As with a scalar variable, each element is matched as a literal
unless it happens to be a C<Regex> object, in which case it is matched
as a subrule. As with scalar subrules, a tainted subrule always fails.
All string values pay attention to the current C<:ignorecase>
and C<:basechar> settings, while C<Regex> values use their own
C<:ignorecase> and C<:basechar> settings.
When you get tired of writing:
token sigil { '$' | '@' | '@@' | '%' | '&' | '::' }
you can write:
token sigil { < $ @ @@ % & :: > }
as long as you're careful to put a space after the initial angle so that
it won't be interpreted as a subrule. With the space it is parsed
like angle quotes in ordinary Perl 6 and treated as a literal array value.
=item *
Alternatively, if you predeclare a proto regex, you can write multiple
regexes for the same category, differentiated only by the symbol they
match. The symbol is specified as part of the "long name". It may also
be matched within the rule using C<< <sym> >>, like this:
proto token sigil { }
multi token sigil:sym<$>  { <sym> }
multi token sigil:sym<@>  { <sym> }
multi token sigil:sym<@@> { <sym> }
multi token sigil:sym<%>  { <sym> }
multi token sigil:sym<&>  { <sym> }
multi token sigil:sym<::> { <sym> }
(The C<multi> is optional and generally omitted with a grammar.)
This can be viewed as a form of multiple dispatch, except that it's
based on longest-token matching rather than signature matching. The
advantage of writing it this way is that it's easy to add additional
rules to the same category in a derived grammar. All of them will
be matched in parallel when you try to match C<< /<sigil>/ >>.
If there are formal parameters on multi regex methods, matching
still proceeds via longest-token rules first. If that results in a
tie, a normal multiple dispatch is made using the arguments to the
remaining variants, assuming they can be differentiated by type.
=item *
An interpolated hash provides a way of inserting various forms of
run-time table-driven submatching into a regex. An interpolated hash
matches the longest possible token (typically the longest combination
of key and value). The match fails if no entry matches. (A "" key
will match anywhere, provided no other entry takes precedence by the
longest token rule.)
In a context requiring a set of initial token patterns, the initial
token patterns are taken to be each key plus any initial token pattern
matched by the corresponding value (if the value is a string or regex).
The token patterns are considered to be canonicalized in the same way
as any surrounding context, so for instance within a case-insensitive
context the hash keys must match insensitively also.
Subsequent matching depends on the hash value:
=over 4
=item *
If the corresponding value of the hash element is a closure, it
is executed.
=item *
If the value is a string, it is matched literally, starting after where
the key left off matching. As a natural consequence, if the value is
C<"">, nothing special happens except that the key match succeeds.
=item *
If it is a C<Regex> object, it is executed as a subrule, with an
initial position I<after> the matched key. (This is further described
below under the C<< <%hash> >> notation.) As with scalar subrules,
a tainted subrule always fails, and no capture is attempted.
=item *
If the value is a number, this entry represents a "false match".
The match position is set back to before the current false match, and the
key is rematched using the same hash, but this time ignoring any keys
longer than the number. (This is measured in the default Unicode
level in effect where the hash was declared, usually graphemes. If
the current Unicode level is lower, the results are as if the string
to be matched had been upconverted to the hash's Unicode level. If
the current Unicode level is higher, the results are undefined if the
string contains any characters whose interpretation would be changed
by the higher Unicode level, such as language-dependent ligatures.)
=item *
Any other value causes the match to fail.
=back
All hash keys, and values that are strings, pay attention to the
C<:ignorecase> and C<:basechar> settings. (Subrules maintain their
own case settings.)
You may combine multiple hashes under the same longest-token
consideration by using declarative alternation:
%statement | %prefix | %term
This means that, despite being in a later hash, C<< %term<food> >>
will be selected in preference to C<< %prefix<foo> >> because it's
the longer token. However, if there is a tie, the earlier hash wins,
so C<< %statement<if> >> hides any C<< %prefix<if> >> or C<< %term<if> >>.
In contrast, if you use a procedural alternation:
[ %prefix || %term ]
a C<< %prefix<foo> >> would be selected in preference to a C<< %term<food> >>.
(Which is not what you usually want if your language is to do longest-token
consistently.)
=item *
Variable matches are considered provisionally declarative,
on the assumption that the contents of the variable will not change
frequently. If it does change, it may force recalculation of any
analysis relying on its supposed declarative nature. (If you know
this is going to happen too often, put some kind of sequence point
before the variable to disable static analysis such as the generation
of longest-token automata.)
=back
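The difference between declarative and procedural alternation over hashes can be modeled outside Perl 6. The following Python sketch only illustrates the key-selection rules described above; it is not how any Perl 6 implementation works. The function names and the sample hashes are invented for the example, and hash values, false matches, and Unicode levels are deliberately ignored.

```python
def longest_token(text, pos, *hashes):
    """Model of `%a | %b | %c`: the longest matching key wins;
    on a tie in length, the earliest hash wins."""
    best = None  # (hash index, key)
    for index, table in enumerate(hashes):
        for key in table:
            # Only a strictly longer key replaces the current best,
            # so an earlier hash keeps a tie.
            if text.startswith(key, pos) and (best is None or len(key) > len(best[1])):
                best = (index, key)
    return best


def first_hash(text, pos, *hashes):
    """Model of `[ %a || %b ]`: hashes are tried in order, and the
    first hash with any matching key wins (using its own longest key)."""
    for index, table in enumerate(hashes):
        found = longest_token(text, pos, table)
        if found is not None:
            return (index, found[1])
    return None


# Hypothetical sample hashes; the values are irrelevant to this sketch.
statement = {"if": None}
prefix = {"foo": None, "if": None}
term = {"food": None, "if": None}

print(longest_token("food", 0, statement, prefix, term))  # (2, 'food')
print(longest_token("if", 0, statement, prefix, term))    # (0, 'if')
print(first_hash("food", 0, prefix, term))                # (0, 'foo')
```

The first call picks the longer key even though it lives in the last hash, the second shows the earliest hash winning a tie, and the third shows the procedural form settling for the shorter key.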
=head1 Extensible metasyntax (C<< <...> >>)
Both C<< < >> and C<< > >> are metacharacters, and are usually (but not
always) used in matched pairs. (Some combinations of metacharacters
function as standalone tokens, and these may include angles. These are
described below.) Most assertions are considered declarative;
procedural assertions will be marked as exceptions.
For matched pairs, the first character after C<< < >> determines the
nature of the assertion:
=over
=item *
If the first character is whitespace, the angles are treated as an
ordinary "quote words" array literal.
< adam & eve > # equivalent to [ 'adam' | '&' | 'eve' ]
=item *
A leading alphabetic character means it's a capturing grammatical
assertion (i.e. a subrule or a named character class - see below):
/ <sign>? <mantissa> <exponent>? /
The first character after the identifier determines the treatment of
the rest of the text before the closing angle. The underlying semantics
is that of a function or method call, so if the first character is
a left parenthesis, it really is a call:
<foo('bar')>
If the first character after the identifier is an C<=>, then the identifier
is taken as an alias for what follows. In particular,
<foo=bar>
is just shorthand for
$<foo> = <bar>
If the first character after the identifier is whitespace, the
subsequent text (following any whitespace) is passed as a regex, so:
<foo bar>
is more or less equivalent to
<foo(/bar/)>
To pass a regex with leading whitespace you must use the parenthesized form.
If the first character is a colon, the rest of the text (following any
whitespace) is passed as a string, so the previous may also be written as:
<foo: bar>
To pass a string with leading whitespace, or to interpolate any values
into the string, you must use the parenthesized form.
No other characters are allowed after the initial identifier.
Subrule matches are considered declarative to the extent that
the front of the subrule is itself considered declarative. If a
subrule contains a sequence point, then so does the subrule match.
Longest-token matching does not proceed past such a subrule, for
instance.
=item *
The special named assertions include:
/ <before pattern> / # lookahead
/ <after pattern> / # lookbehind
/ <same> / # true between two identical characters
/ <ws> / # match "whitespace":
# \s+ if it's between two \w characters,
# \s* otherwise
/ <at($pos)> / # match only at a particular StrPos
# short for <?{ .pos === $pos }>
# (considered declarative until $pos changes)
The C<after> assertion implements lookbehind by reversing the syntax
tree and looking for things in the opposite order going to the left.
It is illegal to do lookbehind on a pattern that cannot be reversed.
Note: the effect of a forward-scanning lookbehind at the top level
can be achieved with:
/ .*? prestuff <( mainpat )> /
=item *
A leading C<.> causes a named assertion not to capture what it matches (see
L<Subrule captures>). For example:
/ <ident> <ws> / # $/<ident> and $/<ws> both captured
/ <.ident> <ws> / # only $/<ws> captured
/ <.ident> <.ws> / # nothing captured
The non-capturing behavior may be overridden with a C<:keepall>.
=item *
A leading C<$> indicates an indirect subrule. The variable must contain
either a C<Regex> object, or a string to be compiled as the regex. The
string is never matched literally.
Such an assertion is not captured. (No assertion with leading punctuation
is captured by default.) You may always capture it explicitly, of course.
A subrule is considered declarative to the extent that the front of it
is declarative, and to the extent that the variable doesn't change.
Prefix with a sequence point to defeat repeated static optimizations.
=item *
A leading C<::> indicates a symbolic indirect subrule:
/ <::($somename)> /
The variable must contain the name of a subrule. By the rules of
single method dispatch this is first searched for in the current
grammar and its ancestors. If this search fails an attempt is made
to dispatch via MMD, in which case it can find subrules defined as
multis rather than methods. This form is not captured by default.
It is always considered procedural, not declarative.
=item *
A leading C<@> matches like a bare array except that each element is
treated as a subrule (string or C<Regex> object) rather than as a literal.
That is, a string is forced to be compiled as a subrule instead of being
matched literally. (There is no difference for a C<Regex> object.)
This assertion is not automatically captured.
=item *
A leading C<%> matches like a bare hash except that a string value is
always treated as a subrule, even if it is a string that must be compiled
to a regex at match time. (Numeric values may still indicate "false match",
and a closure may do whatever it likes.)
This assertion is not automatically captured.
As with bare hash, the longest key matches according to the venerable
I<longest token rule>.
=item *
A leading C<{> indicates code that produces a regex to be interpolated
into the pattern at that point as a subrule:
/ (<ident>) <{ %cache{$0} //= get_body_for($0) }> /
The closure is guaranteed to be run at the canonical time; it declares
a sequence point, and is considered to be procedural.
=item *
A leading C<&> interpolates the return value of a subroutine call as
a regex. Hence
<&foo()>
is short for
<{ foo() }>
This is considered procedural.
=item *
In any case of regex interpolation, if the value already happens to be
a C<Regex> object, it is not recompiled. If it is a string, the compiled
form is cached with the string so that it is not recompiled next
time you use it unless the string changes. (Any external lexical
variable names must be rebound each time though.) Subrules may not be
interpolated with unbalanced bracketing. An interpolated subrule
keeps its own inner match result as a single item, so its parentheses
never count toward the outer regex's groupings. (In other words,
parenthesis numbering is always lexically scoped.)
=item *
A leading C<?{> or C<!{> indicates a code assertion:
/ (\d**1..3) <?{ $0 < 256 }> /
/ (\d**1..3) <!{ $0 < 256 }> /
Similar to:
/ (\d**1..3) { $0 < 256 or fail } /
/ (\d**1..3) { $0 < 256 and fail } /
Unlike closures, code assertions are considered declarative; they are
not guaranteed to be run at the canonical time if the optimizer can
prove something later can't match. So you can sneak in a call to a
non-canonical closure that way:
token { foo .* <?{ do { say "Got here!" } or 1 }> .* bar }
The C<do> block is unlikely to run unless the string ends with "C<bar>".
=item *
A leading C<[> indicates an enumerated character class. Ranges
in enumerated character classes are indicated with "C<..>" rather than "C<->".
/ <[a..z_]>* /
Whitespace is ignored within square brackets:
/ <[ a .. z _ ]>* /
=item *
A leading C<-> indicates a complemented character class:
/ <-[a..z_]> <-alpha> /
/ <- [a..z_]> <- alpha> / # whitespace allowed after -
This is essentially the same as using negative lookahead and dot:
/ <![a..z_]> . <!alpha> . /
Whitespace is ignored after the initial C<->.
=item *
A leading C<+> may also be supplied to indicate that the following
character class is to be matched in a positive sense.
/ <+[a..z_]>* /
/ <+[ a..z _ ]>* /
/ <+ [ a .. z _ ] >* / # whitespace allowed after +
=item *
Character classes can be combined (additively or subtractively) within
a single set of angle brackets. Whitespace is ignored. For example:
/ <[a..z] - [aeiou] + xdigit> / # consonant or hex digit
A named character class may be used by itself:
<alpha>
However, in order to combine classes you must prefix a named
character class with C<+> or C<->.
=item *
The special assertion C<< <.> >> matches any logical grapheme
(including Unicode combining character sequences):
/ seekto = <.> / # Maybe a combined char
Same as:
/ seekto = [:graphs .] /
=item *
A leading C<!> indicates a negated meaning (always a zero-width assertion):
/ <!before _ > / # We aren't before an _
Note that C<< <!alpha> >> is different from C<< <-alpha> >>.
C<< /<-alpha>/ >> is a complemented character class equivalent to
C<<< /<!before <alpha>> ./ >>>, whereas C<< <!alpha> >> is a zero-width
assertion equivalent to a C<<< /<!before <alpha>>/ >>> assertion.
Note also that as a metacharacter C<!> doesn't change the parsing
rules of whatever follows (unlike, say, C<+> or C<->).
=item *
A leading C<?> indicates a positive zero-width assertion, and like C<!>
merely reparses the rest of the assertion recursively as if the C<?>
were not there. In addition to forcing zero-width, it also suppresses
any named capture:
<alpha> # match a letter and capture to $alpha (eventually $<alpha>)
<.alpha> # match a letter, don't capture
<?alpha> # match null before a letter, don't capture
=item *
A leading C<~~> indicates a recursive call back into some or all of
the current rule. An optional argument indicates which subpattern
to re-use, and if provided must resolve to a single subpattern.
If omitted, the entire pattern is called recursively:
<~~> # call myself recursively
<~~0> # match according to $0's pattern
<~~foo> # match according to $foo's pattern
Note that this rematches the pattern associated with the name, not
the string matched. So
$_ = "foodbard"
/ ( foo | bar ) d $0 / # fails; doesn't match "foo" literally
/ ( foo | bar ) d <$0> / # fails; doesn't match /foo/ as subrule
/ ( foo | bar ) d <~~0> / # matches using rule associated with $0
The last is equivalent to
/ ( foo | bar ) d ( foo | bar) /
Note that the "self" call of
/ <term> <operator> <~~> /
calls back into this anonymous rule as a subrule, and is implicitly
anchored to the end of the operator as any other subrule would be.
Despite the fact that the outer rule scans the string, the inner
call to it does not.
Note that a consequence of the previous section is that you also get
<!~~>
for free, which fails if the current rule would match again at this location.
=back
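Two of the equivalences stated in the list above can be checked by analogy in another regex engine. The sketch below uses Python's standard C<re> module purely as an illustration; it is not Perl 6 syntax, and the variable names are invented for the example. It contrasts matching the captured I<string> again (a backreference) with matching the captured I<pattern> again, and it compares a complemented character class with a negative lookahead followed by "any character".

```python
import re

text = "foodbard"

# A backreference must repeat the *string* that group 1 captured ("foo"),
# so it cannot match the "bar" that follows the "d".
backref = re.search(r"(foo|bar)d\1", text)

# Re-using the *pattern* of group 1 is written out by hand here, because
# Python's re module has no syntax for calling a group as a subpattern.
reuse = re.search(r"(foo|bar)d(foo|bar)", text)

print(backref)         # None
print(reuse.group(0))  # foodbar

# A complemented class matches the same characters as a negative
# lookahead followed by "any character" (DOTALL so that "." covers "\n").
sample = "ab_C9!\n"
complemented = re.findall(r"[^a-z_]", sample)
lookahead = re.findall(r"(?![a-z_]).", sample, flags=re.DOTALL)
print(complemented == lookahead)  # True
```

The hand-expanded second pattern corresponds to the "equivalent to" form given for the recursive call on the first capture.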
The following tokens include angles but are not required to balance:
=over
=item *
A C<< <( >> token indicates the start of a result capture, while the
corresponding C<< )> >> token indicates its endpoint. When matched,
these behave as assertions that are always true, but have the side
effect of setting the C<.from> and C<.to> attributes of the match
object. That is:
/ foo <( \d+ )> bar /
is equivalent to:
/ <after foo> \d+ <before bar> /
except that the scan for "C<foo>" can be done in the forward direction,
while a lookbehind assertion would presumably scan for C<\d+> and then
match "C<foo>" backwards. The use of C<< <(...)> >> affects only the
meaning of the I<result object> and the positions of the beginning and
ending of the match. That is, after the match above, C<$()> contains
only the digits matched, and C<.pos> is pointing to after the digits.
Other captures (named or numbered) are unaffected and may be accessed
through C<$/>.
These tokens are considered declarative, but may force backtracking behavior.
=item *
A C<«> or C<<< << >>> token indicates a left word boundary. A C<»> or
C<<< >> >>> token indicates a right word boundary. (As separate tokens,
these need not be balanced.) Perl 5's C<\b> is replaced by a C<< <?wb> >>
"word boundary" assertion, while C<\B> becomes C<< <!wb> >>. (None of
these are dependent on the definition of C<< <ws> >>, but only on the C<\w>
definition of "word" characters.)
=back
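The relationship between a result capture and the lookbehind/lookahead form can be imitated in Python's standard C<re> module. This is only an analogy under stated assumptions: Python has no result-capture tokens, so an ordinary capture group stands in for them, and the sample text and variable names are invented for the illustration.

```python
import re

text = "xx foo123bar yy"

# Capture-style: match the whole context, then report only group 1.
framed = re.search(r"foo(\d+)bar", text)

# Lookaround-style: the context is asserted but never consumed.
lookaround = re.search(r"(?<=foo)\d+(?=bar)", text)

print(framed.group(1), framed.span(1))          # 123 (6, 9)
print(lookaround.group(0), lookaround.span(0))  # 123 (6, 9)
```

Both forms report the same digits and the same start and end offsets. One difference remains in this analogy: the overall Python match in the first form still ends after "bar", whereas the tokens described above also move the reported end of the match.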
=head1 Backslash reform
=over
=item *
The C<\p> and C<\P> properties become intrinsic grammar rules such as
(C<< <alpha> >> and C<< <-alpha> >>). They may be combined using the
above-mentioned character class notation: C<< <[_]+alpha+digit> >>.
Regardless of the higher-level character class names, low-level
Unicode properties are always available with a prefix of C<is>.
Hence, C<< <+isLu+isLt> >> is equivalent to C<< <+upper+title> >>.
If you define your own "is" properties they hide any Unicode properties
of the same name.
=item *
The C<\L...\E>, C<\U...\E>, and C<\Q...\E> sequences are gone. In the
rare cases that need them you can use C<< <{ lc $regex }> >> etc.
=item *
The C<\G> sequence is gone. Use C<:p> instead. (Note, however,
that it makes no sense to use C<:p> within a pattern, since every
internal pattern is implicitly anchored to the current position.)
See the C<at> assertion below.
=item *
Backreferences (e.g. C<\1>, C<\2>, etc.) are gone; C<$0>, C<$1>, etc. can be
used instead, because variables are no longer interpolated.
Numeric variables are assumed to change every time and therefore are
considered procedural, unlike normal variables.
=item *
New backslash sequences, C<\h> and C<\v>, match horizontal and vertical
whitespace respectively, including Unicode.
=item *
C<\s> now matches any Unicode whitespace character.
=item *
The new backslash sequence C<\N> matches anything except a logical
newline; it is the negation of C<\n>.
=item *
A series of other new capital backslash sequences are also the negation
of their lower-case counterparts:
=over
=item *
C<\H> matches anything but horizontal whitespace.
=item *
C<\V> matches anything but vertical whitespace.
=item *
C<\T> matches anything but a tab.
=item *
C<\R> matches anything but a return.
=item *
C<\F> matches anything but a formfeed.
=item *
C<\E> matches anything but an escape.
=item *
C<\X...> matches anything but the specified character (specified in
hexadecimal).
=back
=back
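The horizontal/vertical split and its capitalized negations can be approximated in Python's standard C<re> module, which has no C<\h>, C<\H>, or C<\V> escapes and uses C<\v> only for the vertical tab. The character lists below are an assumption borrowed from common definitions of horizontal and vertical whitespace; the synopsis itself does not enumerate them, and all names are invented for this sketch.

```python
import re

# Assumed character lists (not taken from the synopsis).
HORIZONTAL = " \t\u00a0\u1680\u2000-\u200a\u202f\u205f\u3000"
VERTICAL = "\n\x0b\x0c\r\x85\u2028\u2029"

h = re.compile(f"[{HORIZONTAL}]")       # stands in for \h
v = re.compile(f"[{VERTICAL}]")         # stands in for \v
big_h = re.compile(f"[^{HORIZONTAL}]")  # stands in for \H
big_v = re.compile(f"[^{VERTICAL}]")    # stands in for \V

sample = "a \tb\r\nc\u2028"
print(h.findall(sample))        # [' ', '\t']
print(len(v.findall(sample)))   # 3
print(big_h.findall("a b"))     # ['a', 'b']
print(big_v.findall("a\nb"))    # ['a', 'b']
```

Each capitalized form is simply the complemented class of its lowercase counterpart, which is the pattern the list above follows for the other sequences as well.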
=head1 Regexes are now first-class language, not strings
=over
=item *
The Perl 5 C<qr/pattern/> regex constructor is gone.
=item *
The Perl 6 equivalents are:
regex { pattern } # always takes {...} as delimiters
rx / pattern / # can take (almost any) chars as delimiters
You may not use whitespace or alphanumerics for delimiters. Space is
optional unless needed to distinguish from modifier arguments or
function parens. So you may use parens as your C<rx> delimiters,
but only if you interpose whitespace:
rx ( pattern ) # okay
rx( 1,2,3 ) # tries to call rx function
(This is true for all quotelike constructs in Perl 6.)
=item *
If either form needs modifiers, they go before the opening delimiter:
$regex = regex :g:s:i { my name is (.*) };
$regex = rx:g:s:i / my name is (.*) /; # same thing
Space is necessary after the final modifier if you use any
bracketing character for the delimiter. (Otherwise it would be taken as
an argument to the modifier.)
=item *
You may not use colons for the delimiter. Space is allowed between
modifiers:
$regex = rx :g :s :i / my name is (.*) /;
=item *
The name of the constructor was changed from C<qr> because it's no
longer an interpolating quote-like operator. C<rx> is short for I<regex>