This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them regexes rather than "regular expressions" because they haven't been regular expressions for a long time, and we think the popular term "regex" is in the process of becoming a technical term with a precise meaning of: "something you do pattern matching with, kinda like a regular expression". On the other hand, one of the purposes of the redesign is to make portions of our patterns more amenable to analysis under traditional regular expression and parser semantics, and that involves making careful distinctions between which parts of our patterns and grammars are to be treated as declarative, and which parts as procedural.

In any case, when referring to recursive patterns within a grammar, the terms rule and token are generally preferred over regex.

In essence, Perl 6 natively implements Parsing Expression Grammars (PEGs) as an extension of regular expression notation. PEGs require that you provide a "pecking order" for ambiguous parses. Perl 6's pecking order is determined by a multi-level tie-breaking test:

1) Most-derived only/proto hides less-derived of the same name
2) Longest token matching: food\s+ beats foo by 2 or more positions
3) Longest literal prefix: food\w* beats foo\w* by 1 position
4) For a given proto, multis from a more-derived grammar win
5) Within a given compilation unit, the earlier alternative or multi wins

Tiebreaker #3 will treat any initial sequence of literals as the longest literal prefix. If there is an alternation embedded in the longest token matching, those alternations can extend a literal prefix provided everything was literal up to the alternation. If all of the alternations are totally literal, then the literal can also extend beyond the end of the alternation when they rejoin. Otherwise the end of the alternation terminates all longest literal prefixes, even the branches that are totally literal. For example:
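The following illustrative patterns (not from the original text) sketch how alternations interact with the longest literal prefix:

```raku
# All branches literal: the prefix extends through and past the
# alternation, yielding two literal candidates, 'foobarx' and 'foobazx'.
/ foo [ bar | baz ] x /

# One branch is non-literal (\w+): the alternation terminates every
# longest literal prefix, so only 'foo' counts as literal -- even for
# the 'bar' branch, which is itself fully literal.
/ foo [ bar | \w+ ] x /
```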

Note that in this case, a character class is not treated the same as an alternation. All character classes are considered too generic to include in a longest literal string.

As with longest token matching, longest literal prefixes are transitive through subrules. If a subrule is a protoregex, it is treated just as alternation with | is, and follows the same rules about extending or terminating the longest literal prefix.

In addition to this pecking order, if any rule chosen under the pecking order backtracks, the next best rule is chosen. That is, the pecking order determines a candidate list; just because one candidate is chosen does not mean the rest are thrown away. They may, however, be explicitly thrown away by an appropriate backtracking control (sometimes called a "cut" operator, but Perl 6 has several of them, depending on how much you want to cut).

Also, any rule chosen to execute under #1 may choose to delegate to its ancestors; PEG backtracking has no control over this.

The underlying match object is now available via the $/ variable, which is implicitly lexically scoped. All user access to the most recent match is through this variable, even when it doesn't look like it. The individual capture variables (such as $0, $1, etc.) are just elements of $/.

By the way, unlike in Perl 5, the numbered capture variables now start at $0 instead of $1. See below.

In order to detect accidental use of Perl 5's unrelated $/ variable, Perl 6's $/ variable may not be assigned to directly.
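A small illustrative sketch (hypothetical example) of $/ and the zero-based capture variables:

```raku
if 'ab12' ~~ / (\w+?) (\d+) / {
    say $/;    # the whole Match object
    say $0;    # first capture, 'ab' -- numbering starts at $0, not $1
    say $1;    # second capture, '12'
}
```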

While the syntax of | does not change, the default semantics do change slightly. We are attempting to concoct a pleasing mixture of declarative and procedural matching so that we can have the best of both. In short, you need not write your own tokener for a grammar because Perl will write one for you. See the section below on "Longest-token matching".

Unlike traditional regular expressions, Perl 6 does not require you to memorize an arbitrary list of metacharacters. Instead it classifies characters by a simple rule. All glyphs (graphemes) whose base characters are either the underscore (_) or have a Unicode classification beginning with 'L' (i.e. letters) or 'N' (i.e. numbers) are always literal (i.e. self-matching) in regexes. They must be escaped with a \ to make them metasyntactic (in which case that single alphanumeric character is itself metasyntactic, but any immediately following alphanumeric character is not).

All other glyphs--including whitespace--are exactly the opposite: they are always considered metasyntactic (i.e. non-self-matching) and must be escaped or quoted to make them literal. As is traditional, they may be individually escaped with \, but in Perl 6 they may be also quoted as follows.

Sequences of one or more glyphs of either type (i.e. any glyphs at all) may be made literal by placing them inside single quotes. (Double quotes are also allowed, with the same interpolative semantics as the current language in which the regex is lexically embedded.) Quotes create a quantifiable atom: moose* quantifies only the final e, matching "moos", "moose", "moosee", and so on, while 'moose'* quantifies the quoted string as a unit, matching "", "moose", "moosemoose", and so on.

In other words, identifier glyphs are literal (or metasyntactic when escaped), non-identifier glyphs are metasyntactic (or literal when escaped), and single quotes make everything inside them literal.

Note, however, that not all non-identifier glyphs are currently meaningful as metasyntax in Perl 6 regexes (e.g. \1\_-!). It is more accurate to say that all unescaped non-identifier glyphs are potential metasyntax, and reserved for future use. If you use such a sequence, a helpful compile-time error is issued indicating that you either need to quote the sequence or define a new operator to recognize it.

There are no /s or /m modifiers (changes to the meta-characters replace them - see below).

There is no /e evaluation modifier on substitutions; instead use:

s/pattern/{ doit() }/

or:

s[pattern] = doit()

Instead of /ee say:

s/pattern/{ EVAL doit() }/

or:

s[pattern] = doit().EVAL

Modifiers are now placed as adverbs at the start of a match/substitution:

m:g:i/\s* (\w*) \s* ,?/;

Every modifier must start with its own colon. The delimiter must be separated from the final modifier by whitespace if it would otherwise be taken as an argument to the preceding modifier (which is true if and only if the next character is a left parenthesis.)

The single-character modifiers also have longer versions:

:i :ignorecase
:m :ignoremark
:g :global
:r :ratchet

The :i (or :ignorecase) modifier causes case distinctions to be ignored in its lexical scope, but not in its dynamic scope. That is, subrules always use their own case settings. The amount of case folding depends on the current context. In byte and codepoint mode, level 1 case folding is required (as defined in TR18 section 2.4). In grapheme mode level 2 is required.

The :ii (or :samecase) variant may be used on a substitution to change the substituted string to the same case pattern as the matched string. It implies the same pattern semantics as :i above, so it is not necessary to put both :i and :ii.
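For instance (an illustrative sketch, assuming an implementation that supports :i and :ii):

```raku
say so 'FOO' ~~ m:i/foo/;    # True: case is ignored in the pattern's lexical scope

my $s = 'Foo bar';
$s ~~ s:ii/foo/pod/;         # :samecase copies the case pattern of the match,
say $s;                      # so the replacement should come out as 'Pod bar'
```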

If the pattern is matched without the :sigspace modifier, case info is carried across on a character by character basis. If the right string is longer than the left one, the case of the final character is replicated. Titlecase is carried across if possible regardless of whether the resulting letter is at the beginning of a word or not; if there is no titlecase character available, the corresponding uppercase character is used. (This policy can be modified within a lexical scope by a language-dependent Unicode declaration to substitute titlecase according to the orthographic rules of the specified language.) Characters that carry no case information leave their corresponding replacement character unchanged.

If the pattern is matched with :sigspace, then a slightly smarter algorithm is used which attempts to determine if there is a uniform capitalization policy over each matched word, and applies the same policy to each replacement word. If there doesn't seem to be a uniform policy on the left, the policy for each word is carried over word by word, with the last pattern word replicated if necessary. If a word does not appear to have a recognizable policy, the replacement word is translated character for character as in the non-sigspace case. Recognized policies include:

In any case, only the officially matched string part of the pattern match counts, so any sort of lookahead or contextual matching is not included in the analysis.

The :m (or :ignoremark) modifier scopes exactly like :ignorecase except that it ignores marks (accents and such) instead of case. It is equivalent to taking each grapheme (in both target and pattern), converting both to NFD (maximally decomposed) and then comparing the two base characters (Unicode non-mark characters) while ignoring any trailing mark characters. The mark characters are ignored only for the purpose of determining the truth of the assertion; the actual text matched includes all ignored characters, including any that follow the final base character.

The :mm (or :samemark) variant may be used on a substitution to change the substituted string to the same mark/accent pattern as the matched string. It implies the same pattern semantics as :m above, so it is not necessary to put both :m and :mm.

Mark info is carried across on a character by character basis. If the right string is longer than the left one, the remaining characters are substituted without any modification. (Note that NFD/NFC distinctions are usually immaterial, since Perl encapsulates that in grapheme mode.) Under :sigspace the preceding rules are applied word by word.

The :c (or :continue) modifier causes the pattern to continue scanning from the specified position:

m:c($pos)/ pattern /

If the argument is omitted, it defaults to ($/ ?? $/.to !! 0). (Unlike in Perl 5, the string itself has no clue where its last match ended.) All subrule matches are implicitly passed their starting position. Likewise, the pattern you supply to a Perl macro's is parsed trait has an implicit :p modifier.

Note that

m:c($p)/pattern/

is roughly equivalent to

m:p($p)/.*? <( pattern )> /
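A sketch of :c in action (hypothetical example):

```raku
my $str = 'aaa bbb ccc';
$str ~~ m/ \w+ /;       # matches 'aaa'; $/.to now points just past it
$str ~~ m:c/ \w+ /;     # :c resumes scanning at $/.to, so this matches 'bbb'
```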

All of :g, :ov, :nth, and :x are incompatible with :p and will fail, recommending use of :c instead. The :ex modifier is allowed but will produce only matches at that position.

The new :s (:sigspace) modifier causes certain whitespace sequences to be considered "significant"; they are replaced by a whitespace matching rule, <.ws>. Only whitespace sequences immediately following a matching construct (atom, quantified atom, or assertion) are eligible. Initial whitespace is ignored at the front of any regex, to make it easy to write rules that can participate in longest-token-matching alternations. Trailing space inside the regex delimiters is significant.

The default <.ws> subrule can't decide what to do until it sees the data, but it generally does the right thing. If it doesn't, define your own ws rule, and :sigspace will use that instead.
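For example (an illustrative sketch):

```raku
say so 'hello world' ~~ m:s/ \w+ \w+ /;   # True: the gap between atoms becomes <.ws>
say so 'helloworld'  ~~ m:s/ \w+ \w+ /;   # False: the default ws refuses to match
                                          # in the middle of a word
```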

Whitespace is ignored not just at the front of any rule that might participate in longest-token matching, but in the front of any alternative within an explicit alternation as well, for the same reason. If you want to match sigspace before a set of alternatives, place your whitespace outside of the brackets containing the alternation.

When you write

rule TOP { ^ <stuff> $ }

this is the same as

token TOP { ^ <.ws> <stuff> <.ws> $ <.ws> }

but note that the final <.ws> always matches the null string, since $ asserts end of string. Also, if your TOP rule does not anchor with ^, it might not match initial whitespace.

Specifically, the following constructs turn following whitespace into sigspace:

When we say sigspace can follow either an atom or a quantified atom, we mean that it can come between an atom and its quantifier:

ms/ <atom> * / # means / [<atom><.ws>]* /

(If each atom matches whitespace, then it doesn't need to match after the quantifier.)

In general you don't need to use :sigspace within grammars because the parser rules automatically handle whitespace policy for you. In this context, whitespace often includes comments, depending on how the grammar chooses to define its whitespace rule. Although the default <.ws> subrule recognizes no comment construct, any grammar is free to override the rule. The <.ws> rule is not intended to mean the same thing everywhere.

It's also possible to pass an argument to :sigspace specifying a completely different subrule to apply. This can be any rule, it doesn't have to match whitespace. When discussing this modifier, it is important to distinguish the significant whitespace in the pattern from the "whitespace" being matched, so we'll call the pattern's whitespace sigspace, and generally reserve whitespace to indicate whatever <.ws> matches in the current grammar. The correspondence between sigspace and whitespace is primarily metaphorical, which is why the correspondence is both useful and (potentially) confusing.

The :ss (or :samespace) variant may be used on substitutions to do smart space mapping in addition to smart space matching. (That is, :ss implies :s.) For each sigspace-induced call to <ws> on the left, the matched whitespace is copied over to the corresponding slot on the right, as represented by a single whitespace character in the replacement string wherever space replacement is desired. If there are more whitespace slots on the right than the left, those righthand characters remain themselves. If there are not enough whitespace slots on the right to map all the available whitespace slots from the match, the algorithm tries to minimize information loss by randomly splicing "common" whitespace characters out of the list of whitespace. From least valuable to most, the pecking order is:

spaces
tabs
all other horizontal whitespace, including Unicode
newlines (including crlf as a unit)
all other vertical whitespace, including Unicode

The primary intent of these rules is to minimize format disruption when substitution happens across line boundaries and such. There is, of course, no guarantee that the result will be exactly what a human would do.

The :s modifier is considered sufficiently important that match variants are defined for it: ms/.../ is short for m:sigspace/.../, and ss/.../.../ is short for s:samespace/.../.../.

There are corresponding pragmas to default to these levels. Note that the :chars modifier is always redundant because dot always matches characters at the highest level allowed in scope. This highest level may be identical to one of the other three levels, or it may be more specific than :graphs when a particular language's character rules are in use. Note that you may not specify language-dependent character processing without specifying which language you're depending on. [Conjecture: the :chars modifier could take an argument specifying which language's rules to use for this match.]

The new :Perl5/:P5 modifier allows Perl 5 regex syntax to be used instead. (It does not go so far as to allow you to put your modifiers at the end.) For instance,

m:P5/(?mi)^(?:[a-z]|\d){1,2}(?=\s)/

is equivalent to the Perl 6 syntax:

m/ :i ^^ [ <[a..z]> || \d ] ** 1..2 <?before \s> /

Any integer modifier specifies a count. What kind of count is determined by the character that follows.

If followed by an x, it means repetition: :4x is shorthand for the general form :x(4).

The argument to :nth is allowed to be a list of integers, but such a list should be monotonically increasing. (Values which are less than or equal to any previous value will be ignored.) So:

:nth(2,4,6...*) # return only even matches
:nth(1,1,*+*...*) # match only at 1,2,3,5,8,13...

This option is no longer required to support smartmatching. You can grep a list of integers if you really need that capability:

:nth(grep *.oracle, 1..*)

If both :nth and :x are present, the matching routine looks for submatches that match with :nth. If the number of post-nth matches is compatible with the constraint in :x, the whole match succeeds with the highest possible number of submatches. The combination of :nth and :x typically only makes sense if :nth is not a single scalar.

With the new :ov (:overlap) modifier, the current regex will match at all possible character positions (including overlapping) and return all matches in list context, or a disjunction of matches in item context. The first match at any position is returned.

The matches are guaranteed to be returned in left-to-right order with respect to the starting positions. The order within each starting position is not guaranteed and may depend on the nature of both the pattern and the matching engine. (Conjecture: or we could enforce backtracking engine semantics. Or we could guarantee no order at all unless the pattern starts with "::" or some such to suppress DFAish solutions.)

Note that the ~~ above can return as soon as the first match is found, and the rest of the matches may be performed lazily by @().

The new :rw modifier causes this regex to claim the current string for modification rather than assuming copy-on-write semantics. All the captures in $/ become lvalues into the string, such that if you modify, say, $1, the original string is modified in that location, and the positions of all the other fields modified accordingly (whatever that means). In the absence of this modifier (especially if it isn't implemented yet, or is never implemented), all pieces of $/ are considered copy-on-write, if not read-only.

[Conjecture: this should really associate a pattern with a string variable, not a (presumably immutable) string value.]

The new :r or :ratchet modifier causes this regex to not backtrack by default. (Generally you do not use this modifier directly, since it's implied by token and rule declarations.) The effect of this modifier is to imply a : after every atom, including but not limited to *, +, and ? quantifiers, as well as alternations. Explicit backtracking modifiers on quantified atoms, such as **, will override this. (Note: for portions of patterns subject to longest-token analysis, a : is ignored in any case, since there will be no backtracking necessary.)
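The classic illustration (a sketch, not from the original text):

```raku
say so 'aaa' ~~ m/ a+ a /;     # True: a+ backtracks to give up one 'a'
say so 'aaa' ~~ m:r/ a+ a /;   # False: the ratcheted a+ keeps all three,
                               # and backtracking into it is disabled
```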

User-defined modifiers can also take arguments, but only in parentheses:

m:fuzzy('bare')/pattern/;

To use parens for your delimiters you have to separate:

m:fuzzy (pattern);

or you'll end up with:

m:fuzzy(fuzzyargs); pattern ;

Any grammar regex is really just a kind of method, and you may declare variables in such a routine using a colon followed by any scope declarator parsed by the Perl 6 grammar, including my, our, state, and constant. (As quasi declarators, temp and let are also recognized.) A single statement (up through a terminating semicolon or line-final closing brace) is parsed as normal Perl 6 code:
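A sketch of such a declaration (the token name and body are hypothetical):

```raku
token digits {
    :my $count = 0;                  # lexically scoped to the rest of this token
    [ \d { $count++ } ]+
    { say "matched $count digits" }  # a closure can see the declared variable
}
```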

Not all modifiers are allowed in all of the places where modifiers can occur.

In general, a modifier that affects the compilation of a regex (like :i) must be known at compile time. A modifier that affects only the calling behaviour, and not the regex itself (eg. :pos, :overlap, :x(4)) may only appear on constructs that involve a call (like m// and s///), and not on rx//. Finally overlapping is disallowed on substitutions, while adverbs that affect modifications (eg. :samecase) are only allowed on substitutions.

These principles result in the following rules:

The :ignorecase, :ignoremark, :sigspace, :ratchet and :Perl5 modifiers and their short forms are allowed everywhere: inside a regex, and on m//, rx// and s/// constructs. An implementation may require that their value is known at compile time, and give a compile-time error message if that is not the case.

A dot . now matches any character including newline. (The /s modifier is gone.)

^ and $ now always match the start/end of a string, like the old \A and \z. (The /m modifier is gone.) On the right side of an embedded ~~ or !~~ operator they always match the start/end of the indicated submatch because that submatch is logically being treated as a separate string.

A $ no longer matches an optional preceding \n so it's necessary to say \n?$ if that's what you mean.

\n now matches a logical (platform independent) newline, not just \x0a. See TR18 section 1.6 for a list of logical newlines.

An unquoted # now always introduces a comment. If followed by a backtick and an opening bracket character, it introduces an embedded comment that terminates with the closing bracket. Otherwise the comment terminates at the newline.

Whitespace is now always metasyntactic, i.e. used only for layout and not matched literally (but see the :sigspace modifier described above).

^^ and $$ match line beginnings and endings. (The /m modifier is gone.) They are both zero-width assertions. $$ matches before any \n (logical newline), and also at the end of the string if the final character was not a \n. ^^ always matches the beginning of the string and after any \n that is not the final character in the string.

The new & metacharacter separates conjunctive terms. The patterns on either side must match with the same beginning and end point. Note: if you don't want your two terms to end at the same point, then you really want to use a lookahead instead.

As with the disjunctions | and ||, conjunctions come in both & and && forms. The & form is considered declarative rather than procedural; it allows the compiler and/or the run-time system to decide which parts to evaluate first, and it is erroneous to assume either order happens consistently. The && form guarantees left-to-right order, and backtracking makes the right argument vary faster than the left. In other words, && and || establish sequence points. The left side may be backtracked into when backtracking is allowed into the construct as a whole.

The & operator is list associative like |, but has slightly tighter precedence. Likewise && has slightly tighter precedence than ||. As with the normal junctional and short-circuit operators, & and | are both tighter than && and ||.
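For instance (illustrative, assuming an implementation of regex conjunction):

```raku
# Both conjuncts must match the same substring: a lowercase word
# that also starts with 'c'.
say so 'cat' ~~ / ^ [ <[a..z]>+ & c.* ] $ /;   # True
say so 'dog' ~~ / ^ [ <[a..z]>+ & c.* ] $ /;   # False: c.* fails at the start
```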

The ~~ and !~~ operators cause a submatch to be performed on whatever was matched by the variable or atom on the left. String anchors consider that submatch to be the entire string. So, for instance, you can ask to match any identifier that does not contain the word "moose":

<ident> !~~ 'moose'

In contrast

<ident> !~~ ^ 'moose' $

would allow any identifier (including any identifier containing "moose" as a substring) as long as the identifier as a whole is not equal to "moose". (Note the anchors, which attach the submatch to the beginning and end of the identifier as if that were the entire match.) When used as part of a longer match, for clarity it might be good to use extra brackets:

[ <ident> !~~ ^ 'moose' $ ]

The precedence of ~~ and !~~ fits in between the junctional and sequential versions of the logical operators just as it does in normal Perl expressions (see S03). Hence

<ident> !~~ 'moose' | 'squirrel'

parses as

<ident> !~~ [ 'moose' | 'squirrel' ]

while

<ident> !~~ 'moose' || 'squirrel'

parses as

[ <ident> !~~ 'moose' ] || 'squirrel'

The ~ operator is a helper for matching nested subrules with a specific terminator as the goal. It is designed to be placed between an opening and closing bracket, like so:

'(' ~ ')' <expression>

However, it mostly ignores the left argument, and operates on the next two atoms (which may be quantified). Its operation on those next two atoms is to "twiddle" them so that they are actually matched in reverse order. Hence the expression above, at first blush, is merely shorthand for:

'(' <expression> ')'

But beyond that, when it rewrites the atoms it also inserts the apparatus that will set up the inner expression to recognize the terminator, and to produce an appropriate error message if the inner expression does not terminate on the required closing atom. So it really does pay attention to the left bracket as well, and it actually rewrites our example to something more like:

$<OPEN> = '(' <SETGOAL: ')'> <expression> [ $GOAL || <FAILGOAL> ]

Note that you can use this construct to set up expectations for a closing construct even when there's no opening bracket:

<?> ~ ')' \d+

Here <?> returns true on the first null string.

By default the error message uses the name of the current rule as an indicator of the abstract goal of the parser at that point. However, often this is not terribly informative, especially when rules are named according to an internal scheme that will not make sense to the user. The :dba("doing business as") adverb may be used to set up a more informative name for what the following code is trying to parse.

{...} is no longer a repetition quantifier. It now delimits an embedded closure. It is always considered procedural rather than declarative; it establishes a sequence point between what comes before and what comes after. (To avoid this use the <?{...}> assertion syntax instead.) A closure within a regex establishes its own lexical scope.
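For example (a reconstructed sketch; the original example may differ, and Remainder is a hypothetical subrule), a closure can call make to set the "abstract result" of the match:

```raku
/ (\d+) { make sqrt($0) } <Remainder> /
```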

This has the effect of capturing the square root of the numified string, instead of the string. The Remainder part is matched and returned as part of the Match object but is not returned as part of the abstract object. Since the abstract object usually represents the top node of an abstract syntax tree, the abstract object may be extracted from the Match object by use of the .made method.

A second call to make overrides any previous call to make. make is also available as a method on each match object.

Within a closure, the instantaneous position within the search is denoted by the $¢.pos method. As with all string positions, you must not treat it as a number unless you are very careful about which units you are dealing with.

The Cursor object can also return the original item that we are matching against; this is available from the .orig method.

The closure is also guaranteed to start with a $/ Match object representing the match so far. However, if the closure does its own internal matching, its $/ variable will be rebound to the result of that match until the end of the embedded closure. (The match will actually continue with the current value of the $¢ object after the closure; $/ and $¢ just start out the same in your closure.)

It can affect the match if it calls fail:

/ (\d+) { $0 < 256 or fail } /

Since closures establish a sequence point, they are guaranteed to be called at the canonical time even if the optimizer could prove that something after them can't match. (Anything before is fair game, however. In particular, a closure often serves as the terminator of a longest-token pattern.)

The general repetition specifier is now ** for greedy matching, with a corresponding **? for frugal matching. (All such quantifier modifiers now go directly after the **.) Space is allowed on either side of the complete quantifier, but only the space before the ** will be considered significant under :sigspace and match between repetitions. (Sigspace after the entire construct matches once after all the repetitions are found.)

The closure form is always considered procedural, so the item it is modifying is never considered part of the longest token.

For backwards compatibility with previous versions of Perl 6, if the token following ** is not a closure or literal integer, it is interpreted as +% with a warning:

/ x ** y / # same as / x+ % y /
/ x ** $y / # same as / x [$y x]* /

No check is made to see if $y contains an integer or range value. This compatibility feature is not guaranteed to exist forever.

Negative range values are allowed, but only when modifying a reversible pattern such as dot, which can match in either direction. For example, to search the surrounding 200 characters as defined by 'dot', you could say:

/ . ** -100..100 <element> /

Similarly, you can back up 50 characters with:

/ . ** -50 <element> /

[Conjecture: A negative quantifier forces the construct to be considered procedural rather than declarational.]

Any quantified atom may be modified by an additional constraint that specifies the separator to look for between repeats of the left side. This is indicated by use of a % between the quantifier and the separator. The initial item is iterated only as long as the separator is seen between items:
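For example (a small sketch):

```raku
say so 'a,b,c' ~~ / ^ <ident>+ % ',' $ /;   # True: identifiers separated by commas
say so 'a,b,'  ~~ / ^ <ident>+ % ',' $ /;   # False: may not end with a separator
```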

The % modifier may only be used on a quantifier; any attempt to use it on a bare term will result in a parse error (to minimize possible confusion with any hash notations we choose to support in Perl 6 regexes).

A successful match of a % construct generally ends "in the middle" at the %, that is, after the initial item but before the next separator. Therefore

/ <ident>+ % ',' /

can match

foo
foo,bar
foo,bar,baz

but never

foo,
foo,bar,

The only time such a match doesn't end in the middle is if the left side can match 0 times (and does so), in which case the whole construct matches the null string.

'' ~~ / <ident>* % ',' / # matches because of the *

If you wish to allow the match to end after either side, use %% instead. Therefore

/ <ident>+ %% ',' /

can match any of

foo
foo,
foo,bar
foo,bar,
foo,bar,baz
foo,bar,baz,

If you wish to quantify each match on the left individually, you must place it in brackets:

[<a>*]+ % ','

It is legal for the separator to be zero-width as long as the pattern on the left progresses on each iteration:

.+ % <?same> # match sequence of identical characters

The separator never matches independently of the next item; if the separator matches but the next item fails, it backtracks all the way back through the separator. Likewise, this matching of the separator does not count as "progress" under :ratchet semantics unless the next item succeeds.

When significant space is used under :sigspace, each matching element enables the immediately following whitespace to be considered significant. Space after the % does nothing. If you write:

ms/ <element> + % ',' /
#  1         2 3 4   5

it ignores whitespace #1 and #4, and rewrites the rest to:

/ [ <element> <.ws> ]+ % [ ',' <.ws> ] <.ws> /
#             2                5       3

Since #3 is redundant with #2 (because + requires an element), it suffices to supply either #2 or #3.

Note that with a * instead of a +, space #3 would not be redundant with #2, since if 0 elements are matched, the space associated with it (#2) is not matched. In that case it makes sense to put space on both sides of the *.

When matching against a Stringy type that is not Str, the variable must be interpretable as a value of that Stringy type (or a related type that can be coerced to that type). For example, when regex matching a Buf type, the variable will be matched under the Buf type's semantics, not Str semantics.

[Conjecture: when we allow matching against non-string types, doing a type match on the current node will require the syntax of an embedded signature, not just a bare variable, so there is no need to account for a variable containing a type object, which is by definition undefined, and hence fails to match by the above rule.]

However, a variable used as the left side of an alias or submatch operator is not used for matching.

$x = <.ident>
$0 ~~ <.ident>

If you do want to match $0 again and then use that as the submatch, you can force the match using double quotes:

"$0" ~~ <.ident>

On the other hand, it is nonsensical to alias to something that is not a variable.

Variables declared in capture aliases are lexically scoped to the rest of the regex. You should not confuse this use of = with either ordinary assignment or ordinary binding. You should read the = more like the pseudoassignment of a declarator than like normal assignment. It's more like the ordinary := operator, since at the level regexes work, strings are immutable, so captures are really just precomputed substr values. Nevertheless, when you eventually use the values independently, the substr may be copied, and then it's more like it was an assignment originally.

Capture variables of the form $<ident> may persist beyond the lexical scope; if the match succeeds they are remembered in the Match object's hash, with a key corresponding to the variable name's identifier. Likewise bound numeric variables persist as $0, etc.

You may capture to existing lexical variables; such variables may already be visible from an outer scope, or may be declared within the regex via a :my declaration.
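For instance (a sketch with hypothetical names, using the :my form described above), a later part of the pattern can match against a variable bound earlier:

```raku
# match an XML-ish element whose closing tag must equal its opening tag
/ '<' :my $tag; $<name>=[\w+] { $tag = ~$<name> } '>' .*? '</' $tag '>' /
```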

An interpolated array is matched as if it were an alternation of its literal elements. Ordinarily it matches using junctive semantics:

/ [ $(@cmds[0]) | $(@cmds[1]) | $(@cmds[2]) | ... ] /

However, if it is a direct member of a || list, it uses sequential matching semantics, even if it's the only member of the list. Conveniently, you can put || before the first member of an alternation, hence

/ || @cmds /

is equivalent to

/ [ $(@cmds[0]) || $(@cmds[1]) || $(@cmds[2]) || ... ] /

Of course, you can also write

/ | @cmds /

to be clear that you mean junctive semantics.

Note the use of $(...) to ensure that the subscripts are parsed as actual subscripts rather than as regex syntax.

Since $x is interpolated as if you'd said "$x", if $x contains a list, it is stringified first. To get alternation you must use the @$x or @($x) form to indicate that you're intending the scalar variable to be treated as a list.
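For example (hypothetical variable contents):

```raku
my $x = <start stop restart>;    # a scalar holding a list
"please stop" ~~ / @$x /;        # alternation: tries start | stop | restart
"please stop" ~~ / $x /;         # stringifies the list first, so no match here
```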

An interpolated array using junctive semantics is declarative (participates in external longest token matching) only if it's known to be constant at the time the regex is compiled.

As with a scalar variable, each element is matched as a literal. All such values pay attention to the current :ignorecase and :ignoremark settings.

When you get tired of writing:

token sigil { '$' | '@' | '%' | '&' | '::' }

you can write:

token sigil { < $ @ % & :: > }

as long as you're careful to put a space after the initial angle so that it won't be interpreted as a subrule. With the space it is parsed like angle quotes in ordinary Perl 6 and treated as a literal array value.

Alternatively, if you predeclare a proto regex, you can write multiple regexes for the same category, differentiated only by the symbol they match. The symbol is specified as part of the "long name". It may also be matched within the rule using <sym>, like this:
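A sketch of such a proto declaration for the sigil category:

```raku
proto token sigil {*}
token sigil:sym<$>  { <sym> }
token sigil:sym<@>  { <sym> }
token sigil:sym<%>  { <sym> }
token sigil:sym<&>  { <sym> }
token sigil:sym<::> { <sym> }
```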

This can be viewed as a form of multiple dispatch, except that it's based on longest-token matching rather than signature matching. The advantage of writing it this way is that it's easy to add additional rules to the same category in a derived grammar. All of them will be matched in parallel when you try to match /<sigil>/.

If there are formal parameters on multi regex methods, matching still proceeds via longest-token rules first. If that results in a tie, a normal multiple dispatch is made using the arguments to the remaining variants, assuming they can be differentiated by type.

The proto calls into the subdispatcher when it sees a * that cannot be a quantifier and is the only thing in its block. Therefore you can put items before and after the subdispatch by putting the * into curlies:

proto token foo { <prestuff> {*} <poststuff> }

This works only in a proto. See S06 for a discussion of the semantics of {*}. (Unlike a proto sub, a proto regex automatically remembers the return values from {*} because they are carried along with the match cursor.)

Variable matches are considered declarative if and only if the variable is known to represent a constant; otherwise they are procedural. Note that role parameters (if readonly) are considered constant declarations for this purpose despite the absence of an explicit constant declarator, since roles themselves are immutable and will presumably replace the parameter with a constant value when composed (if the value passed is a constant). Macros instantiated with constants would also make those constants eligible for declarative treatment.

Both < and > are metacharacters, and are usually (but not always) used in matched pairs. (Some combinations of metacharacters function as standalone tokens, and these may include angles. These are described below.) Most assertions are considered declarative; procedural assertions will be marked as exceptions.

For matched pairs, the first character after < determines the nature of the assertion:

If the first character is whitespace, the angles are treated as an ordinary "quote words" array literal.

< adam & eve > # equivalent to [ 'adam' | '&' | 'eve' ]

Note that the space before the ending > is optional and therefore < adam & eve> would be acceptable.

A leading alphabetic character means it's a capturing grammatical assertion (i.e. a subrule or a named character class - see below):

/ <sign>? <mantissa> <exponent>? /

The first character after the identifier determines the treatment of the rest of the text before the closing angle. The underlying semantics is that of a function or method call, so if the first character is a left parenthesis, it really is a call to either a method or function:

<foo('bar')>

If the first character after the identifier is an =, then the identifier is taken as an alias for what follows. In particular,

<foo=bar>

is just shorthand for

$<foo> = <bar>

Note that this aliasing does not modify the original <bar> capture. To rename an inherited method capture without using the original name, use the dot form described below on the capture you wish to suppress. That is,

<foo=.bar>

desugars to:

$<foo> = <.bar>

Likewise, to rename a lexically scoped regex explicitly, use the & form described below. That is,

<foo=&bar>

desugars to:

$<foo> = <&bar>

Multiple aliases are allowed, so

<foo=pub=bar>

is short for

$<foo> = $<pub> = <bar>

Similarly, you can alias other assertions, e.g.:

<foo=[abc]> # a character class, same as $<foo>=<[abc]>
<foo=:Letter> # a Unicode property, same as $<foo>=<:Letter>
<foo=:!Letter> # a negated Unicode property lookup

If the first character after the identifier is whitespace, the subsequent text (following any whitespace) is passed as a regex, so:

<foo bar>

is more or less equivalent to

<foo(/bar/)>

To pass a regex with leading whitespace you must use the parenthesized form.

If the first character is a colon followed by whitespace, the rest of the text is taken as a list of arguments to the method, just as in ordinary Perl syntax. So these mean the same thing:
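Assuming a foo subrule that takes arguments, the two equivalent forms would look like:

```raku
<foo('bar', $x)>
<foo: 'bar', $x>
```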

Subrule matches are considered declarative to the extent that the front of the subrule is itself considered declarative. If a subrule contains a sequence point, then so does the subrule match. Longest-token matching does not proceed past such a subrule, for instance.

This form always gives preference to a lexically scoped regex declaration, dispatching directly to it as if it were a function. If there is no such lexical regex (or lexical method) in scope, the call is dispatched to the current grammar, assuming there is one. That is, if there is a my regex foo visible from the current lexical scope, then

<foo(1,2,3)>

means the same as

<foo=&foo(1,2,3)>

However, if there is no such lexically scoped regex (and note that within a grammar, regexes are installed as methods which have no lexical alias by default), then the call is dispatched as a normal method on the current Cursor (which will fail if you're not currently within a grammar). So in that case:

<foo(1,2,3)>

means the same as:

<foo=.foo(1,2,3)>

A call to <foo> will fail if there is neither any lexically scoped routine of that name it can call, nor any method of that name that can be reached via method dispatch. (The decision of which dispatcher to use is made at compile time, not at run time; the method call is not a fallback mechanism.)

A leading . explicitly calls a method as a subrule; the fact that the initial character is not alphanumeric also causes the named assertion to not capture what it matches (see "Subrule captures"). For example:

The assertion is otherwise parsed identically to an assertion beginning with an identifier, provided the next thing after the dot is an identifier. As with the identifier form, any extra arguments pertaining to the matching engine are automatically supplied to the argument list via the implicit Cursor invocant. If there is no current class/grammar, or the current class is not derived from Cursor, the call is likely to fail.

If the dot is not followed by an identifier, it is parsed as a "dotty" postfix of some type, such as an indirect method call:

<.$indirect(@args)>

As with all regex matching, the current match state (some derivative of Cursor) is passed as the first argument, which in this case is simply the method's invocant. The method is expected to return a lazy list of new match state objects, or Nil if the match fails entirely. Ratcheted routines will typically return a list containing only one match.

Whereas a leading . unambiguously calls a method, a leading & unambiguously calls a routine instead. Such a regex routine must be declared (or imported) with my or our scoping to make its name visible to the lexical scope, since by default a regex name is installed only into the current class's metaobject instance, just as with an ordinary method. The routine serves as a kind of private submethod, and is called without any consideration of inheritance. It must still take a Cursor as its first argument (which it can think of as an invocant if it likes). By convention $¢ represents this current incoming match state; the routine must return Nil for failure, or a lazy list of one or more match states (Cursor-derived objects) for successful matches.

As with the . form, an explicit & suppresses capture.
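A minimal sketch (hypothetical names) of a lexically scoped regex called via the & form:

```raku
my regex ident { <.alpha> \w* }    # lexical; not installed as a grammar method
say "x42!" ~~ / <&ident> /;        # matches "x42"; the & call captures nothing
```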

Note that all normal Regex objects are really such routines in disguise. When you say:

rx/stuff/

you're really declaring an anonymous method, something like:

my $internal = anon regex :: ($¢: ) { stuff }

and then passing that object off to someone else who will call it indirectly. In this case, the method is installed neither into a class nor into a lexical scope, but as long as the value stays live somehow, it can still be called indirectly (see below).

A leading $ indicates an indirect subrule call. The variable must contain either a Regex object (really an anonymous method--see above), or a string to be compiled as the regex. The string is never matched literally.
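For example (a sketch):

```raku
my $rx  = rx/ \d+ /;
"order 42" ~~ / <$rx> /;     # calls the Regex object; matches "42"

my $str = '\d+';
"order 42" ~~ / <$str> /;    # the string is compiled as a regex, not matched literally
```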

If the compilation of the string form fails, the error message is converted to a warning and the assertion fails.

The indirect subrule assertion is not captured. (No assertion with leading punctuation is captured by default.) You may always capture it explicitly, of course:

/ <name=$rx> /

An indirect subrule is always considered procedural, and may not participate in longest-token matching.

A leading :: indicates a symbolic indirect subrule:

/ <::($somename)> /

The variable must contain the name of a subrule. By the rules of single method dispatch this is first searched for in the current grammar and its ancestors. If this search fails an attempt is made to dispatch via MMD, in which case it can find subrules defined as multis rather than methods. This form is not captured by default. It is always considered procedural, not declarative.

A leading @ matches like a bare array except that each element is treated as a subrule (string or Regex object) rather than as a literal. That is, a string is forced to be compiled as a subrule instead of being matched literally. (There is no difference for a Regex object.)
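A sketch (hypothetical array contents):

```raku
my @subrules = rx/ \d+ /, '<.alpha>+';   # a Regex object and a string
/ <@subrules> /                          # each element is tried as a subrule
```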

This assertion is not automatically captured.

The use of a hash as an assertion is reserved.

A leading { indicates code that produces a regex to be interpolated into the pattern at that point as a subrule:

/ (<.ident>) <{ %cache{$0} //= get_body_for($0) }> /

The closure is guaranteed to be run at the canonical time; it declares a sequence point, and is considered to be procedural.

In any case of regex interpolation, if the value already happens to be a Regex object, it is not recompiled. If it is a string, the compiled form is cached with the string so that it is not recompiled next time you use it unless the string changes. (Any external lexical variable names must be rebound each time though.) Subrules may not be interpolated with unbalanced bracketing. An interpolated subrule keeps its own inner match results as a single item, so its parentheses never count toward the outer regex's groupings. (In other words, parenthesis numbering is always lexically scoped.)

Unlike closures, code assertions are considered declarative; they are not guaranteed to be run at the canonical time if the optimizer can prove something later can't match. So you can sneak in a call to a non-canonical closure that way:

token { foo .* <?{ do { say "Got here!" } or 1 }> .* bar }

The do block is unlikely to run unless the string ends with "bar".

A leading [ indicates an enumerated character class. Ranges in enumerated character classes are indicated with ".." rather than "-".

However, in order to combine classes you must prefix a named character class with + or -. Whitespace is required before any - that would be misparsed as an identifier extender.
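For example:

```raku
<[a..z] - [aeiou]>     # lowercase consonants
<+alpha - [Xx]>        # alphabetic characters other than X or x
<-[a..z]>              # anything but lowercase a through z
```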

One can use character classes as zero-width assertions. The following two expressions are equivalent, and match a variable that starts with a sigil (assuming that <variable> by itself could also match a sigilless variable):

<?[$@%&]> <variable>
<?before <[$@%&]> > <variable>

Unicode properties are indicated by use of pair notation in place of a normal rule name:

The pair value is smartmatched against the value in the Unicode Character Database.

<:Nv(0 ^..^ 1)> # the char has a proper fractional value

As a particular case of smartmatching, TR18 section 2.6 is satisfied with a pattern as the argument:

<:name(/^LATIN LETTER.*P$/)>

Multiple of these terms may be combined with pluses and minuses:

<+ :HexDigit - :Upper >

Terms may also be combined using & for set intersection, | for set union, and ^ for symmetric set difference. Parens may be used for grouping. (Square brackets always quote literal characters (including backslashed literal forms), and may not be nested, unlike the suggested notation in TR18 section 1.3.) The precedence of the operators is the same as the correspondingly named operators in "Operator precedence" in S03, even though they have somewhat different semantics.
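A hedged illustration (assuming these Unicode property names smartmatch as described above):

```raku
<:Latin & :Lu>     # set intersection: uppercase Latin-script letters
<:Latin | :Greek>  # set union: Latin or Greek script characters
<:L ^ :HexDigit>   # symmetric difference: a letter or a hex digit, but not both
```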

Extra long characters may be entered by quoting them and including them via union. Any quoted characters will be treated as "longest tokens" when appropriate. Here 'll' would be recognized in preference to 'l':

/ <[ a..z ] + [ 'ñ' 'ch' 'll' 'rr' ]> /

Note that a negated character class containing "long characters" always advances by a single character.

When any character constructor such as \c, \x, or \o contains multiple values separated by commas, these are treated as "long characters". So you could add a \c[13,10] to the list above to match CRLF as a long character.

A consequence of this is that the negated form advances by a single position (matching as . does) when the long character doesn't match as a whole. Hence, this matches:

"\c[13,13,10,10]" ~~ /\C[13,10]* \c[13,10] \C[13,10]/;

If you want it to mean \C13\C10 instead, then you can just write it that way.

A leading ! indicates a negated meaning (always a zero-width assertion):

Note that <!alpha> is different from <-alpha>. /<-alpha>/ is a complemented character class equivalent to /<!before <alpha>> ./, whereas <!alpha> is a zero-width assertion equivalent to a /<!before <alpha>>/ assertion.

Note also that as a metacharacter ! doesn't change the parsing rules of whatever follows (unlike, say, + or -).

A leading ? indicates a positive zero-width assertion, and like ! merely reparses the rest of the assertion recursively as if the ? were not there. In addition to forcing zero-width, it also suppresses any named capture:

It is legal to use any of these assertions as named captures by omitting the punctuation at the front. However, capture entails some overhead in both memory and computation, so in general you want to suppress that for data you aren't interested in preserving.

The after assertion implements lookbehind by reversing the syntax tree and looking for things in the opposite order going to the left. It is illegal to do lookbehind on a pattern that cannot be reversed.

Note: the effect of a forward-scanning lookbehind at the top level can be achieved with:

/ .*? prestuff <( mainpat )> /

A leading * indicates that the following pattern allows a partial match. It always succeeds after matching as many characters as possible. (It is not zero-width unless 0 characters match.) For instance, to match a number of abbreviations, you might write any of:
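For instance, a sketch of matching abbreviations of a month name:

```raku
/ ^ <*January> $ /    # accepts "J", "Jan", "Janua", ..., "January" (and "")
```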

The pattern is restricted to declarative forms that can be rewritten as nested optional character matches. Sequence information may not be discarded while making all following characters optional. That is, it is not sufficient to rewrite:

<*xyz>

as:

x? y? z? # bad, would allow xz

Instead, it must be implemented as:

[x [y z?]?]? # allow only x, xy, xyz (and '')

Explicit quantifiers are allowed on single characters, so this:

<* a b+ c | ax*>

is rewritten as something like:

[a [b+ c?]?]? | [a x*]?

In the latter example we're assuming the DFA token matcher is going to give us the longest match regardless. It's also possible that quantified multi-character sequences can be recursively remapped.

[Conjecture: depending on how fancy we get, we might (or might not) be able to autodetect ambiguities in <*@abbrev> and refuse to generate ambiguous abbreviations (although exact match of a shorter abbrev should always be allowed even if it's the prefix of a longer abbreviation). If it is not possible, then the user will have to check for ambiguities after the match. Note also that the array form is assuming the array doesn't change often. If it does, the longest-token matcher has to be recalculated, which could get expensive.]

A leading ~~ indicates a recursive call back into some or all of the current rule. An optional argument indicates which subpattern to re-use, and if provided must resolve to a single subpattern. If omitted, the entire pattern is called recursively:
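For example, a sketch of matching balanced parentheses via self-recursion:

```raku
say "a(b(c)d)e" ~~ / \( [ <-[()]>+ | <~~> ]* \) /;   # matches "(b(c)d)"
```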

The recursive assertion calls back into the current (possibly anonymous) rule as a subrule, and is implicitly anchored to the end of the operator as any other subrule would be. Even when the outer rule scans the string, the inner call to it does not.

Note that a consequence of the previous section is that you also get

<!~~>

for free, which fails if the current rule would match again at this location.

A leading | indicates some kind of a zero-width boundary. You can refer to backslash sequences with this syntax; <|h> will match between a \h and a \H, for instance. Some examples:
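For example:

```raku
<|w>    # boundary between a \w and a \W (in either order)
<|h>    # boundary between a \h and a \H
```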

A <( token indicates the start of the match's overall capture, while the corresponding )> token indicates its endpoint. When matched, these behave as assertions that are always true, but have the side effect of setting the .from and .to attributes of the match object. That is:

/ foo <( \d+ )> bar /

is equivalent to:

/ <?after foo> \d+ <?before bar> /

except that the scan for "foo" can be done in the forward direction, while a lookbehind assertion would presumably scan for \d+ and then match "foo" backwards. The use of <(...)> affects only the positions of the beginning and ending of the match, and anything calculated based on those positions. For instance, after the match above, $() contains only the digits matched, and $/.to is pointing to after the digits. Other captures (named or numbered) are unaffected and may be accessed through $/.

When used directly within quantifiers (that is, within quantified square brackets), there is only one Match object to set .from/.to on, so the <( token always sets .from to the leftmost matching position, while )> always sets .to to the rightmost position. However, the situation is different for capturing parentheses. When used within parentheses (whether or not the parens are quantified), the Match being generated by each dynamic capture serves as the target, so each such capturing group sets its own .from/.to. Hence, if the group is quantified, each capture sets its own boundaries independently.

These tokens are considered declarative.

A « or << token indicates a left word boundary. A » or >> token indicates a right word boundary. (As separate tokens, these need not be balanced.) Perl 5's \b is replaced by a <|w> "word boundary" assertion, while \B becomes <!|w>. (None of these are dependent on the definition of <.ws>, but only on the \w definition of "word" characters. Non-space mark characters are ignored in calculating word properties of the preceding character. See TR18 1.4.)
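For example:

```raku
say "cat nap"     ~~ / «cat» /;   # matches: "cat" begins at a left word boundary
say "concatenate" ~~ / «cat» /;   # fails: every "cat" here starts mid-word
```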

To match Unicode alphabetic characters without the underscore, use <:alpha>.

digit

Match a single digit.

xdigit

Match a single hexadecimal digit.

print

Match a single printable character.

graph

Match a single "graphical" character.

cntrl

Match a single "control" character (equivalent to the <:Cc> property). A control character is usually one that doesn't produce output as such but instead controls the terminal somehow: for example newline and backspace are control characters. All characters with ord() less than 32 are usually classified as control characters (assuming ASCII, the ISO Latin character sets, and Unicode), as is the character with the ord() value of 127 (DEL). The characters from 128 to 159 are also control characters.

punct

Match a single punctuation character (that is, any character from the Unicode General Category "Punctuation").

alnum

Match a single alphanumeric character. This is equivalent to <+alpha +digit> .

wb

Returns a zero-width match that is true at word boundaries. A word boundary is a spot with a "\w" on one side and a "\W" on the other side (in either order), counting the beginning and end of the string as matching "\W".

ww

Matches between two word characters (zero-width match).

ws

Matches required whitespace between two word characters, optional whitespace otherwise. This is roughly equivalent to <!ww> \s* (ws isn't required to use the ww subrule).

space

Match a single whitespace character (same as \s ).

blank

Match a single "blank" character -- in most locales, this corresponds to space and tab.

before pattern

Perform lookahead -- i.e., check if we're at a position where pattern matches. Returns a zero-width Match object on success.

after pattern

Perform lookbehind -- i.e., check if the string before the current position matches <pattern> (anchored at the end). Returns a zero-width Match object on success.

Many \p and \P properties become intrinsic grammar rules (such as <alpha> and <-alpha>). They may be combined using the above-mentioned character class notation: <[-]+alpha+digit>. Regardless of the higher-level character class names, all low-level Unicode properties are always available with a colon prefix, that is, in pair notation within the angle brackets. Hence, <+:Lu+:Lt> is equivalent to <+upper+title>.

The \L...\E, \U...\E, and \Q...\E sequences are gone. The single-character case modifiers \l and \u are also gone. In the rare cases that need them you can use <{ lc $regex }>, <{tc $word}>, etc.

As mentioned above, the \b and \B word boundary assertions are gone, and are replaced with <|w> (or <wb>) and <!|w> (or <!wb>) zero-width assertions.

The \G sequence is gone. Use :p instead. (Note, however, that it makes no sense to use :p within a pattern, since every internal pattern is implicitly anchored to the current position.) See the at assertion below.

Backreferences (e.g. \1, \2, etc.) are gone; $0, $1, etc. can be used instead, because variables are no longer interpolated.

Numeric variables are assumed to change every time and therefore are considered procedural, unlike normal variables.

New backslash sequences, \h and \v, match horizontal and vertical whitespace respectively, including Unicode. Horizontal whitespace is defined as anything matching \s that doesn't also match \v. Vertical whitespace is defined as any of:

\x0A LINE FEED
\x0B VERTICAL TABULATION
\x0C FORM FEED
\x0D CARRIAGE RETURN
\x85 NEXT LINE
\x2028 LINE SEPARATOR
\x2029 PARAGRAPH SEPARATOR

Note that U+000D (CARRIAGE RETURN) is considered vertical whitespace despite the fact that it only moves the "carriage" horizontally.

\s now matches any Unicode whitespace character.

The new backslash sequence \N matches anything except a logical newline; it is the negation of \n.

Other new capital backslash sequences are also the negations of their lowercase counterparts:

\H matches anything but horizontal whitespace.

\V matches anything but vertical whitespace.

\T matches anything but a tab.

\R matches anything but a return.

\F matches anything but a formfeed.

\E matches anything but an escape.

\X... matches anything but the specified character (specified in hexadecimal).

Backslash escapes for literal characters in ordinary strings are allowed in regexes (\a, \x, etc.). However, the exception to this rule is \b, which is disallowed in order to avoid conflict with its former use as a word boundary assertion. To match a literal backspace, use \c8, \x8, or a double-quoted \b.

For historical and convenience reasons, the following character classes are available as backslash sequences:

\d  <digit>    A digit
\D  <-digit>   A nondigit
\w  <alnum>    A word character
\W  <-alnum>   A non-word character
\s  <space>    A whitespace character
\S  <-space>   A non-whitespace character
\h             A horizontal whitespace
\H             A non-horizontal whitespace
\v             A vertical whitespace
\V             A non-vertical whitespace

You may not use whitespace or alphanumerics for delimiters. Space is optional unless needed to distinguish from modifier arguments or function parens. So you may use parens as your rx delimiters, but only if you interpose whitespace:

rx ( pattern ) # okay
rx( 1,2,3 ) # tries to call rx function

(This is true for all quotelike constructs in Perl 6.)

The rx form may be used directly as a pattern anywhere a normal // match can. The regex form is really a method definition, and must be used in such a way that the grammar class it is to be used in is apparent.

Space is necessary after the final modifier if you use any bracketing character for the delimiter. (Otherwise it would be taken as an argument to the modifier.)

You may not use colons for the delimiter. Space is allowed between modifiers:

$regex = rx :s :i / my name is (.*) /;

The name of the constructor was changed from qr because it's no longer an interpolating quote-like operator. rx is short for regex (not to be confused with regular expressions, except when they are).

As the syntax indicates, it is now more closely analogous to a sub {...} constructor. In fact, that analogy runs very deep in Perl 6.

Just as a raw {...} is now always a closure (which may still execute immediately in certain contexts and be passed as an object in others), so too a raw /.../ is now always a Regex object (which may still match immediately in certain contexts and be passed as an object in others).

Specifically, a /.../ matches immediately in a value context (sink, Boolean, string, or numeric), or when it is an explicit argument of a ~~. Otherwise it's a Regex constructor identical to the explicit regex form. So this:

$var = /pattern/;

no longer does the match and sets $var to the result. Instead it assigns a Regex object to $var.
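For example:

```raku
my $var = /\d+/;        # no match yet; $var holds a Regex object
say "abc123" ~~ $var;   # the match happens here, yielding "123"
```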

When you call my_grep, the first argument is bound in item context, so passing {...} or /.../ produces a Code or Regex object, which the switch statement then selects upon. (Normal grep just lets a smartmatch operator do all the work.)

Just as rx has variants, so does the regex declarator. In particular, there are two special variants for use in grammars: token and rule.

A token declaration:

token ident { [ <alpha> | \- ] \w* }

never backtracks by default. That is, it likes to commit to whatever it has scanned so far. The above is equivalent to

regex ident { [ <alpha>: | \-: ]: \w*: }

but rather easier to read. The bare *, +, and ? quantifiers never backtrack in a token. In normal regexes, use *:, +:, or ?: to prevent any backtracking into the quantifier. If you want to explicitly backtrack, append either a ? or a ! to the quantifier. The ? forces frugal matching as usual, while the ! forces greedy matching. The token declarator is really just short for

regex :ratchet { ... }

The other is the rule declarator, for declaring non-terminal productions in a grammar. Like a token, it also does not backtrack by default. In addition, a rule regex also assumes :sigspace. A rule is really short for:

regex :ratchet :sigspace { ... }

The Perl 5 ?...? syntax (succeed once) was rarely used and can be now emulated more cleanly with a state variable:

$result = do { state $x ||= m/ pattern /; } # only matches first time

To reset the pattern, simply say $x = 0. Though if you want $x visible, you'd have to avoid using a block.

Within those portions of a pattern that are considered procedural rather than declarative, you may control the backtracking behavior.

By default, backtracking is greedy in rx, m, s, and the like. It's also greedy in ordinary regex declarations. In rule and token declarations, backtracking must be explicit.

To force the preceding atom to do frugal backtracking (also sometimes known as "eager matching" or "minimal matching"), append a :? or ? to the atom. If the preceding token is a quantifier, the : may be omitted, so *? works just as in Perl 5.

To force the preceding atom to do greedy backtracking in a spot that would default otherwise, append a :! to the atom. If the preceding token is a quantifier, the : may be omitted. (Perl 5 has no corresponding construct because backtracking always defaults to greedy in Perl 5.)

To force the preceding atom to do no backtracking, use a single : without a subsequent ? or !. Backtracking over a single colon causes the regex engine not to retry the preceding atom:

Note that you can still back into the "then" part of such an alternation, so you may also need to put : after it if you want to disable that too. If an explicit or implicit :ratchet has disabled backtracking by supplying an implicit :, you need to put an explicit ! after the alternation to enable backing into a preceding subrule.

::> does nothing if there is no current temporal alternation. "Current" is defined dynamically, not lexically. A ::> in a subrule will affect the enclosing alternation.

Evaluating a triple colon throws away all saved choice points since the current regex was entered. Backtracking to (or past) this point will fail the rule outright (no matter where in the regex it occurs):

(i.e. using an unquoted reserved word as an identifier is not permitted)

Evaluating a <commit> assertion throws away all saved choice points since the start of the entire match. Backtracking to (or past) this point will fail the entire match, no matter how many subrules down it happens:

(i.e. using a reserved word as a subroutine name is instantly fatal to the surrounding match as well)

If commit is given an argument, it's the name of a calling rule that should be committed:

<commit('infix')>

A <cut> assertion always matches successfully, and has the side effect of logically deleting the parts of the string already matched. Whether this actually frees up the memory immediately may depend on various interactions among your backreferences, the string implementation, and the garbage collector. In any case, the string will report that it has been chopped off on the front. It's illegal to use <cut> on a string that you do not have write access to.

Attempting to backtrack past a <cut> causes the complete match to fail (like backtracking past a <commit>). This is because there's now no preceding text to backtrack into. This is useful for throwing away successfully processed input when matching from an input stream or an iterator of arbitrary length.

These keyword-declared regexes are officially of type Method, which is derived from Routine.

In general, the anchoring of any subrule call is controlled by its calling context. When a regex, token, or rule method is called as a subrule, the front is anchored to the current position (as with :p), while the end is not anchored, since the calling context will likely wish to continue parsing. However, when such a method is smartmatched directly, it is automatically anchored on both ends to the beginning and end of the string. Thus, you can do direct pattern matching by using an anonymous regex routine as a standalone pattern:
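A minimal sketch:

```raku
my regex digits { \d+ }
say "12345"   ~~ &digits;        # anchored at both ends: must be all digits
say "ab123cd" ~~ / <digits> /;   # as a subrule: scanning is done by m//, not the subrule
```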

The basic rule of thumb is that the keyword-defined methods never do implicit .*?-like scanning, while the m// and s/// quotelike forms do such scanning in the absence of explicit anchoring.

The rx// and // forms can go either way: they scan when used directly within a smartmatch or boolean context, but when called indirectly as a subrule they do not scan. That is, the object returned by rx// behaves like m// when used directly, but like regex{} when used as a subrule:
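A sketch of that dual behavior (the digits name is illustrative):

my $num = rx/ \d+ /;

"abc123" ~~ $num;           # used directly: scans like m//, finds "123"

regex digits { <$num> }     # used as a subrule: anchored at the current
                            # position, with no scanning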

Since "longest-token matching" is a long phrase, we will usually refer to this idea as LTM. The basic notion is that LTM is how people tend to parse text in their heads, so the computer ought to try to do the same. And parsing with LTM is all about how the computer decides which alternative of a set of alternatives is going to match.

Instead of representing temporal alternation as it does in Perl 5, in Perl 6 | represents logical alternation with declarative longest-token semantics. (You may now use || to indicate the old temporal alternation. That is, | and || now work within regex syntax much the same as they do outside of regex syntax, where they represent junctional and short-circuit OR. This includes the fact that | has tighter precedence than ||.)
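For example, contrasting the two operators on the same input:

"++" ~~ / '+' | '++'  /;    # declarative: '++' wins, being the longer token
"++" ~~ / '+' || '++' /;    # temporal: '+' is tried first and succeeds,
                            # so '++' is never considered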

Historically regex processing has proceeded in Perl via a backtracking NFA algorithm. This is quite powerful, but many parsers work more efficiently by processing rules in parallel rather than one after another, at least up to a point. If you look at something like a yacc grammar, you find a lot of pattern/action declarations where the patterns are considered in parallel, and eventually the grammar decides which action to fire off. While the default Perl view of parsing is essentially top-down (perhaps with a bottom-up "middle layer" to handle operator precedence), it is extremely useful for user understanding if at least the token processing proceeds deterministically. So for regex matching purposes we define token patterns as those patterns that can be matched without potential side effects or self-reference. (Since whitespace often has side effects at line transitions, it is usually excluded from such patterns, give or take a little lookahead.) Basically, Perl automatically derives a lexer from the grammar without you having to write one yourself.

To that end, every regex in Perl 6 is required to be able to distinguish its "pure" patterns from its actions, and return its list of initial token patterns (transitively including the token patterns of any subrule called by the "pure" part of that regex, but not including any subrule more than once, since that would involve self reference, which is not allowed in traditional regular expressions). A logical alternation using | then takes two or more of these lists and dispatches to the alternative that matches the longest token prefix. This may or may not be the alternative that comes first lexically.

However, if two alternatives match at the same length, the tie is broken first by specificity. The alternative that starts with the longest fixed string wins; that is, an exact match counts as closer than a match made using character classes. If that doesn't work, the tie is broken by one of two methods. If the alternatives are in different grammars, standard MRO (method resolution order) determines which one to try first. If the alternatives are in the same grammar file, the textually earlier alternative takes precedence. (If a grammar's rules are defined in more than one file, the order is undefined, and an explicit assertion must be used to force failure if the wrong one is tried first.)

This longest token prefix corresponds roughly to the notion of "token" in other parsing systems that use a lexer, but in the case of Perl this is largely an epiphenomenon derived automatically from the grammar definition. However, despite being automatically calculated, the set of tokens can be modified by the user; various constructs within a regex declaratively tell the grammar engine that it is finished with the pattern part and starting in on the side effects, so by inserting such constructs the user controls what is considered a token and what is not. The constructs deemed to terminate a token declaration and start the "action" part of the pattern include:

Any :: or ::: backtracking control (but not the : possessive modifier).

Any atom that is quantified with a frugal (minimal) match, i.e. using the ? modifier on a quantifier, as in *? or +?.

Any {...} action, but not an assertion containing a closure. (The empty closure {} is customarily used to explicitly terminate the pure part of the pattern.) The closure form of the general **{...} quantifier also terminates the longest token, but the closureless forms of quantifier do not.

Any sequential control flow operator such as || or &&.

As a consequence of the previous point, and because the standard grammar's <ws> rule defines whitespace using ||, the longest token is also terminated by any part of the regex or rule that might match whitespace using that rule, including whitespace implicitly matched via :sigspace. (However, token declarations are specifically allowed to recognize whitespace within a token by using such lower-level primitives as \h+ or other character classes.)

Subpatterns (captures) specifically do not terminate the token pattern, but may require a reparse of the token to find the location of the subpatterns. Likewise assertions may need to be checked out after the longest token is determined. (Alternately, if DFA semantics are simulated in any of various ways, such as by Thompson NFA, it may be possible to know when to fire off the assertions without backchecks.)

Greedy quantifiers and character classes do not terminate a token pattern. Zero-width assertions such as word boundaries are also okay.

Because such assertions can be part of the token, the lexer engine must be able to recover from the failure of such an assertion and backtrack to the next best token candidate, which might be the same length or shorter, but can never be longer than the current candidate.

For a pattern that contains a positive lookahead assertion such as <?foo> or <?before \s>, the assertion is assumed to be more specific than the subsequent pattern, so the lookahead's pattern is counted as the final part of the longest token; the longest-token matcher will be smart enough to treat the extra bit as 0-width, that is, to rematch any text traversed by the lookahead when (and if) it continues the match. (Indeed, if the entire lookahead is pure enough to participate in LTM, the rematcher may simply optimize away the rematching, since the lookahead already matched in the LTM engine.)

However, for a pattern that contains a negative lookahead assertion such as <!foo> or <!before \s>, just the opposite is true: the subsequent pattern is assumed to be more specific than the assertion's. So LTM completely ignores negative lookaheads, and continues to look for pure patterns in whatever follows the negative lookahead. You might say that positive lookaheads are opaque to LTM, but negative lookaheads are transparent to LTM. As a consequence, if you wish to write a positive lookahead that is transparent to LTM, you may indicate this with a double negation: <!!foo>. (The optimizer is free to remove the double negation, but not the transparency.)

Oddly enough, the token keyword specifically does not determine the scope of a token, except insofar as a token pattern usually doesn't do much matching of whitespace, and whitespace is the prototypical way of terminating tokens.

The initial token matcher must take into account case sensitivity (or any other canonicalization primitives) and do the right thing even when propagated up to rules that don't have the same canonicalization. That is, they must continue to represent the set of matches that the lower rule would match.

The || form has the old short-circuit semantics, and will not attempt to match its right side unless all possibilities (including all | possibilities) are exhausted on its left. The first || in a regex makes the token patterns on its left available to the outer longest-token matcher, but hides any subsequent tests from longest-token matching. Every || establishes a new longest-token matcher. That is, if you use | on the right side of ||, that right side establishes a new top level scope for longest-token processing for this subexpression and any called subrules. The right side's longest-token automaton is invisible to the left of the || or outside the regex containing the ||.

A successful match always returns a Match object, which is generally also put into $/, a dynamic lexical declared in the outer routine that is calling the regex. (A named regex, token, or rule is a routine, and hence declares its own lexical $/ variable, which always refers to the most recent submatch within the rule, if any.) The current match state is kept in the regex's $¢ variable which will eventually get bound to the user's $/ variable when the match completes.

An unsuccessful match returns Nil (and sets $/ to Nil if the match would have set it).

Notionally, a match object contains (among other things) a boolean success value, an array of ordered submatch objects, and a hash of named submatch objects. (It also optionally carries an abstract object, normally used to build up an abstract syntax tree.) To provide convenient access to these various values, the match object evaluates differently in different contexts:

In boolean context it evaluates as true or false (i.e. did the match succeed?):

if /pattern/ {...}
# or:
/pattern/; if $/ {...}

With :global, :overlap, or :exhaustive, the boolean is allowed to return true as soon as the first match is found. The Match object can produce the rest of the results lazily if evaluated in list context.

In string context it evaluates to the stringified value of its match, which is usually the entire matched string:
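For example:

"cat dog" ~~ / \w+ /;
say ~$/;            # cat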

In numeric context it evaluates to the numeric value of its match, which is usually the entire matched string:

$sum += /\d+/;
# or equivalently:
/\d+/; $sum = $sum + $/;

When used as a scalar, a Match object evaluates to itself.

However, sometimes you would like an alternate scalar value to ride along with the match. The Match object itself describes a concrete parse tree, so this extra value is called an abstract object; it rides along as an attribute of the Match object. The .made method by default returns an undefined value. $() is a shorthand for $($/.made // ~$/).

Therefore $() is usually just the entire match string, but you can override that by calling make inside a regex:
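For instance (a minimal sketch):

"3+4" ~~ / (\d+) '+' (\d+) { make $0 + $1 } /;
say $();    # 7, the abstract value, rather than the matched string "3+4"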

This puts the new abstract node into $/.made. An AST node may be of any type. Using the make/.made mechanism, it is convenient to build up an abstract syntax tree of arbitrary node types.

However, the make function is not limited to storing AST nodes and building abstract syntax trees; that is just one specific use Perl 6 makes of it internally. Nor does make impose any item or list context onto its argument, so if you say something ambiguously listy like

make ()
make @array
make foo()

the value returned from .made will interpolate into a list. To suppress this, use one of these:

make ().item
make []
make $@array
make [@array]
make foo().item
make $(foo())

or use .made.item or a $ variable on the receiving end.

The .ast method is a synonym for .made, with no difference in behavior. It exists both for historical reasons and as a way to signal to readers of your code a more AST-like use of the make/.made mechanism.

You may also capture a subset of the match using the <(...)> construct:

"foo123bar" ~~ / foo <( \d+ )> bar /
say $(); # says 123

In this case $() is always a string when doing string matching, and a list of one or more elements when doing list matching. This construct does not set the .made attribute.

When used as an array, a Match object pretends to be an array of all its positional captures. Hence

($key, $val) = ms/ (\S+) '=>' (\S+)/;

can also be written:

$result = ms/ (\S+) '=>' (\S+)/;
($key, $val) = @$result;

To get a single capture into a string, use a subscript:

$mystring = "{ ms/ (\S+) '=>' (\S+)/[0] }";

To get all the captures into a string, use a zen slice:

$mystring = "{ ms/ (\S+) '=>' (\S+)/[] }";

Or cast it into an array:

$mystring = "@( ms/ (\S+) '=>' (\S+)/ )";

Note that, as a scalar variable, $/ doesn't automatically flatten in list context. Use @() as a shorthand for @($/) to flatten the positional captures under list context. Note that a Match object is allowed to evaluate its match lazily in list context. Use eager @() to force an eager match.

When used as a hash, a Match object pretends to be a hash of all its named captures. The keys do not include any sigils, so if you capture to variable @<foo> its real name is $/{'foo'} or $/<foo>. However, you may still refer to it as @<foo> anywhere $/ is visible. (But it is erroneous to use the same name for two different capture datatypes.)

Note that, as a scalar variable, $/ doesn't automatically flatten in list context. Use %() as a shorthand for %($/) to flatten as a hash, or bind it to a variable of the appropriate type. As with @(), it's possible for %() to produce its pairs lazily in list context.

The numbered captures may be treated as named, so $<0 1 2> is equivalent to $/[0,1,2]. This allows you to write slices of intermixed named and numbered captures.

The .keys, .values and .kv methods act both on the list and hash part, with the list part coming first.

'abcd' ~~ /(.)(.)**2 <alpha>/;
say ~$/.keys; # 0 1 alpha

In ordinary code, variables $0, $1, etc. are just aliases into $/[0], $/[1], etc. Hence they will all be undefined if the last match failed (unless they were explicitly bound in a closure without using the let keyword).

This last value may correspond to either $¢.from or $¢.to depending on whether the match is proceeding in a forward or backward direction (the latter case arising inside an <?after ...> assertion).

As described above, a Match in list context returns its positional captures. However, sometimes you'd rather get a flat list of tokens in the order they occur in the text. The .caps method returns a list of every capture in order, regardless of how it was otherwise bound into named or numbered captures. (Other than order, there is no new information here; all the elements of the list are the very same Match objects that bound elsewhere.) The bindings are actually returned as key/value pairs where the key is the name or number under which the match object was bound, and the value is the match object itself.

In addition to returning those captured Match objects, the .chunks method also returns all the interleaved "noise" between the captures. As with .caps, the list elements are in the order they were originally in the text. The interleaved bits are also returned as pairs, where the key is '~' and the value is a simple Match object containing only the string, even if unbound subrules such as .ws were called to traverse the text in the first place. Calling .made on such a Match object always returns a Str.

A warning will be issued if either .caps or .chunks discovers that it has overlapping bindings. In the absence of such overlap, .chunks guarantees to map every part of its matched string (between .from and .to) to exactly one element of its returned matches, so coverage is complete.

[Conjecture: we could also have .deepcaps and .deepchunks that recursively expand any capture containing submatches. Presumably the keys of such returned chunks would indicate the "pedigree" of bindings in the parse tree.]

All match attempts--successful or not--against any regex, subrule, or subpattern (see below) return an object that can be evaluated as a boolean. (This object will be either a Match or a Nil.) That is:

$match_obj = $str ~~ /pattern/;
say "Matched" if $match_obj;

This returned object is also automatically bound to the lexical $/ variable of the current surroundings regardless of success. That is:

$str ~~ /pattern/;
say "Matched" if $/;

Inside a regex, the $¢ variable holds the current regex's incomplete Match object, known as a match state (of type Cursor). Generally this should not be modified unless you know how to create and propagate match states. All regexes actually return match states even when you think they're returning something else, because the match states keep track of the successes and failures of the pattern for you.

Fortunately, when you just want to return a different abstract result along with the default concrete Match object, you may associate your return value with the current match state using the make function, which works something like a return, but doesn't clobber the match state:
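A minimal sketch of that usage inside a token:

token sum {
    (\d+) '+' (\d+)
    { make $0 + $1 }   # attaches an abstract value to the match state
                       # without clobbering the Match object itself
}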

The value of any Match object (such as an abstract object) is available via the .made method. Hence these abstract objects can be managed independently of the returned cursor objects.

The current cursor object must always be derived from Cursor, or the match will not work. However, within that constraint, the actual type of the current cursor defines which language you are currently parsing. When you enter the top of a grammar, this cursor generally starts out as an object whose type is the name of the grammar you are in, but the current language can be modified by various methods as they mutate the current language by returning cursor objects blessed into a different type, which may or may not be derived from the current grammar.

Each subpattern in a regex produces a Match object if it is successfully matched.

Each subpattern is either explicitly assigned to a named destination or implicitly added to an array of matches.

For each subpattern that is not explicitly given a name, the subpattern's Match object is pushed onto the array inside the outer Match object belonging to the surrounding scope (known as its parent Match object). The surrounding scope may be either the innermost surrounding subpattern (if the subpattern is nested) or else the entire regex itself.

Like all captures, these assignments to the array are hypothetical, and are undone if the subpattern is backtracked.

For example, if a regex with nested subpatterns of the form / ( A ( B ) ( C ) ) / matched successfully (where the outer parens are subpat-A and the inner ones subpat-B and subpat-C), then the Match objects representing the matches made by subpat-B and subpat-C would be successively pushed onto the array inside subpat-A's Match object. Then subpat-A's Match object would itself be pushed onto the array inside the Match object for the entire regex (i.e. onto $/'s array).

Note that, in Perl 6, the numeric capture variables start from $0, not $1, with the numbers corresponding to the element's index inside $/.

The array elements of the regex's Match object (i.e. $/) store individual Match objects representing the substrings that were matched and captured by the first, second, third, etc. outermost (i.e. unnested) subpatterns. So these elements can be treated like fully fledged match results. For example:
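For instance (an illustrative sketch):

if "foo=bar" ~~ / (\w+) '=' (\w+) / {
    say ~$/[0];       # foo
    say $/[1].from;   # 4: the offset at which the second capture began
}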

If a subpattern is directly quantified with ?, it either produces a single Match object, or Nil. If a subpattern is directly quantified using any other quantifier, it never produces a single Match object. Instead, it produces a list of Match objects corresponding to the sequence of individual matches made by the repeated subpattern. If we need to distinguish the two categories, ? is an item quantifier, while *, +, and ** are called list quantifiers.

If 0 values match, the captured value depends on which quantifier is used. If the quantifier is ?, Nil is captured when it matches 0 times. If the quantifier is *, the empty list, (), is captured instead. (Nothing is captured by the + quantifier if it matches 0 times, since it causes backtracking, but the capture variable should return Nil if an attempt is made to use it after an unsuccessful match.) A ** quantifier returns () as * does if the minimum of its range is 0, and backtracks otherwise.

Note that ** 0..1 is always considered a list quantifier, unlike ?.

The rationale for treating ? as an item quantifier is to make it consistent with how $object.?meth is defined, and to reduce the need for gratuitous .[0] subscripts, which is surprising to most people. Now that Nil is considered undefined rather than a synonym for (), it's easy to use $0 // "default" or some such to dereference a capture safely.

Because a list-quantified subpattern returns a list of Match objects, the corresponding array element for the quantified capture will store a (nested) array rather than a single Match object. For example:
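A sketch of such a quantified capture:

if "a1 b2 c3" ~~ / [ (\w\d) \s? ]+ / {
    # $0 is an array of Match objects: "a1", "b2", "c3"
    say $0.elems;    # 3
}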

Non-capturing brackets don't create a separate nested lexical scope, so the two subpatterns inside them are actually still in the regex's top-level scope, hence their top-level designations: $0 and $1.

However, because the two subpatterns are inside a quantified structure, $0 and $1 will each contain an array. The elements of that array will be the submatches returned by the corresponding subpatterns on each iteration of the non-capturing parentheses. For example:
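A reconstruction of the sort of pattern under discussion (the original example is not shown here):

if ms/ [ (\w+) \: (\N*) \n ]+ / {
    # $0 collects the keys matched on each iteration and $1 the
    # values: $0[0] and $1[0] from the first line, $0[1] and
    # $1[1] from the second, and so on
}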

In contrast, if the outer quantified structure is a capturing structure (i.e. a subpattern) then it will introduce a nested lexical scope. That outer quantified structure will then return an array of Match objects representing the captures of the inner parens for every iteration (as described above). That is:
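Again as an illustrative reconstruction:

if ms/ ( (\w+) \: (\N*) \n )+ / {
    # $0 is an array of Match objects, one per iteration; the inner
    # captures nest inside them, so $0[0][0] and $0[0][1] are the
    # key and value from the first iteration
}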

Additionally, the sublists are kept "in sync" with each other: each empty match (as with $0[1] in our example, when a : is followed immediately by a newline character) has a corresponding Nil in the given list.

The index of a given subpattern can always be statically determined, but is not necessarily unique nor always monotonic. The numbering of subpatterns restarts in each lexical scope (either a regex, a subpattern, or the branch of an alternation).

In particular, the index of capturing parentheses restarts after each | or || (but not after each & or &&). Hence:
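For instance:

"b" ~~ / (a) (b) | (x) (y) | (b) /;
say ~$0;    # b: the capture in the third alternative is $0, not $4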

Just like subpatterns, each successfully matched subrule within a regex produces a Match object. But, unlike subpatterns, that Match object is not assigned to the array inside its parent Match object. Instead, it is assigned to an entry of the hash inside its parent Match object. For example:
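An illustrative sketch:

my token ident { <.alpha> \w* }

if "foo=1" ~~ / <ident> '=' (\d+) / {
    say ~$<ident>;   # foo: the subrule capture lands in the hash
    say ~$0;         # 1: the subpattern capture lands in the array
}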

The hash entries of a Match object can be referred to using any of the standard hash access notations ($/{'foo'}, $/<bar>, $/«baz», etc.), or else via corresponding lexically scoped aliases ($<foo>, $«bar», $<baz>, etc.) So the previous example also implies:

If a subrule appears two (or more) times in any branch of a lexical scope (i.e. twice within the same subpattern and alternation), or if the subrule is list-quantified anywhere within a given scope (that is, by any quantifier other than ?), then its corresponding hash entry is always assigned an array of Match objects rather than a single Match object.

Successive matches of the same subrule (whether from separate calls, or from a single quantified repetition) append their individual Match objects to this array. For example:

if ms/ mv <file> <file> / {
    $from = $<file>[0];
    $to   = $<file>[1];
}

(Note, for clarity we are ignoring whitespace subtleties here--the normal sigspace rules would require space only between alphanumeric characters, which is wrong. Assume that our file subrule deals with whitespace on its own.)

Aliases can be named or numbered. They can be scalar-, array-, or hash-like. And they can be applied to either capturing or non-capturing constructs. The following sections highlight special features of the semantics of some of those combinations.

Given an aliased subpattern such as:

/ $<key>=( (<[A..E]>) (\d+) (X?) ) /

the outer capturing parens no longer capture into the array of $/ as unaliased parens would. Instead the aliased parens capture into the hash of $/; specifically, into the hash element whose key is the alias name. In other words:

$/<key> will contain the Match object that would previously have been placed in $/[0].

$/<key>[0] will contain the A-E letter,

$/<key>[1] will contain the digits,

$/<key>[2] will contain the optional X.

Another way to think about this behavior is that aliased parens create a kind of lexically scoped named subrule; that the contents of the parentheses are treated as if they were part of a separate subrule whose name is the alias.

If the same alias is instead applied to non-capturing brackets:

/ $<key>=[ (<[A..E]>) (\d+) (X?) ] /

then the corresponding $/<key> Match object contains only the string matched by the non-capturing brackets.

In particular, the array of the $/<key> entry is empty. That's because square brackets do not create a nested lexical scope, so the subpatterns are unnested and hence correspond to $0, $1, and $2, and not to $/<key>[0], $/<key>[1], and $/<key>[2].

In other words:

$/<key> will contain the complete substring matched by the square brackets (in a Match object, as described above),

Hence aliasing a dotted subrule changes the destination of the subrule's Match object. This is particularly useful for differentiating two or more calls to the same subrule in the same scope. For example:
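For instance, assuming a <file> subrule as in the earlier examples:

if ms/ mv $<from>=<.file> $<to>=<.file> / {
    $from = $<from>;   # the first file name
    $to   = $<to>;     # the second file name
                       # (no $<file> entry is created, since the
                       # dotted calls are themselves non-capturing)
}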

If a numbered alias (such as $0= or $1=) is used instead of a named one, the behavior is exactly the same as for a named alias (i.e. the various cases described above), except that the resulting Match object is assigned to the corresponding element of the appropriate array rather than to an element of the hash.

If any numbered alias is used, the numbering of subsequent unaliased subpatterns in the same scope automatically increments from that alias number (much like enum values increment from the last explicit value). That is:
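For instance (a reconstruction consistent with the description that follows):

/ $1=[ \d+ ] (<[A..E]>) /
# the brackets' match is captured into $1, so the subsequent
# subpattern (<[A..E]>) is numbered $2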

The non-capturing brackets don't introduce a scope, so the subpatterns within them are at regex scope, and hence numbered at the top level. Aliasing the square brackets to $1 means that the next subpattern at the same level (i.e. the (<[A..E]>)) is numbered sequentially (i.e. $2), etc.

All of the above semantics apply equally to aliases which are bound to quantified structures.

The only difference is that, if the aliased construct is a subrule or subpattern, that quantified subrule or subpattern will have returned a list of Match objects (as described in "Quantified subpattern captures" and "Repeated captures of the same subrule"). So the corresponding array element or hash entry for the alias will contain an array, instead of a single Match object.

In other words, aliasing and quantification are completely orthogonal. For example:

if ms/ mv $0=<.file>+ / {
    # <file>+ returns a list of Match objects,
    # so $0 contains an array of Match objects,
    # one for each successful call to <file>

    # $/<file> does not exist (it's suppressed by the dot)
}

if m/ mv \s+ $<from>=(\S+ \s+)* / {
    # Quantified subpattern returns a list of Match objects,
    # so $/<from> contains an array of Match
    # objects, one for each successful match of the subpattern

    # $0 does not exist (it's pre-empted by the alias)
}

Note, however, that a set of quantified non-capturing brackets always returns a single Match object which contains only the complete substring that was matched by the full set of repetitions of the brackets (as described in "Named scalar aliases applied to non-capturing brackets"). For example:

ms/ mv $<files>=[ f.. \s* ]* /;  # $/<files> assigned a single
                                 # Match object containing the
                                 # complete substring matched by
                                 # the full set of repetitions
                                 # of the non-capturing brackets

Using the @alias= notation instead of a $alias= mandates that the corresponding hash entry or array element always receives an array of Match objects, even if the construct being aliased would normally return a single Match object. This is useful for creating consistent capture semantics across structurally different alternations (by enforcing array captures in all branches).

If an array alias is applied to a quantified pair of non-capturing brackets, it captures the substrings matched by each repetition of the brackets into separate elements of the corresponding array. That is:
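By contrast, the array-alias form described above captures each repetition separately (a sketch paralleling the scalar-alias example):

ms/ mv @<files>=[ f.. \s* ]* /;  # $/<files> assigned an array of
                                 # Match objects, one for each
                                 # repetition of the non-capturing
                                 # brackets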

If an array alias is applied to a quantified pair of capturing parens (i.e. to a subpattern), then the corresponding hash or array element is assigned a list constructed by concatenating the array values of the Match objects returned by each repetition of the subpattern. That is, an array alias on a subpattern flattens and collects all nested subpattern captures within the aliased subpattern. For example:

if ms/ $<pairs>=( (\w+) \: (\N+) )+ / {
    # Scalar alias, so $/<pairs> is assigned an array
    # of Match objects, each of which has its own array
    # of two subcaptures...
}

if ms/ @<pairs>=( (\w+) \: (\N+) )+ / {
    # Array alias, so $/<pairs> is assigned an array
    # of Match objects, flattened out of the two
    # subcaptures within each repetition of the subpattern

    for @($<pairs>) -> $key, $val {
        say "Key: $key";
        say "Val: $val";
    }
}

Likewise, if an array alias is applied to a quantified subrule, then the hash or array element corresponding to the alias is assigned a list containing the array values of each Match object returned by each repetition of the subrule, all flattened into a single array:

rule pair { (\w+) \: (\N+) \n }

if ms/ $<pairs>=<pair>+ / {
    # Scalar alias, so $/<pairs> contains an array of
    # Match objects, each of which is the result of the
    # <pair> subrule call...
}

if ms/ mv @<pairs>=<pair>+ / {
    # Array alias, so $/<pairs> contains an array of
    # Match objects, all flattened down from the
    # nested arrays inside the Match objects returned
    # by each match of the <pair> subrule...

    for @($<pairs>) -> $key, $val {
        say "Key: $key";
        say "Val: $val";
    }
}

In other words, an array alias is useful to flatten into a single array any nested captures that might occur within a quantified subpattern or subrule. Whereas a scalar alias is useful to preserve within a top-level array the internal structure of each repetition.

It is also possible to use a numbered variable as an array alias. The semantics are exactly as described above, with the sole difference being that the resulting array of Match objects is assigned into the appropriate element of the regex's match array rather than to a key of its match hash. For example:
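For instance (assuming the <file> subrule again):

if ms/ mv @0=<.file>+ / {
    # the flattened list of results from the repeated <.file>
    # calls is stored in $/[0], i.e. in the match array rather
    # than under a named key in the match hash
}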

A hash alias causes the corresponding hash or array element in the current scope's Match object to be assigned a (nested) Hash object (rather than an Array object or a single Match object).

If a hash alias is applied to a subrule or subpattern then the first nested numeric capture becomes the key of each hash entry and any remaining numeric captures become the values (in an array if there is more than one).

As with array aliases it is also possible to use a numbered variable as a hash alias. Once again, the only difference is where the resulting Match object is stored:

rule one_to_many { (\w+) \: (\S+) (\S+) (\S+) }

if ms/ %0=<one_to_many>+ / {
    # $/[0] contains a hash, in which each key is provided by
    # the first subcapture within one_to_many, and each value
    # is an array containing the subrule's second, third,
    # fourth, etc. subcaptures...
}

Instead of capturing into the current Match object, the name of an ordinary variable can be used as an external alias, like so:

m/ mv @OUTER::files=<ident>+ $OUTER::dir=<ident> /

In this case, the behavior of each alias is exactly as described in the previous sections, except that any resulting capture is bound directly (but still hypothetically) to the variable of the specified name that must already exist in the scope in which the regex is declared.

When an entire regex is successfully matched with repetitions (specified via the :x or :g flag) or overlaps (specified via the :ov or :ex flag), it will usually produce a sequence of distinct matches.

A successful match under any of these flags still returns a single Match object in $/. However, this object may represent a partial evaluation of the regex. Moreover, the values of this match object are slightly different from those provided by a non-repeated match:

The boolean value of $/ after such matches is true or false, depending on whether the pattern matched.

The string value is the substring from the start of the first match to the end of the last match (including any intervening parts of the string that the regex skipped over in order to find later matches).

Subcaptures are returned as a multidimensional list, which the user can choose to process in either of two ways. If you refer to @().flat (or just use @() in a flat list context), the multidimensionality is ignored and all the matches are returned flattened (but still lazily). If you refer to lol(), you can get each individual sublist as a List object. As with any multidimensional list, each sublist can be lazy separately.

Calling the parse method (for example, MyGrammar.parse($text)) creates a Grammar object, whose type denotes the current language being parsed, and from which other grammars may be derived as extended languages. All grammar objects are derived from Cursor, so every grammar object's value embodies the current state of the current match. This new grammar object is then passed as the invocant to the TOP method (regex, token, or rule) of MyGrammar. The default rule name to call can be overridden with the :rule named argument of the parse method, which is useful for unit testing the rules of a grammar. As methods, rules can take arguments, so the :args named argument is used to pass such arguments as a list if necessary.

Grammar objects are considered immutable, so every match returns a different match state, and multiple match states may exist simultaneously. Each such match state is considered a hypothesis on how the pattern will eventually match. A backtrackable choice in pattern matching may be easily represented in Perl 6 as a lazy list of match state cursors; backtracking consists of merely throwing away the front value of the list and continuing to match with the next value. Hence, the management of these match cursors controls how backtracking works, and falls naturally out of the lazy list paradigm.
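The lazy-list-of-cursors model can be sketched in Python (a hedged illustration; the branch set, text, and positions are invented, and a real cursor would carry more state than a position):

```python
def alternatives(text, pos, branches):
    """Yield one successor match state (here just a new position) for
    each branch that matches at pos: a lazy list of hypothetical cursors."""
    for branch in branches:
        if text.startswith(branch, pos):
            yield pos + len(branch)

# Invented example: three branches, two of which match at position 0.
cursors = alternatives("foobar", 0, ["foo", "foob", "nope"])
first = next(cursors)   # commit to the first hypothesis: position 3
# Backtracking is merely discarding that value and taking the next:
second = next(cursors)  # position 4
print(first, second)
```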

The .parse and .parsefile methods anchor to the beginning and ending of the text, and fail if the end of text is not reached. (The TOP rule can check against $ itself if it wishes to produce its own error message.)

If you wish to parse a portion of a text, then use the .subparse method instead. You may pass a :pos argument to start parsing at some position other than 0. You may pass a :rule argument to specify which subrule you want to call. The final position can be determined by examining the returned Match object.
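Python's re module offers a loose analogy to :pos-style subparsing (this is only a familiar comparison, not the Perl 6 API):

```python
import re

pat = re.compile(r"\d+")
m = pat.match("abc123xyz", 3)  # start matching at offset 3, like :pos
print(m.group(), m.end())      # the final position is read off the match
```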

Action objects (provided via the :actions named argument in Grammar.parse) are objects whose methods correspond to the rules in a grammar. When a rule in a grammar matches, any method in the actions object with the same name (if there is one) is used to build the AST for the Match the grammar is building. Action methods should have a single parameter (by convention, $/) that contains the Match object for the rule. Action methods are invoked as soon as the corresponding rule has a successful match, regardless of whether the match is zero-width or occurs in a backtracking branch that may eventually fail. Therefore state should be tracked via the AST, and side effects may cause unexpected behavior.

Action methods are called within the call frame for the rule, so dynamic variables set in the rule are passed along to the action method.
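A minimal sketch of action-object dispatch, modeled in Python (the class, method, and function names are invented for illustration, not the Perl 6 API):

```python
class Match:
    """Minimal stand-in for a match object carrying an AST slot."""
    def __init__(self, text):
        self.text = text
        self.ast = None

class Actions:
    # One method per rule name, called with that rule's match object.
    def number(self, m):
        m.ast = int(m.text)

def rule_matched(actions, rule_name, match):
    """Invoke the same-named action method, if any, as soon as the
    rule has matched, letting it attach an AST to the match."""
    method = getattr(actions, rule_name, None)
    if method is not None:
        method(match)
    return match

m = rule_matched(Actions(), "number", Match("42"))
print(m.ast)  # 42
```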

(It is a general policy in Perl 6 that any pragma designed to influence the surface behavior of a keyword is identical to the keyword itself, unless there is good reason to do otherwise. On the other hand, pragmas designed to influence deep semantics should not be named identically, though of course some similarity is good.)

The tr/// quote-like operator now also has a method form called trans(). Its argument is a list of pairs. You can use anything that produces a pair list:

$str.trans( %mapping.pairs );

Use the .= form to do a translation in place:

$str.=trans( %mapping.pairs );

(Perl 6 does not support the y/// form, which was only in sed because they were running out of single letters.)

The two sides of any pair can be strings, interpreted as tr/// would interpret them:

$str.=trans( 'A..C' => 'a..c', 'XYZ' => 'xyz' );

As a degenerate case, each side can be individual characters:

$str.=trans( 'A'=>'a', 'B'=>'b', 'C'=>'c' );

Whitespace characters are taken literally as characters to be translated from or to. The .. range sequence is the only metasyntax recognized within a string, though you may of course use backslash interpolations in double quotes. If the right side is too short, the final character is replicated out to the length of the left string. If there is no final character because the right side is the null string, the result is deletion instead.
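These string-form rules (range expansion, final-character replication, and deletion on a null right side) can be modeled in Python; the helper names are invented, and this is an illustrative sketch rather than the actual implementation:

```python
def expand(spec):
    """Expand the '..' range notation ('A..C' -> ['A', 'B', 'C'])."""
    out, i = [], 0
    while i < len(spec):
        if spec[i:i+2] == ".." and out and i + 2 < len(spec):
            lo, hi = out.pop(), spec[i+2]
            out.extend(chr(c) for c in range(ord(lo), ord(hi) + 1))
            i += 3
        else:
            out.append(spec[i])
            i += 1
    return out

def trans(s, frm, to):
    frm, to = expand(frm), expand(to)
    if to and len(to) < len(frm):
        to = to + [to[-1]] * (len(frm) - len(to))  # replicate final char
    table = {f: (to[i] if i < len(to) else None) for i, f in enumerate(frm)}
    out = []
    for c in s:
        if c not in table:
            out.append(c)          # untranslated characters pass through
        elif table[c] is not None:
            out.append(table[c])   # translated; None means deletion
    return "".join(out)

print(trans("ABCD-XY", "A..C", "a..c"))  # abcD-XY
print(trans("ABC", "ABC", "x"))          # xxx  (final char replicated)
print(trans("hello", "l", ""))           # heo  (null right side deletes)
```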

Either or both sides of the pair may also be Array objects:

$str.=trans( ['A'..'C'] => ['a'..'c'], <X Y Z> => <x y z> );

The array version is the underlying primitive form: the string form is exactly equivalent to first doing .. expansion, then splitting the string into individual characters, and then using the result as an array.

The array version can map one-or-more characters to one-or-more characters:

$str.=trans( [' ', '<', '>', '&'] => ['&nbsp;', '&lt;', '&gt;', '&amp;'] );

In the case that more than one sequence of input characters matches, the longest one wins. In the case of two identical sequences the first in order wins.
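The longest-sequence rule, and the way matching resumes after the end of a multi-character match, can be sketched in Python (an illustrative model with invented names, not the actual implementation):

```python
def trans_seq(s, pairs):
    """At each position, the longest matching left-hand sequence wins;
    among equal-length candidates, the earlier pair in the list wins."""
    out, i = [], 0
    while i < len(s):
        best = None
        for frm, to in pairs:  # earlier pairs are examined first
            if s.startswith(frm, i) and (best is None or len(frm) > len(best[0])):
                best = (frm, to)  # only a strictly longer match replaces
        if best is not None:
            out.append(best[1])
            i += len(best[0])  # resume after the whole matched sequence
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

print(trans_seq("<b>&co", [("<", "&lt;"), (">", "&gt;"), ("&", "&amp;")]))
# &lt;b&gt;&amp;co
print(trans_seq("aba", [("a", "Y"), ("ab", "X")]))  # XY: 'ab' beats 'a'
```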

As with the string form, missing righthand elements replicate the final element, and a null array results in deletion instead.

The recognition done by the string and array forms is very basic. To achieve greater power, any recognition element of the left side may be specified by a regex that can do character classes, lookahead, etc.

These submatches are mixed into the overall match in exactly the same way that they are mixed into parallel alternation in ordinary regex processing, so longest token rules apply across all the possible matches specified to the transliteration operator. Once a match is made and transliterated, the parallel matching resumes at the new position following the end of the previous match, even if it matched multiple characters.

If the right side of the arrow is a closure, it is evaluated to determine the replacement value. If the left side was matched by a regex, the resulting match object is available within the closure.

The .match and .subst methods support the adverbs of m// and s/// as named arguments, so you can write

$str.match(/pat/, :g)

as an equivalent to

$str.comb(/pat/, :match)

There is no syntactic sugar here, so in order to get deferred evaluation of the replacement you must put it into a closure. The syntactic sugar is provided only by the quotelike forms. First there is the standard "triple quote" form:

s/pattern/replacement/

Only non-bracket characters may be used for the "triple quote". The right side is always evaluated as if it were a double-quoted string regardless of the quote chosen.

As with Perl 5, a bracketing form is also supported, but unlike Perl 5, Perl 6 uses the brackets only around the pattern. The replacement is then specified as if it were an ordinary item assignment, with ordinary quoting rules. To pick your own quotes on the right, just use one of the q forms. The substitution above is equivalent to:

s[pattern] = "replacement"

This is not a normal assignment, since the right side is evaluated each time the substitution matches (much like the pseudo-assignment to declarators can happen at strange times). It is therefore treated as a "thunk", that is, it will be called as a chunk of code that creates a dynamic scope but not a lexical scope. (You can also think of a thunk as a closure that uses the current lexical scope parasitically.) In fact, it makes no sense at all to say

s[pattern] = { doit }

because that would try to substitute a closure into the string.

Any scalar assignment operator may be used; the substitution macro knows how to rewrite such a form into the equivalent .subst call, with the right side wrapped in a closure.

(The actual implementation of s/// must return a Match to make smartmatch work right. A naive rewrite to .subst would merely return the changed string.)

So, for example, you can multiply every dollar amount by 2 with:

s:g[\$ <( \d+ )>] *= 2

(Of course, the optimizer is free to do something faster than an actual method call.)

You'll note from the last example that substitutions only happen on the "official" string result of the match, that is, the portion of the string between the $/.from and $/.to positions. (Here we set those explicitly using the <(...)> pair; otherwise we would have had to use lookbehind to match the $.)
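Python's re.sub offers a loose analogy, using a capture group to play the role that the <(...)> markers play in limiting the "official" match to the digits (this is a comparison only, not the Perl 6 API):

```python
import re

# Only the captured digits are rewritten; the literal $ stays in place,
# much as the <( ... )> markers confine the substitution in Perl 6.
doubled = re.sub(r"\$(\d+)",
                 lambda m: "$" + str(int(m.group(1)) * 2),
                 "lunch was $12, dinner $20")
print(doubled)  # lunch was $24, dinner $40
```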

Please note that the :ii/:samecase and :mm/:samemark switches are really two different modifiers in one, and when the compiler desugars the quote-like forms it distributes semantics to both the pattern and the replacement. That is, :ii on the replacement implies a :i on the pattern, and :mm implies :m. The proper method equivalents must distribute the adverbs in the same way.

It is specifically not required of an implementation that it treat the regexes as generic with respect to case and mark. Retroactive recompilation is considered harmful. If an implementation does do lazy generic case and mark semantics, it is erroneous and non-portable for a program to depend on it.

One other difference between the s/// and .subst forms is that, while .subst returns the modified string (and cannot, therefore, be used as a smart matcher), the s/// form always returns either a Match object to indicate to smartmatch that it was successful, or a Nil value to indicate that it was not.

Likewise, for both m:g matches and s:g substitutions, there may be multiple matches found. These constructs must still continue to work under smartmatching while returning a list of matches. Fortunately, List is one of the distinguished types that a matcher may return to indicate success or failure. So these constructs simply return the list of successful matches, which will be empty (and hence false) if no matches occurred.
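Python shows the same idea with re.findall, whose empty-list result is falsy (an analogy only, not the Perl 6 mechanism):

```python
import re

hits = re.findall(r"\d+", "a1 b22 c")
print(bool(hits), hits)   # a non-empty list of matches counts as success

misses = re.findall(r"\d+", "abc")
print(bool(misses))       # an empty list is false: no matches occurred
```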

To anchor to a particular position in the general case you can use the <at($pos)> assertion to say that the current position is the same as the position object you supply. You may set the current match position via the :c and :p modifiers.

However, please remember that in Perl 6 string positions are generally not integers, but objects that point to a particular place in the string regardless of whether you count by bytes or codepoints or graphemes. If used with an integer, the at assertion will assume you mean the current lexically scoped Unicode level, on the assumption that this integer was somehow generated in this same lexical scope. If this is outside the current string's allowed Unicode abstraction levels, an exception is thrown. See S02 for more discussion of string positions.

Buf types are based on fixed-width cells and can therefore handle integer positions just fine, and treat them as array indices. In particular, buf8 (also known as buf) is just an old-school byte string. Matches against Buf types are restricted to ASCII semantics in the absence of an explicit modifier asking for the array's values to be treated as some particular encoding such as UTF-32. (This is also true for those compact arrays that are considered isomorphic to Buf types.) Positions within Buf types are always integers, counting one per unit cell of the underlying array. Be aware that "from" and "to" positions are reported as being between elements. If matching against a compact array @foo, a final position of 42 indicates that @foo[42] was the first element not included.

Anything that can be tied to a string can be matched against a regex. This feature is particularly useful with input streams:

my $stream := cat $fh.lines; # tie scalar to filehandle

# and later...

$stream ~~ m/pattern/; # match from stream

Any non-compact array of mixed strings or objects can be matched against a regex as long as you present them as an object with the Str interface, which does not preclude the object having other interfaces such as Array. Normally you'd use cat to generate such an object:

@array.cat ~~ / foo <,> bar <elem>* /;

The special <,> subrule matches the boundary between elements. The <elem> assertion matches any individual array element. It is the equivalent of the "dot" metacharacter for the whole element.

If the array elements are strings, they are concatenated virtually into a single logical string. If the array elements are tokens or other such objects, the objects must provide appropriate methods for the kinds of subrules to match against. It is an assertion failure to match a string-matching assertion against an object that doesn't provide a stringified view. However, pure object lists can be parsed as long as the match (including any subrules) restricts itself to assertions like:

<.isa(Dog)>
<.does(Bark)>
<.can('scratch')>

It is permissible to mix objects and strings in an array as long as they're in different elements. You may not embed objects in strings, however. Any object may, of course, pretend to be a string element if it likes, and so a Cat object may be used as a substring with the same restrictions as in the main string.

Please be aware that the warnings about .from and .to returning opaque objects go double for matching against an array, where a particular position reflects both a position within the array and (potentially) a position within a string of that array. Do not expect to do math with such values. Nor should you expect to be able to extract a substr that crosses element boundaries. [Conjecture: Or should you?]

To match against every element of an array, use a hyper operator:

@array».match($regex);

To match against any element of the array, it suffices to use ordinary smartmatching:

@array ~~ $regex

To provide implementational freedom, the $/ variable is not guaranteed to be defined until the pattern reaches a sequence point that requires it (such as completing the match, or calling an embedded closure, or even evaluating a submatch that requires a Perl expression for its argument). Within regex code, $/ is officially undefined, and references to $0 or other capture variables may be compiled to produce the current value without reference to $/. Likewise a reference to $<foo> does not necessarily mean $/<foo> within the regex proper. During the execution of a match, the current match state is actually stored in a $¢ variable lexically scoped to an appropriate portion of the match, but that is not guaranteed to behave the same as the $/ object, because $/ is of type Match, while the match state is of a type derived from Cursor.

In any case this is all transparent to the user for simple matches; and outside of regex code (and inside closures within the regex) the $/ variable is guaranteed to represent the state of the match at that point. That is, normal Perl code can always depend on $<foo> meaning $/<foo>, and $0 meaning $/[0], whether that code is embedded in a closure within the regex or outside the regex after the match completes.