The highest number of parentheses. It will end up being identical to nparen, but it is incremented during the initial pass, so that on the second pass (the tree-building), it can distinguish back-references from octal escapes. (The source code to Perl's regex compiler does the same thing.)

An array reference of flag values. When a scope is entered, the top value is copied and pushed onto the stack. When a scope is left, the top value is popped and discarded.

It is important to do this copy-and-push before you do any flag-parsing, if you're adding a handle that might parse flags, because you do not want to accidentally affect the previous scope's flag values.
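The copy-and-push discipline can be sketched in a few lines. This is a minimal, self-contained illustration, not the module's own code: the hash key `flags` and the helpers `enter_scope` and `leave_scope` are assumptions made for the example, and 0x04 is the /i value from the flag values listed in this document.

```perl
use strict;
use warnings;

# Hypothetical parser state: a stack of flag values, outermost scope first.
my $S = { flags => [0] };

# Entering a scope: copy the top value and push it.
sub enter_scope { my ($S) = @_; push @{ $S->{flags} }, $S->{flags}[-1]; }

# Leaving a scope: pop the top value and discard it.
sub leave_scope { my ($S) = @_; pop @{ $S->{flags} }; }

enter_scope($S);               # copy-and-push FIRST ...
$S->{flags}[-1] |= 0x04;       # ... then parse flags (here: turn on /i)
my $inner = $S->{flags}[-1];   # the /i bit is set inside the scope
leave_scope($S);
my $outer = $S->{flags}[-1];   # the outer scope's value was never touched
```

Had the flag been set before the push, the change would have landed on the outer scope's value and survived the pop.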

You may find it helpful to copy these to your sub-class. If you're curious why the regex value is a reference, and thus why I'm using ${&Rx} everywhere, it's because an lvalued subroutine returning a normal scalar doesn't work quite right with a regex that's supposed to update its target's pos(). This method, where it returns a reference to a scalar, makes it work (!).

These functions can only work if called with ampersands, and only if the parser object is the first value in @_. I made sure of this in my code; you should make sure in yours.
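The ampersand requirement comes from a plain Perl rule: calling `&name` with no parentheses hands the callee the caller's current @_. Here is a small sketch of why the parser object has to stay at the front of @_. The `Rx` below is a stand-in I wrote for the example, and `rx_ok` and `rx_broken` are hypothetical names.

```perl
use strict;
use warnings;

# Stand-in accessor: returns a REFERENCE to the regex string.
sub Rx { $_[0]{regex} }

# Works: @_ is left intact, so &Rx sees the parser as $_[0].
sub rx_ok {
    my ($S) = @_;
    return ${&Rx};
}

# Breaks: shift removes the parser from @_, so &Rx receives an empty list
# and no usable reference comes back.
sub rx_broken {
    my $S = shift;
    return eval { &Rx };
}

my $text   = 'a|b';
my $parser = { regex => \$text };
my $ok     = rx_ok($parser);       # the regex string itself
my $broken = rx_broken($parser);   # not a reference to the regex
```

Copying with `my ($S) = @_;` instead of `shift` is what keeps @_ intact for the ampersand calls.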

Matching against the regex is done in scalar context, globally, like so:

if (${&Rx} =~ m{ \G pattern }xgc) {
    # it matched
}

If the match fails, the pos() value won't be reset (due to the /c modifier). Remember to use scalar context. If you need to access capture groups, use the digit variables, but only if you're sure the match succeeded.
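To see the \G and /gc behavior in isolation, here is a sketch that runs against an ordinary string rather than the parser's regex. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = 'ab12';
my @log;

# Scalar-context match: consumes "ab" and leaves pos() after it.
if ($text =~ m{ \G [a-z]+ }xgc) {
    push @log, 'letters, pos=' . pos($text);
}

# This attempt fails, but /c keeps pos() where it was instead of resetting it.
if ($text =~ m{ \G [a-z]+ }xgc) {
    push @log, 'unexpected';
}
else {
    push @log, 'no match, pos=' . pos($text);
}

# The next attempt starts where the last success ended.
# $1 is read only inside the branch where the match succeeded.
if ($text =~ m{ \G (\d+) }xgc) {
    push @log, "digits=$1, pos=" . pos($text);
}
```

Without /c, the failed second match would have reset pos() to the start of the string, and the digits would never have been reached from the right position.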

This creates a node of package TYPE and sends the constructor whatever other arguments are included. This method takes care of building the proper inheritance for the node; it uses %Regexp::Parser::loaded to keep track of which object classes have been loaded already.

This method creates a method of the parser FLAG_$flag, and sets it to the code reference in $code. Example:

$parser->add_flag("u" => sub { 0x10 });

This makes 'u' a valid flag for your regex, and creates the method FLAG_u. This doesn't mean you can use the flag on qr//, but rather that you can write (?u:...) or (?u). The values 0x01, 0x02, 0x04, and 0x08 are used for /m, /s, /i, and /x in Perl's regexes.

The flag handler gets the parser object and a boolean as arguments. The boolean is true if the flag is going to be turned on, and false if it's going to be turned off. For (?i-s), FLAG_i would be called with a true argument, and FLAG_s would be called with a false one.

If the flag handler returns 0, the flag is removed from the resulting object's visual flag set, so (?ig-o) becomes (?i).
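The calling convention can be mimicked without the parser at all. The sketch below is a mock, not the module's implementation: `%flag_handler` stands in for the FLAG_* methods, `visual_flags` is a hypothetical helper, and the `@calls` log exists only to show which boolean each handler received.

```perl
use strict;
use warnings;

# Hypothetical handler table standing in for the FLAG_* methods.
# Each handler receives (parser, on/off boolean) and returns the flag's
# value, or 0 to drop the flag from the visual flag set.
my @calls;
my %flag_handler = (
    i => sub { push @calls, 'i:' . ($_[1] ? 'on' : 'off'); 0x04 },
    s => sub { push @calls, 's:' . ($_[1] ? 'on' : 'off'); 0x02 },
    g => sub { 0 },
    o => sub { 0 },
);

sub visual_flags {
    my ($parser, $spec) = @_;          # $spec is the text between "(?" and ")"
    my ($on, $off) = split /-/, $spec, 2;
    $on  = '' unless defined $on;
    $off = '' unless defined $off;
    my $keep_on  = join '', grep { $flag_handler{$_}->($parser, 1) } split //, $on;
    my $keep_off = join '', grep { $flag_handler{$_}->($parser, 0) } split //, $off;
    return '(?' . $keep_on . (length $keep_off ? "-$keep_off" : '') . ')';
}

my $first  = visual_flags(undef, 'i-s');    # i is called with true, s with false
my $second = visual_flags(undef, 'ig-o');   # g and o return 0 and are dropped
```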

There is a specific scheme to how you must name your handlers. If you want to install a handler for '&&', you must first install a handler for '&' that calls the handler for '&&' if it can consume an ampersand. Handle names that have no "predecessor" (that is, a '&&' without a '&') are pre-consumption: that is, they have not matched something yet. Handle names that do have a "predecessor" (that is, a '&&' with a '&') are post-consumption: they have already matched what they are named.

The handle 'atom' is pre-consumptive (because there is no 'ato' handle, basically). In order for the 'atom' handle to be executed, you must explicitly add it to the queue ($parser->{next}).

The handle '|' is post-consumptive. It happens to be executed when 'atom' matches a '|'. This means the handler for '|' does not need to match it; it has already been consumed.

If you created a handle for '&&' without a predecessor, you would have to add it explicitly to the queue for it to ever be executed. As such, it would be pre-consumptive.
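The naming scheme is easier to see in a toy dispatcher. This is a self-contained mock under my own assumptions: `%handler` is a plain hash rather than the parser's handler table, the handlers take a reference to the input string, and they return strings instead of node objects.

```perl
use strict;
use warnings;

my %handler;   # hypothetical handler table, keyed by handle name

# 'atom' has no predecessor ('ato'), so it is pre-consumptive:
# it must consume a character itself, then dispatch on it.
$handler{atom} = sub {
    my ($in) = @_;                         # reference to the input string
    return unless $$in =~ m{ \G (.) }xgcs;
    my $c = $1;
    return $handler{$c} ? $handler{$c}->($in) : "literal($c)";
};

# '&' is post-consumptive: 'atom' has already consumed the '&'.
# It hands over to '&&' only if it can consume a second ampersand.
$handler{'&'} = sub {
    my ($in) = @_;
    return $handler{'&&'}->($in) if $$in =~ m{ \G & }xgc;
    return 'and';
};

# '&&' has the predecessor '&', so both ampersands are consumed by now.
$handler{'&&'} = sub { 'double-and' };

my $input = 'a&b&&';
my @tokens;
while (defined(my $tok = $handler{atom}->(\$input))) {
    push @tokens, $tok;
}
```

Note that the '&&' handler never looks at the input: by the time it runs, what it is named after has already been matched.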

There is an interesting case of the right parenthesis ')'. There cannot be one without a matching left parenthesis '('; if there is an extra ')' a fatal error is thrown. However, the nature of 'atom' is to match a character, see if there's a handler installed, and call it if there is. I don't want atom to handle ')', so the handler is:

The name is 'c)' which has no predecessor 'c', so that means it is pre-consumptive, which is why it must match the right parenthesis itself. The handler throws an error if it can't match the ')', because if the 'c)' handler gets called, it's expected to match! It pops the flag stack, and returns an object.

Finally, if you want to add a new POSIX character class, its handler must start with "POSIX_".

For those of you who don't know, (?p{ ... }) is a synonym for the more common (??{ ... }). Using the 'p' form is deprecated, but is still allowed, so I delete its handler too. You can use this class to ensure that there are no code-execution statements in a regex:

use Regexp::NoCode;
my $p = Regexp::NoCode->new;
# if it failed, reject it how you choose
if (! $p->regex($regex)) {
    reject_regex(...);
}

Any regex containing those assertions will fail to compile and throw an error (specifically, RPe_NOTREC, "Sequence (?xx not recognized"). If you want to throw your own error, see "ERROR HANDLING".

That means that when an 'open' node is walked into, after it has been walk()ed, it will insert the matching 'close' node into the walking stack.

The purpose of adding an ending node to the walking stack is that ending nodes are all omitted from the tree because of the stacked nature of the tree. However, having them returned while walking the tree is helpful.

The walk() method is used to modify the walking stack before the node is returned. Here is the walk() method for all the quantifier and 'minmod' nodes:

The two additional arguments sent are the walking stack and the current depth in the walking stack. Elements are taken from the front of the walking stack, so we use unshift() to add them in the order they are to be encountered. The two code references are used to go deeper and shallower in scope: sub { -1 } is used to go down into a deeper scope, and sub { +1 } is used to come up out of it. In between these is $self->{data}, which is the node's child.
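The mechanics of the walking stack can be mocked up with plain hashes. Everything in this sketch is hypothetical: the nodes are hashes with `name` and `data` keys, `walk_node` and `walk_tree` are names I made up, and I am assuming the depth counts remaining levels (which is why going deeper subtracts one).

```perl
use strict;
use warnings;

# Hypothetical node: { name => ..., data => child node or undef }.
# Follows the description above: unshift the two markers and the child
# in the order they are to be encountered.
sub walk_node {
    my ($node, $ws, $depth) = @_;
    unshift @$ws, sub { -1 }, $node->{data}, sub { +1 }
        if $depth && $node->{data};
}

# Hypothetical walker: takes items from the FRONT of the walking stack.
# Assumption: $depth counts remaining levels, so -1 descends and +1 ascends.
sub walk_tree {
    my ($root, $depth) = @_;
    my @ws = ($root);
    my @seen;
    while (@ws) {
        my $item = shift @ws;
        if (ref $item eq 'CODE') { $depth += $item->(); next; }
        push @seen, $item->{name};
        walk_node($item, \@ws, $depth);
    }
    return @seen;
}

my $tree = { name => 'star', data => { name => 'group', data => { name => 'exact' } } };
my @shallow = walk_tree($tree, 1);    # stops one level down
my @full    = walk_tree($tree, -1);   # a negative depth never reaches 0
```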

OK, back to our Regexp::AndBranch example. Let me explain what the '&' metacharacter will mean. If you've used vim, you might know about its '\&' regex assertion. It's an "AND", much like '|' is an "OR". The vim regex /x\&y/ means "match y if x can be matched at the same location". Therefore it would be represented in Perl with a look-ahead around the left-hand branch: /(?=x)y/. We can expand this to any number of branches: /a\&b\&c\&d/ in vim would be /(?=a)(?=b)(?=c)d/ in Perl. We will support this with the '&' metacharacter.
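Before involving the parser, the rewrite itself can be tried on plain strings. This is only a string-level sketch of the idea: `and_to_lookahead` is a hypothetical helper, and it takes the branches as an already-split list rather than parsing '&' out of a regex.

```perl
use strict;
use warnings;

# Wrap every branch except the last in a positive look-ahead.
sub and_to_lookahead {
    my @branches = @_;
    my $last = pop @branches;
    return join('', map { sprintf '(?=%s)', $_ } @branches) . $last;
}

my $simple = and_to_lookahead(qw(a b c d));   # the a\&b\&c\&d case from above

# A more practical use: all three conditions must hold at the same position.
my $pattern = and_to_lookahead('.*\d', '.*[a-z]', '\w{6,}');
my $rx      = qr/^$pattern$/;
my $strong  = 'abc123' =~ $rx ? 1 : 0;   # has a digit, a letter, six word characters
my $weak    = 'abcdef' =~ $rx ? 1 : 0;   # no digit, so the first look-ahead fails
```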

We have added a handler for the '&' metacharacter, but now we need to write the supporting class for the Regexp::AndBranch::and object it creates!

A method call for a Regexp::MyRx::THING object will look in its own package first, then in Regexp::MyRx::__object__ (if it exists), then in Regexp::Parser::THING (if it exists), and finally in Regexp::Parser::__object__.
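One way to get that lookup order is ordinary @ISA inheritance with Perl's default depth-first method resolution. The sketch uses hypothetical Demo:: packages so it can run on its own; I am not claiming this is exactly how the module wires up its classes.

```perl
use strict;
use warnings;

{
    package Demo::Parser::__object__;
    sub new      { return bless {}, shift }
    sub in_all   { __PACKAGE__ }
    sub in_three { __PACKAGE__ }
    sub in_two   { __PACKAGE__ }
    sub in_one   { __PACKAGE__ }
}
{
    package Demo::Parser::THING;
    our @ISA = ('Demo::Parser::__object__');
    sub in_all   { __PACKAGE__ }
    sub in_three { __PACKAGE__ }
    sub in_two   { __PACKAGE__ }
}
{
    package Demo::MyRx::__object__;
    sub in_all   { __PACKAGE__ }
    sub in_three { __PACKAGE__ }
}
{
    package Demo::MyRx::THING;
    # Own package first, then the sub-class's generic object,
    # then the base THING (which inherits from the base generic object).
    our @ISA = ('Demo::MyRx::__object__', 'Demo::Parser::THING');
    sub in_all { __PACKAGE__ }
}

my $node  = Demo::MyRx::THING->new;
my @found = map { $node->$_() } qw(in_all in_three in_two in_one);
```

Each method name says how many of the four packages define it, and `@found` records which package actually answered the call.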

Here, @kids is an array that holds array references; each of those array references is the body of one and-branch. We will take the last one off and keep it as it is, and turn each of the others into a look-ahead. To make an object, we need to access $self->{rx}.

The 'ifmatch' object is a positive look-around assertion, and the argument of 1 means it's a look-ahead. We send the unrolled contents of the array reference as the contents of the look-ahead, and we're done. Now we just need to return the regex representation of our children:

Character classes are not returned all at once, but piece by piece. Because range checking ([a-z]) requires knowledge of the characters on the lower and upper side of the range, objects must be created during the first pass. To accomplish this, use force_object(), which creates an object regardless of what pass it's on.

Also note the RPe_BADESC warning takes two arguments: the character that was unexpectedly escaped, and a string. If the warning is called from a character class, pass " in character class"; otherwise, pass an empty string.