PAST Pattern Matching

Submitted by tcurtis on Sun, 05/16/2010 - 06:31

Last Wednesday, I discussed a little bit of the rationale behind my GSoC project and summarized the most low-level portion of my project: PAST::Walker. Today, I want to describe another portion of my project: PAST::Pattern. PAST::Walker provides a very powerful and complete interface. Any possible transformation or other traversal of a PAST should be implementable using it. However, it will not be very convenient if you only want to turn return nodes containing only a call node into tail-call nodes. PAST::Pattern is meant to be a higher-level approach to the analysis, traversal, and transformation of PASTs. Note that every aspect of the design of this or any other portion of my GSoC project is still tentative, and I welcome any feedback or suggestions concerning it. As I said last time, I can be reached on #parrot as tcurtis or by email at tyler.l.curtis@gmail.com or on parrot-dev.

A PAST::Pattern is to PASTs as a regex is to strings. Just as you could hand-write your textual pattern matching and parsing code in PIR instead of using NQP-rx and its convenient regexes, you can use the PAST::Walker API I described last Wednesday to do your PAST analysis and transformation work; however, you can also use a PAST::Pattern and let it handle the tedious work of walking a PAST, identifying the parts that you're interested in, and doing what you want with it.

To describe my (again, tentative) design for PAST::Patterns, I'll start by assuming that you have a PAST::Pattern lying around, and you want to do things with it. Your patterns will respond to two main methods: 'match' and 'subst', named after the analogous Perl 6 Regex methods. The 'match' method will take a PAST::Node argument and will return either a PAST::Pattern::Match result object or an array of PAST::Pattern::Match objects describing the result of attempting to match the node, depending on whether the pattern matched only once or more than once. The 'subst' method will take two arguments: a node, and an argument specifying what to replace matching parts of the node with. If this argument is a PAST::Node, then any parts of the first node that match the pattern will be replaced with a copy of the replacement node. If the replacement is not a PAST::Node, then it is assumed to be a Sub or something else with an invoke vtable, and it is invoked on each match result. The result of this call should be a PAST because it will be used to replace the original matching PAST. If there is some way of detecting whether a PMC's invoke vtable has been changed from the default, it could be used to throw a more informative exception if the supplied argument is neither a PAST nor invokable.
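To make the 'match' and 'subst' behaviour concrete, here is a rough sketch in Python rather than NQP or PIR. Everything in it is an assumption made for illustration: Node and Match are toy stand-ins for PAST::Node and the match result class, the predicate-based Pattern is hypothetical, and returning a failed Match when nothing matches is a guess, since the design above only covers one match or several.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a PAST::Node: a name plus child nodes."""
    name: str
    children: list = field(default_factory=list)


@dataclass
class Match:
    """Toy stand-in for a match result: success flag plus the matched node."""
    success: bool
    node: object = None


def walk(node):
    """Yield the node and all of its descendants, parents first."""
    yield node
    for child in node.children:
        yield from walk(child)


class Pattern:
    def __init__(self, predicate):
        # Hypothetical: a function deciding whether a single node matches.
        self.predicate = predicate

    def match(self, node):
        found = [Match(True, n) for n in walk(node) if self.predicate(n)]
        if not found:
            return Match(False)  # assumption: no match yields a failed Match
        # One match gives a single result, several give an array of results.
        return found[0] if len(found) == 1 else found

    def subst(self, node, replacement):
        if self.predicate(node):
            if isinstance(replacement, Node):
                return copy.deepcopy(replacement)  # a copy per replaced node
            return replacement(Match(True, node))  # invoked on the match result
        node.children = [self.subst(c, replacement) for c in node.children]
        return node


tree = Node("stmts", [Node("return", [Node("call")]),
                      Node("return", [Node("val")])])
returns = Pattern(lambda n: n.name == "return")
print(len(returns.match(tree)))                       # 2
tree = returns.subst(tree, lambda m: Node("tailcall", m.node.children))
print([child.name for child in tree.children])        # ['tailcall', 'tailcall']
```

The callable form of 'subst' is what makes rewrites like the tail-call example possible, because the replacement can reuse pieces of the node that matched.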

When you call the 'match' method of your pattern on some PAST, you'll get a PAST::Pattern::Match object. Like Perl 6's regex Match result objects, this object describes the result of the attempted match and includes a boolean success value, array indexing for numbered submatch captures (the equivalent of "(foo)" in a regex), and hash indexing for named submatch captures ("$<name> = bar"). When converted to boolean, the Match object produces true if the match was successful and false otherwise.
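The three behaviours of the match result (boolean conversion, array indexing, and hash indexing) could look roughly like this. It is a Python sketch only; the class layout and the field names positional and named are assumptions, not part of the design.

```python
class Match:
    """Toy match result: truthiness plus numbered and named captures."""

    def __init__(self, success, node=None, positional=(), named=None):
        self.success = success
        self.node = node
        self.positional = list(positional)   # numbered captures, like "(foo)"
        self.named = dict(named or {})       # named captures

    def __bool__(self):
        # Converting to boolean reports whether the match succeeded.
        return self.success

    def __getitem__(self, key):
        # Integer keys index numbered captures, any other key is a name.
        if isinstance(key, int):
            return self.positional[key]
        return self.named[key]


m = Match(True, node="return-node",
          positional=["call-node"], named={"callee": "sub-node"})
if m:
    print(m[0], m["callee"])   # call-node sub-node
```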

Patterns can also have child patterns, just as PAST::Nodes can have child nodes. These can be accessed using array interfaces (indexing, pushing/popping, shifting/unshifting, etc.). They can also typically be provided in a slurpy argument to a Pattern constructor.

Now that I've described the general interface that all PAST::Pattern objects share, I'll describe the specific PAST::Pattern subclasses so that you can know how you'll be able to actually create patterns to match and substitute. Each of these, and sometimes even particular attribute values for them, will have convenience functions in the PAST::Pattern namespace. For example, instead of PAST::Pattern::Block::new(), one can use PAST::Pattern::block().

The pattern classes can be separated into two categories: patterns that match specific node types, and abstract patterns that don't care specifically about what type of data they're matching on and are as applicable to regexes as they are to PAST::Patterns. I'll describe the latter first. The regex counterparts of some of these include concatenation ("foobar"), alternation ("foo|bar"), the dot which matches anything ("."), and numbered ("(foo)") and named ("$<name> = bar") captures. There are also a few that, as far as I know, don't have regex counterparts. For example, I don't know of a way to, in a regex, specify that a string must match both of two patterns. One can use PAST::Pattern::seq to create a PAST::Pattern::Sequence object, which will match PAST::Stmts or PAST::Block nodes whose children match each of the pattern's children in order. PAST::Pattern::Some patterns, created with PAST::Pattern::some, will match any PAST that matches any of their child patterns. PAST::Pattern::All patterns, created by PAST::Pattern::all, match only those PASTs that match all of their child patterns. PAST::Pattern::Complement patterns, created by PAST::Pattern::complement, match only those PASTs that do not match their child pattern. PAST::Pattern::Yes (or PAST::Pattern::yes) is equivalent to the "." in regexes: it matches anything. PAST::Pattern::No (or PAST::Pattern::no) is equivalent to PAST::Pattern::complement(PAST::Pattern::yes()): it never matches. For numbered captures, you can create a PAST::Pattern::Capture with PAST::Pattern::capture, and you can use PAST::Pattern::named to create a PAST::Pattern::Named pattern for named submatch capturing. You can create a PAST::Pattern::Backref to match the value that a numbered or named match variable is bound to with PAST::Pattern::backref(aNumber) or PAST::Pattern::backref("aName").
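The logical combinators are easiest to see in miniature. The following Python sketch models Some, All, Complement, Yes, and No as classes with a boolean matches method. The method name, the Is leaf pattern, and the use of plain strings as nodes are all assumptions made so the example runs on its own.

```python
class Pattern:
    def __init__(self, *children):
        # Child patterns arrive as a slurpy (variadic) argument.
        self.children = list(children)


class Yes(Pattern):
    def matches(self, node):
        return True          # like "." in a regex: matches anything


class No(Pattern):
    def matches(self, node):
        return False         # never matches


class Some(Pattern):
    def matches(self, node):
        return any(c.matches(node) for c in self.children)


class All(Pattern):
    def matches(self, node):
        return all(c.matches(node) for c in self.children)


class Complement(Pattern):
    def matches(self, node):
        return not self.children[0].matches(node)


class Is(Pattern):
    """Hypothetical leaf pattern: matches a node equal to a given value."""

    def __init__(self, value):
        super().__init__()
        self.value = value

    def matches(self, node):
        return node == self.value


call_or_val = Some(Is("call"), Is("val"))
print(call_or_val.matches("val"))                         # True
print(All(call_or_val, Complement(Is("val"))).matches("val"))   # False
print(Complement(Yes()).matches("anything"))              # False, same as No()
```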

Obviously, unless you want to match everything or nothing, to actually do anything useful with patterns, you have to have concrete patterns to match actual PAST nodes. These are straightforward, for the most part. There is one PAST::Pattern subclass for each PAST::Node subclass. For each attribute of each node class, there is an attribute on the corresponding pattern class. The pattern attributes can hold literal values for the attributes (typically PAST::Nodes, strings, or numbers), regexes or similar (essentially, anything with a 'match' method), or Subs. If none of these attributes are given a value, the pattern will match any node of the appropriate class. If an attribute is not given a value, it doesn't affect whether the node matches at all. If the attribute has a value that can be invoked, it is invoked on the value of the corresponding attribute of the node to be matched. If the invocation returns false, then the match fails. If the attribute has a value that has a 'match' method, the value of the appropriate attribute of the node to be matched must match the attribute value of the pattern. Otherwise, the pattern attribute value must be equal to the node attribute value.
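The attribute rules above amount to a small decision procedure, sketched here in Python. The Var and VarPattern classes, the UNSET sentinel, and the attribute_matches helper are hypothetical names chosen for the example. A compiled Python regex stands in for "anything with a 'match' method".

```python
import re

UNSET = object()   # marks a pattern attribute that was not given a value


def attribute_matches(expected, actual):
    """Apply the rules in order: unset, invokable, 'match' method, equality."""
    if expected is UNSET:
        return True                          # unset attributes never block a match
    if callable(expected):
        return bool(expected(actual))        # invoke it on the node's value
    if hasattr(expected, "match"):
        return bool(expected.match(actual))  # regex-like values
    return expected == actual                # literal values


class Var:
    """Toy stand-in for a PAST::Var node with two attributes."""

    def __init__(self, name, scope):
        self.name = name
        self.scope = scope


class VarPattern:
    ATTRIBUTES = ("name", "scope")

    def __init__(self, name=UNSET, scope=UNSET):
        self.name = name
        self.scope = scope

    def matches(self, node):
        if not isinstance(node, Var):
            return False                     # wrong node class
        return all(attribute_matches(getattr(self, attr), getattr(node, attr))
                   for attr in self.ATTRIBUTES)


node = Var("$x", "lexical")
print(VarPattern().matches(node))                                    # True
print(VarPattern(name=re.compile(r"\$"), scope="lexical").matches(node))  # True
print(VarPattern(name=lambda n: len(n) > 5).matches(node))           # False
```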

The node pattern classes will be PAST::Pattern::Block, PAST::Pattern::Stmts, PAST::Pattern::Var, PAST::Pattern::Val, PAST::Pattern::VarList, and PAST::Pattern::Op. Each of these has a convenience function with a similar name: PAST::Pattern::block, PAST::Pattern::stmts, etc. In general, each convenience function, as well as the actual constructors, will have a slurpy argument for child patterns. Some of the node pattern classes (PAST::Pattern::Block, PAST::Pattern::Op, and PAST::Pattern::Var) have a few specialized convenience functions. PAST::Pattern::Block's special convenience functions correspond to the valid values for the :blocktype attribute. Two of the special convenience functions for PAST::Pattern::Op focus on the :inline and :pirop attributes respectively, accepting as their first argument the value for the appropriate attribute, with the remainder of the arguments being treated as children. The others are specialized based on :pasttype. PAST::Pattern::callOp and the analogous functions (callmethodOp, tailcallOp) also accept :sub and :args keyword arguments for matching on the sub and the argument array of the call, regardless of where the sub is placed. PAST::Pattern::Var's special convenience functions correspond to the various values of the :scope attribute.

As always, I would very much appreciate any feedback. One question I'd like to have an answer to is whether anyone would have any use for non-global matching of PAST::Patterns. I plan to post again sometime next week with some example code for both PAST::Walker and PAST::Pattern.

Perl 6 regexes (which nqp-rx approximates) do in fact include a way to specify that multiple patterns must all match. Aside from the trivial option of using lookahead, e.g.
/ <?before ..o> fo. / # matches 'foo'
there is also the 'conjunction' operator, &, as opposed to the disjunction operator, |. It has the benefit of being declarative rather than procedural (though it has a procedural analogue in &&), with the tradeoff that it requires all patterns to be the same length (and of course have the same start point). The above pattern would therefore be expressed as:
/ [ ..o & fo. ] / # still matches 'foo', but is very likely much faster than lookahead