New directions for Piggy patterns

Up to now, I have been assuming in the Piggy transformation system that regular expressions and NFA recognizers can be used to match arbitrary ASTs. After all, a lot of research has been published on tree recognizers, including XPath, TreeRegex, and Tregex, which are NFA-based. Amid all this, I lost sight of the problem of trying to match subtrees within an AST. Unfortunately, pattern matching of ASTs is not regular, and so cannot be done by an NFA. This blog entry goes over why NFAs cannot be used, and how Piggy patterns should work.

XPath is regular; AST grammars are not regular

At issue is the difference between a grammar that recognizes a path through a tree vs. a grammar that recognizes an entire tree.

XPath is a language that matches a path from the root of the tree down to a particular node. A grammar that models the abbreviated form of an XPath expression is right-linear, e.g.:

path -> '/' ID path | '/' ID ;

Because every production is right-linear, the grammar is regular.
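To make this concrete, here is a small sketch (the node names are made up): an abbreviated path is just a flat sequence of '/name' steps, so an ordinary regular expression, and hence an NFA, can recognize it, while a whole serialized tree cannot even get past the first token.

```python
import re

# A path in abbreviated XPath form is a flat sequence of steps: /a/b/c.
# Such step sequences are right-linear, so a regular expression suffices.
path = re.compile(r'(/\w+)+')

print(path.fullmatch('/translation_unit/decl/name') is not None)  # a path matches
print(path.fullmatch('( decl ( name ) )') is not None)            # a whole tree does not
```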

An AST, as output from the Piggy serializer, is represented as a parenthesized expression. A grammar that recognizes a serialized AST is:

s -> '(' ID s* ')' ;

Ignoring the fact that this grammar uses the Kleene closure on a non-terminal, this grammar is not regular because the production contains a non-terminal in the middle of the right-hand side. The rule is self-embedding: recognizing the matched, arbitrarily nested parentheses requires a stack, i.e., a pushdown automaton rather than a finite automaton.
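To see the difference concretely, here is a minimal Python sketch of a recursive-descent recognizer for s -> '(' ID s* ')'. The recursion supplies exactly the stack that a finite automaton lacks:

```python
import re

def tokens(s):
    # split a serialized AST into parens and identifiers
    return re.findall(r'[()]|\w+', s)

def recognize(s):
    """Recursive-descent recognizer for  s -> '(' ID s* ')'  .
    The self-embedding rule needs a stack (here, the call stack)
    to pair up nested parentheses."""
    ts = tokens(s)
    pos = 0
    def node():
        nonlocal pos
        if pos >= len(ts) or ts[pos] != '(':
            return False
        pos += 1
        if pos >= len(ts) or not ts[pos].isalnum():
            return False
        pos += 1
        while pos < len(ts) and ts[pos] == '(':
            if not node():
                return False
        if pos >= len(ts) or ts[pos] != ')':
            return False
        pos += 1
        return True
    return node() and pos == len(ts)

print(recognize('( u ( v ) ( w ) )'))  # a well-formed serialized AST
print(recognize('( u ( v )'))          # unbalanced: rejected
```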

The conclusion is that Piggy expressions cannot be regular expressions, and they cannot be recognized by an NFA.

The “…” operator problem

In addition to the above problem, Piggy implicitly adds an operator to accept any tree nodes that are “irrelevant” to a matching pattern. This operator comes from Coccinelle, where it is used to match arbitrary code [Padioleau 2006, 2007] when the pattern needs to refer to two or more non-adjacent code blocks separated by code that is “irrelevant”. Piggy offers a similar operator but differs in a number of ways.

In Piggy, the “…” operator is equivalent to “.*”, where “.” refers to an AstParserParser.NodeContext or AstParserParser.AttrContext, and “*” is the Kleene operator (see grammar for node and attr ASTs in Piggy). This contrasts with Coccinelle, where “…” can refer to any code path, including ones that are nested within another block. In Piggy, the irrelevant code operator is inserted automatically in the user pattern, before every AstParserParser.NodeContext or AstParserParser.AttrContext. The reason for this is so the user doesn’t have to explicitly add it, which helps the readability of a pattern. In Coccinelle, the operator is explicit.
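As a sketch of these semantics (simplified here to flat lists of sibling labels, not real Piggy node contexts), the implicit irrelevant operator lets each pattern child be preceded by any number of skipped siblings, but the skipping never descends into a nested block:

```python
def match_children(pattern_children, tree_children):
    """Each pattern child must occur, in order, somewhere among the tree's
    children, as if '.*' were inserted before every pattern child.
    Unlike Coccinelle's '...', skipping stays at the sibling level."""
    j = 0
    for p in pattern_children:
        while j < len(tree_children) and tree_children[j] != p:
            j += 1          # skip an 'irrelevant' sibling
        if j == len(tree_children):
            return False
        j += 1
    return True

print(match_children(['a', 'c'], ['a', 'b', 'c']))  # 'b' is skipped implicitly
print(match_children(['c', 'a'], ['a', 'b', 'c']))  # order must be preserved
```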

Unfortunately, adding the irrelevant code operator “…” to a pattern makes the pattern ambiguous, meaning there are multiple parse trees for a given input. How should the recognizer resolve this ambiguity? For an NFA, I ordered the edges of the NFA and performed a greedy pattern match. Unfortunately, that approach does not carry over to a CFG recognizer.
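The ambiguity is easy to demonstrate with a sketch: counting the distinct derivations of a wildcard pattern over an input shows that a single input can have several parse trees.

```python
from functools import lru_cache

def count_parses(pattern, s):
    """Count the distinct ways `pattern` derives `s`, where '*' matches any
    run of symbols (including none). Each way is a separate parse tree."""
    @lru_cache(None)
    def go(i, j):
        if i == len(pattern):
            return 1 if j == len(s) else 0
        if pattern[i] == '*':
            # '*' may absorb 0..len(s)-j symbols; every choice is a distinct parse
            return sum(go(i + 1, k) for k in range(j, len(s) + 1))
        if j < len(s) and pattern[i] == s[j]:
            return go(i + 1, j + 1)
        return 0
    return go(0, 0)

print(count_parses('*a*', 'aba'))  # 2: the 'a' can bind to either occurrence
print(count_parses('**', 'xy'))    # 3: the input splits between the stars in 3 ways
```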

In any case, the irrelevant operator is essential for Piggy. But, adding it makes the resulting pattern highly ambiguous.

What should patterns in Piggy look like?

At this point, I need to re-evaluate what a Piggy pattern looks like. There are really three different possibilities for a pattern:

1. a collection of visitor or listener functions;
2. a collection of grammar rules with integrated text and code blocks;
3. a parse string with integrated text and code blocks.

Let’s examine each of these alternatives.

Piggy patterns as AST visitors and listeners

With this method, Piggy patterns are simply text and code blocks that are executed via an AST traversal. The pattern matcher traverses the tree using a pre- or post-order traversal, which can be specified for the pattern.

The main issue here is that the pattern is now specified as a code predicate that indicates a match. The LHS symbol of the pattern, “ast”, means that the code is executed when an “ast” node type is traversed.
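The original code sample is not reproduced here, but a minimal sketch (in Python, with hypothetical names: a Node class with a Child() accessor, mirroring the C# helper mentioned below) of such a predicate-style pattern might look like this:

```python
# Hypothetical sketch of a visitor-style Piggy pattern: the predicate runs
# whenever an "ast" node is visited and returns True to signal a match.
class Node:
    def __init__(self, name, *children):
        self.name, self.children = name, list(children)
    def Child(self, i):  # hypothetical helper: exact i-th child
        return self.children[i]

def ast_pattern(node):
    # match an "ast" node whose first child is a "decl" node
    return (node.name == "ast"
            and len(node.children) > 0
            and node.Child(0).name == "decl")

print(ast_pattern(Node("ast", Node("decl"))))  # predicate matches
print(ast_pattern(Node("ast", Node("expr"))))  # first child is wrong: no match
```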

There are several other issues with specifying patterns in this manner. One issue is that “return true;” must be used to tell the engine that the predicate matches. Another issue is that the runtime must provide appropriate routines to skip nodes for the irrelevant operator. In the above example, Child(0) means to match exactly the first child of a node. In all likelihood, the pattern will require for-loops, or Linq expressions, to scan the children. Lastly, it is difficult to visualize the subtree structure from C# code. That was always the reason Piggy uses parenthesized expressions for patterns.

The advantage of this method is that it works in much the same way as YACC or Antlr actions. The main disadvantage is that the rules must match exactly the syntax of the original grammar, so they must be carefully duplicated from it.

Antlr Tree Patterns

Antlr version 4 has a pattern matching mechanism to parse input according to a grammar and fill in placeholders for nodes of the parse tree. This feature appears useful for matching Piggy patterns, but it seems to have several shortcomings.

For example, consider the following code that takes a pattern and a Piggy AST, then tries to match the pattern against the tree.
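Antlr’s actual mechanism (a ParseTreePatternMatcher over a generated parser) cannot be shown self-contained here, so the following is a toy Python stand-in: a hypothetical Fun helper doing simplified token-level matching, with <ID>, <attr>, and <node> tags but no EBNF and no irrelevant operator, which reproduces the behavior described in the examples below.

```python
import re

def _tokens(s):
    # parens, <tags>, name="value" attrs, bare identifiers
    return re.findall(r'\(|\)|<[^>]+>|\w+="[^"]*"|\w+', s)

def Fun(pattern, tree):
    """Toy stand-in for Antlr tree-pattern matching: <ID> matches one
    identifier, <attr> one attribute, <node> one whole subtree. With no
    EBNF and no irrelevant operator, every token must line up."""
    p, t = _tokens(pattern), _tokens(tree)
    i = j = 0
    while i < len(p) and j < len(t):
        pt, tt = p[i], t[j]
        if pt == '<ID>':
            ok = re.fullmatch(r'\w+', tt) is not None
        elif pt == '<attr>':
            ok = '=' in tt
        elif pt == '<node>':
            if tt != '(':
                return False
            depth = 0
            while j < len(t):               # consume one whole subtree
                if t[j] == '(':
                    depth += 1
                elif t[j] == ')':
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            ok = depth == 0
        else:
            ok = pt == tt
        if not ok:
            return False
        i += 1
        j += 1
    return i == len(p) and j == len(t)

print(Fun('( <ID> <attr> )', '( u a="1" ( v ) )'))  # False: the child ( v ) has no counterpart
print(Fun('( <ID> ( <ID> ) )', '( u ( v ) )'))      # True: every token lines up
```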

However, Antlr tree patterns do not work for patterns that contain an irrelevant operator, since there is no way to specify EBNF within the pattern.

Fun("( <ID> <attr> )", @"( u a=""1"" ( v ) )"); => false

Tagged nodes must be a left-hand-side symbol of a rule. Again, there is no way to use EBNF in a parse string pattern.

Fun("( <ID> <node | empty> )", @"( u ( v ) )"); => throws exception.

Parse tree patterns might be more useful if tagged items could contain the irrelevant operator and EBNF syntax.

Conclusions

Some lessons are hard to learn. Piggy patterns cannot be expressed in an XPath-like syntax because they need to describe a whole subtree, not just a path. We have looked at a number of ways to express patterns. The most promising seems to be something like Antlr’s parse tree patterns, perhaps even pre-serialized in code. But it’s unclear how these would look in Piggy.
