NAME

Marpa::R2::Progress - Progress reports on your parse

About this document

This document describes the progress reports for Marpa's SLIF interface. These allow an application to know exactly where it is in the parse at any point. For parse locations of the user's choosing, progress reports list all the rules in play, and indicate the location at which the rule started, and how far into the rule parsing has progressed.

Progress reports are extremely useful in debugging grammars and the detailed example in this document is a debugging situation. Readers specifically interested in debugging a grammar should read the document on tracing problems before reading this document.

Introduction to Earley items

To read the show_progress output, it is important to have a basic idea of what Earley items are, and of what the information in them means. Everything that the user needs to know is explained in this section.

Dotted rules

Marpa is based on Jay Earley's algorithm for parsing. The idea behind Earley's algorithm is that you can parse by building a table of rules and where you are in those rules. "Where" means two things: location in the rule relative to the rule's symbols, and location relative to the parse's input stream.

Let's look at an example of a rule in a context-free grammar. Here's the rule for assignment from the Perl distribution's perly.y:

termbinop -> term ASSIGNOP term

ASSIGNOP is perly.y's internal name for the assignment operator. In plain Perl terms, this is the "=" character.

In parsing this rule, we can be at any of four possible locations. One location is at the beginning, before all of the symbols. The other three locations are immediately after each of the rule's three symbols.

Within a rule, position relative to the symbols of the rule is traditionally indicated with a dot. In fact, the symbol-relative rule position is very often called the dot location. Taken as a pair, a rule and a dot location are called a dotted rule.

Here's our rule with a dot location indicated:

termbinop -> · term ASSIGNOP term

The dot location in this dotted rule is at the beginning. A dot location at the beginning of a dotted rule means that we have not recognized any symbols in the rule yet. All we are doing is predicting that the rule will occur. A dotted rule with the dot before all of its symbols is called a prediction or a predicted rule.

Here's another dotted rule:

termbinop -> term · ASSIGNOP term

In this dotted rule, we are saying we have seen a term, but have not yet recognized an ASSIGNOP.

There's another special kind of dotted rule, a completion. A completion (also called a completed rule) is a dotted rule with the dot after all of the symbols. Here is the completion for the rule that we have been using as an example:

termbinop -> term ASSIGNOP term ·

A completion indicates that a rule has been fully recognized.
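To make the idea of a dotted rule concrete, here is a minimal sketch in plain Perl. It is not Marpa's internal representation: the hash layout and the helper names show_dotted_rule and dotted_rule_kind are hypothetical, chosen only for this illustration. The sketch marks the dot with "." and walks through the four dot locations of our example rule.

```perl
use strict;
use warnings;

# Hypothetical representation, for illustration only: a rule is a hash with
# an LHS symbol and a list of RHS symbols. A dotted rule is a rule plus a
# dot position, counted as the number of RHS symbols to the left of the dot.
my %rule = ( lhs => 'termbinop', rhs => [qw(term ASSIGNOP term)] );

# Render a dotted rule as text, using '.' to mark the dot location.
sub show_dotted_rule {
    my ( $rule, $dot ) = @_;
    my @rhs = @{ $rule->{rhs} };
    splice @rhs, $dot, 0, '.';
    return join q{ }, $rule->{lhs}, '->', @rhs;
}

# Classify a dotted rule by the position of its dot.
sub dotted_rule_kind {
    my ( $rule, $dot ) = @_;
    return 'prediction' if $dot == 0;
    return 'completion' if $dot == scalar @{ $rule->{rhs} };
    return 'in progress';
}

# A rule with three RHS symbols has four possible dot locations.
for my $dot ( 0 .. scalar @{ $rule{rhs} } ) {
    printf "%-12s %s\n", dotted_rule_kind( \%rule, $dot ),
        show_dotted_rule( \%rule, $dot );
}
```

The label "in progress" is our own shorthand for a dotted rule that is neither a prediction nor a completion.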

Earley items

The dotted rules contain all but one piece of the information that Marpa needs to track. The missing piece is the second of the two "wheres": where in the input stream. To associate input stream location and dotted rules, Marpa uses what are now called Earley items.

A convenient way to think of an Earley item is as a triple, or 3-tuple, consisting of dotted rule, origin and current location. The origin is the location in the input stream where the dotted rule starts. The current location (also called the dot location) is the location in the input stream which corresponds to the dot position.

In Marpa terms, G1 location is location in terms of the G1 subgrammar's Earley sets. When the term "location" is used in this document, it means G1 location unless otherwise indicated.

A user often finds it much more convenient to think in terms of line and column position in the input stream, instead of G1 location. Every G1 location corresponds to a range of positions in the input stream. When the term "position" is used in this document, it means input stream position, unless otherwise indicated.

Two noteworthy consequences follow from the way in which origin and current G1 location are defined. First, if a dotted rule is a prediction, then origin and current location will always be the same. Second, the input stream location where a rule ends is not tracked unless the dotted rule is a completion. In other cases, an Earley item does not tell us if a rule will ever be completed, much less at which location.
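The two consequences just described can be sketched as simple checks on an Earley item. This is an illustration under stated assumptions, not Marpa code: the hash keys (dot, rhs_len, origin, current) and the function names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical Earley item, for illustration only:
#   dot      number of RHS symbols already recognized
#   rhs_len  number of RHS symbols in the rule
#   origin   G1 location where the dotted rule starts
#   current  G1 location corresponding to the dot
sub is_prediction { my ($item) = @_; return $item->{dot} == 0; }
sub is_completion { my ($item) = @_; return $item->{dot} == $item->{rhs_len}; }

# First consequence: a prediction has the same origin and current location.
sub is_consistent {
    my ($item) = @_;
    return 0 if $item->{origin} > $item->{current};
    return 0 if is_prediction($item) and $item->{origin} != $item->{current};
    return 1;
}

# Second consequence: the end of a rule is known only for completions.
sub end_location {
    my ($item) = @_;
    return is_completion($item) ? $item->{current} : undef;
}

my $predicted = { dot => 0, rhs_len => 3, origin => 3, current => 3 };
my $partial   = { dot => 2, rhs_len => 3, origin => 2, current => 4 };
my $completed = { dot => 1, rhs_len => 1, origin => 0, current => 5 };

for my $item ( $predicted, $partial, $completed ) {
    my $end = end_location($item);
    print 'origin ', $item->{origin}, ', rule end ',
        ( defined $end ? $end : 'not known' ), "\n";
}
```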

The problem

For this example of debugging, I have taken a very simple prototype of a string expression calculator and deliberately introduced a problem. I've commented out one of the correct rules:

# <numeric assignment> ::= variable '=' <numeric expression>

and replaced it with an altered one:

<numeric assignment> ::= variable '=' expression

For those readers who like to look ahead (and I encourage you to be one of those readers) all of the code and outputs for this example are collected in the "Appendix".

This altered rule contains a mistake of the kind that is easy to make in actual practice. (In this case, an unlucky choice of naming conventions may have contributed.) The altered version will cause problems. In what follows, we'll pretend we don't already know where the problem is, and that in desk-checking the grammar our eye does not spot the mistake, so that we need to use the Marpa diagnostics and tracing facilities to "discover" it.

The example

The example we will use is a prototype string calculator. It's extremely simple, to make the example easy to follow. But it can be seen as a realistic example, if it is thought of as a very early stage in the incremental development of something useful.

At this stage of developing our string calculator, we have, for strings, assignment, variables, constants, concatenation and conversion of numerics to strings. For numerics, we have assignment, variables, constants, multiplication and addition.

We decide that, since string expressions and variables are the "default", we'll make the symbol names for numeric assignment and expressions explicit in the grammar: <numeric expression> and <numeric assignment>. But since strings are the default, we decide to call our string expressions simply <expression>, and to call our string assignments simply <assignment>. This seems like a good idea, but it is also likely to cause confusion. For the sake of our example we will pretend that it did.

The value of the parse

In debugging this issue, we'll look at the value of the parse first. The parse value differs from the other debugging aids we'll discuss. Every other debugging tool we will describe is always available, no matter how badly the parse failed. But if you have a problem parsing, you often won't get a parse value.

Our luck holds. Here's a dump of the parse value at the point of failure. It's a nice way to see what Marpa thinks the parse was so far.

If we were perceptive, we might spot the error here. Our parse is not quite right, and that shows up in the outer My_Nodes::expression -- it should be My_Nodes::numeric_expression. We'll assume that we don't notice this.

In fact, in the following, we'll pretend we haven't seen the dump of the parse value. We can't always get a parse value, so we don't want to rely on it.

Output from trace_terminals()

You can rely on getting the output from trace_terminals, and it is a good next place to check. Typically, you will be interested in the last tokens to be accepted. Sometimes that information alone is enough to make it clear where the problem is.

The full trace_terminals output for this example is in the Appendix. We see that the recognizer accepts the input as far as the multiplication sign ("*"), which it rejects. In Marpa, a lexeme is "acceptable" if it fits the grammar and the input so far. A lexeme is rejected if it is not acceptable.

A note in passing: Marpa shows the input string position of the tokens it accepts, discards and rejects. <whitespace> is supposed to be discarded and that was what happened at line 1, column 17. But the '*' that was next in the input was rejected, and that was not supposed to happen.

Output from show_progress()

Marpa's most powerful tool for debugging grammars is its progress report, which shows the Earley items being worked on. In the Appendix, progress reports for the entire parse are shown. Our example in this document is a very small one, so that producing progress reports for the entire parse is a reasonable thing to do in this case. If a parse is at all large, you will usually need to be selective.

The progress report that is usually of most interest is the one for the Earley set that you were working on when the error occurred. This is called the current location. In our example the current location is G1 location 5. By default, show_progress prints out only the progress reports for the current location.

Here are the progress reports for the current location, location 5, from our example.

Progress report lines

F19 @0-5 L1c1-16 :start -> statements .

The last field of each progress report line shows, in fully expanded form, the dotted rule we were working on. Prefixed to the dotted rule are three fields. In the example just above they are "F19 @0-5 L1c1-16". The "F19" says that this is a completed or final rule, and that it is rule number 19. The rule number is a convenient way to refer to a rule and is used when displaying the whole rule would take too much space.

The "@0-5" describes the G1 locations of the dotted rule in the parse. In its simplest form, the location field is two G1 location numbers, separated by a hyphen. The first G1 location number is the origin, the place where Marpa first started recognizing the rule. The last G1 location number is the dot location, the G1 location of the dot in a dotted rule. "@0-5" says that this rule began at G1 location 0, and that the dot is at G1 location 5.

Following the G1 location is the range of positions in the input string: "L1c1-16". This indicates that the origin of the dotted rule is at line 1, column 1, and that its dot position is after line 1, column 16.

The current location is also just after line 1, column 16, and at G1 location 5, and this is no coincidence. Whenever we are displaying the progress report for a G1 location, all the progress report lines will have their dot location at that G1 location.

As an aside, notice that the left hand side symbol is :start. That is the start pseudo-symbol. The presence of a completed start rule in our progress report indicates that if our input had ended at location 5, it would be a valid sentence in the language of our grammar. (And it is because the input at G1 location 5 was a valid sentence of the grammar, that we were able to look at the value of the parse at location 5 for debugging purposes.)

Let's look at another progress report line:

R11:2 @2-4 L1c3-13 expression -> expression '+' . expression

Here the "R11:2" indicates that this is rule number 11 (the "R" stands for rule number) and that its dot position is after the second symbol on the right hand side. Symbol positions are numbered using the ordinal of the symbol just before the position. Symbols are numbered starting with 1, and symbol position 2 is the position immediately after symbol 2.

Predicted rules also appear in progress reports:

P2 @3-3 L1c5-11 statement -> . <numeric assignment>

Here the "P" in the summary field means "predicted". Notice that in the predicted rule, the origin is the same as the dot location. This will always be the case with predicted rules.
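If you want to post-process the text of a progress report, the fields described above can be pulled apart with a regular expression. The following is a rough sketch, not part of Marpa: the function name parse_progress_line and the field names are hypothetical, and the pattern is inferred only from the sample lines shown in this document. It handles the simple form of the line, with a single origin and no instance count.

```perl
use strict;
use warnings;

# Our own labels for the three summary field letters.
my %kind_of = ( F => 'completed', R => 'in progress', P => 'predicted' );

# Parse one progress report line in its simplest form.
# Returns a hash reference, or nothing if the line does not match.
sub parse_progress_line {
    my ($line) = @_;
    return if $line !~ m{
        \A
        (?<type> [FRP] ) (?<rule> \d+ ) (?: : (?<dot> \d+ ) )?
        \s+ \@ (?<origin> \d+ ) - (?<current> \d+ )
        \s+ (?<range> L \S+ )
        \s+ (?<dotted_rule> \S .* )
        \z
    }xms;
    my %field = %+;    # copy the named captures that took part in the match
    $field{kind} = $kind_of{ $field{type} };
    return \%field;
}

my $parsed = parse_progress_line(
    q{R11:2 @2-4 L1c3-13 expression -> expression '+' . expression});
print "rule $parsed->{rule}, $parsed->{kind}, ",
    "origin $parsed->{origin}, dot location $parsed->{current}\n";
```

The "dot" key is present only for lines whose summary field carries a dot position, as in "R11:2".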

OK! Now to find the bug

If we look again at the progress reports at location 5, the location where things went wrong, we see that we have completed rules for <expression>, <numeric assignment>, <statement>, <statements>, as expected. We also see two Earley items that show that we are in the process of building another <expression>, and that it is expecting a '+' symbol.

What we want to know is, why is the recognizer not expecting an '*' symbol? Looking back at the grammar, we see that only one rule uses the '*' symbol. Here it is as part of a prioritized rule in the DSL:

No R19 predicted at G1 location 0. Next we look through the entire progress report, at all G1 locations, to see if R19 is predicted anywhere. No R19. Not anywhere.

The LHS of R19 is <numeric expression>. We look in the progress report for dotted rules where <numeric expression> is expected -- that is, dotted rules where <numeric expression> is the post-dot symbol. There are none.

Next we look for places in the progress reports where <numeric expression> occurs at all, whether post-dot or not. In the progress reports, <numeric expression> occurs in only two dotted rule instances. Here they are:

P10 @2-2 L1c3 expression -> . 'string' '(' <numeric expression> ')'

P10 @4-4 L1c13 expression -> . 'string' '(' <numeric expression> ')'

In both cases these are predictions of a string operator, the operator we plan to use for converting numerics to strings. They are just predictions, predictions which go no further because there is no 'string' operator in our input. That's fine, but why no other, more relevant, occurrences of <numeric expression>?
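The search we just did by eye, first for dotted rules where a symbol is the post-dot symbol and then for dotted rules where it occurs at all, can be sketched in a few lines of Perl. This is an illustration only: the helper names are hypothetical, and the tokenizing pattern is inferred from the progress report lines quoted in this document.

```perl
use strict;
use warnings;

# Split the RHS of a progress report line into tokens:
# <bracketed names>, 'quoted literals', or bare words (including the dot).
sub rhs_tokens {
    my ($line) = @_;
    my ( undef, $rhs ) = split m{ \s -> \s }xms, $line, 2;
    return if not defined $rhs;
    return $rhs =~ m{ ( < [^>]+ > | ' [^']* ' | \S+ ) }gxms;
}

# The symbol just after the dot, or undef for a completion.
sub postdot_symbol {
    my ($line) = @_;
    my @tokens = rhs_tokens($line);
    for my $i ( 0 .. $#tokens ) {
        next if $tokens[$i] ne '.';
        return $tokens[ $i + 1 ];
    }
    return;
}

# Lines where $symbol is expected, that is, where it is the post-dot symbol.
sub lines_expecting {
    my ( $symbol, @lines ) = @_;
    return grep {
        my $postdot = postdot_symbol($_);
        defined $postdot and $postdot eq $symbol
    } @lines;
}

# Lines where $symbol occurs on the RHS at all, post-dot or not.
sub lines_mentioning {
    my ( $symbol, @lines ) = @_;
    my @found;
    for my $line (@lines) {
        push @found, $line if grep { $_ eq $symbol } rhs_tokens($line);
    }
    return @found;
}

my @report = (
    q{P10 @2-2 L1c3 expression -> . 'string' '(' <numeric expression> ')'},
    q{R11:2 @2-4 L1c3-13 expression -> expression '+' . expression},
    q{P10 @4-4 L1c13 expression -> . 'string' '(' <numeric expression> ')'},
    q{P2 @3-3 L1c5-11 statement -> . <numeric assignment>},
);
my @expecting  = lines_expecting( '<numeric expression>', @report );
my @mentioning = lines_mentioning( '<numeric expression>', @report );
print scalar(@expecting), " expecting, ", scalar(@mentioning), " mentioning\n";
```

The @report array here holds only the lines quoted in this document, not the full progress report from the Appendix.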

We look back at the grammar. Aside from the rule for the 'string' operator, <numeric expression> occurs on a RHS in two places. One is in the prioritized rule which defines <numeric expression>.

This rule will never put <numeric expression> into the Earley items unless there is a <numeric expression> already there. But that is not its job. This rule is just fine and does not need fixing.

That leaves one rule to look at.

<numeric assignment> ::= variable '=' expression

This rule is one that should lead to the prediction of a new <numeric expression> in our example. And now we see our problem. This rule is never leading to the prediction of a new <numeric expression>, because there is no <numeric expression> on its RHS, or for that matter anywhere else in it. On the RHS, where we wrote <expression>, we should have written <numeric expression>. Change that and the problem is fixed.

Complications

We have finished our main example. This section discusses some aspects of debugging which did not arise in the example, and which might be unexpected.

Empty rules

When a symbol is nulled in your parse, show_progress shows only the nulled symbol. It does not show the symbol's expansion into rules, or any of its nulled child symbols. This reduces clutter, and usually one does not notice the missing nulled rules and symbols. Not showing these seems to be the intuitive way to treat them.

Input string ranges

G1 locations run in a monotonic sequence, starting with 0. G1 locations never run backwards, they are never visited twice, and they leave no gaps.

Input string positions, on the other hand, can do all of these things. An application is allowed to jump around in the input. An input string position may be encountered more than once. It is quite possible to write your application so that it encounters, for example, line 42 before line 7. And your application does not have to visit line 42 on its way from line 41 to line 43. For that matter, an application does not ever have to visit any position in its input.

How does Marpa deal with this when reporting input string ranges? Marpa always reports the minimum range that includes all the input string positions visited in the dotted rule. The range is always reported in increasing numeric order, even when the position at the end of the range was visited before the input string position at the beginning of the range. And, if necessary to include all visited input string positions, the range may include input string positions which were not visited.

Most applications move forward continuously in the input string, and if yours is one of them, you don't have to worry about these issues. But if you do unusual things when reading the input, it helps to be aware of how input string ranges are reported by Marpa when tracing and debugging.
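The range rule described above amounts to taking the smallest and largest positions visited, regardless of the order of the visits. Here is a minimal sketch of that idea. It is not Marpa's implementation: the function name covering_range and the [line, column] pair representation are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Given the input string positions visited, each as [ line, column ],
# return the start and end of the minimum range that includes them all.
# The order in which the positions were visited does not matter.
sub covering_range {
    my (@visited) = @_;
    return if not @visited;
    my @sorted =
        sort { $a->[0] <=> $b->[0] or $a->[1] <=> $b->[1] } @visited;
    return ( $sorted[0], $sorted[-1] );
}

# An application that visits line 42 before line 7, and skips the lines
# between 7 and 41.
my ( $start, $end ) = covering_range( [ 42, 1 ], [ 7, 5 ], [ 41, 3 ] );
printf "line %d, column %d to line %d, column %d\n", @{$start}, @{$end};
```

In this example the reported range runs from line 7 to line 42, and so includes lines 8 through 40, which were never visited.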

Multiple instances of dotted rules

It does not happen in our main example for this document, but a dotted rule can appear in the same Earley set more than once. In fact, this happens frequently. When it does happen, the lines in the progress report will look like these

These are some of the progress report lines for an indirect right recursion, one that recurses from a <plain assignment> symbol to an <expression> symbol, and then to an <assignment> symbol, before completing the recursion by returning to a <plain assignment>.

In each of the three lines, notice that a new field appears second. This second field is variously "x13", "x20" or "x12". These are counts, indicating the number of instances of that dotted rule at the dotted rule's G1 dot location. Every dotted rule instance will have the same G1 location, but the instances may have many different origins -- hundreds or even more. In each of the three report lines above, the G1 dot location is 41.

Note that when parsing, Marpa handles the long series of Earley items generated by right recursions very efficiently. It uses a technique invented by Joop Leo to memoize and eliminate them. When a progress report is requested at a G1 location, the Leo-memoization is unfolded, and the full list of Earley items is reported.

Each instance may have its own span in the input string, and the input string range will include them all. When there are many instances of a dotted rule at a single location, the origins in the location field are shown as a range, with the earliest separated from the most recent by a "...". For example, above, where the first four fields were "F7 x12 @0...38-41 L1c1-L2c40", that tells us that the dotted rule is rule 7, which has 12 instances. All 12 instances have their dot location at G1 location 41, but their origins are in the range from G1 location 0 to G1 location 38.

The last field in "F7 x12 @0...38-41 L1c1-L2c40" is an input string range. "L1c1-L2c40" says that input string positions visited by the 12 instances start at line 1, column 1, and end at line 2, column 40. The reported input string range will be the shortest range that includes all of the input string positions visited by any of the dotted rule instances.

If there are only a few origins, Marpa may explicitly list them all. In the following example, there are only 2 instances of this rule, both with a dot location of 41. Their origins are at G1 locations 8 and 18. The range of input string positions is from line 1, column 17 to line 2, column 40.

F2 x2 @8,18-41 L1c17-L2c40 assignment -> <divide assignment> .
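The three forms of the location field seen in this document ("@0-5", "@8,18-41" and "@0...38-41") can be decoded along the following lines. This is a sketch under assumptions: the function name parse_location_field and the returned hash keys are hypothetical, and the accepted syntax is inferred only from those three samples.

```perl
use strict;
use warnings;

# Decode the location field of a progress report line.
# Returns a hash reference with the dot location, the earliest and latest
# origins and, when the origins are listed explicitly, the full list.
sub parse_location_field {
    my ($field) = @_;
    my ( $origins, $dot ) = $field =~ m{ \A \@ (.+) - (\d+) \z }xms
        or return;
    my %location = ( dot => $dot );
    if ( $origins =~ m{ \A (\d+) [.]{3} (\d+) \z }xms ) {
        # Range form: only the earliest and most recent origins are shown.
        @location{qw(earliest latest)} = ( $1, $2 );
    }
    else {
        # List form: every origin is shown, separated by commas.
        my @list = split /,/, $origins;
        $location{origins} = \@list;
        @location{qw(earliest latest)} = @list[ 0, -1 ];
    }
    return \%location;
}

for my $field ( '@0-5', '@8,18-41', '@0...38-41' ) {
    my $location = parse_location_field($field);
    print "$field: origins from $location->{earliest} ",
        "to $location->{latest}, dot at $location->{dot}\n";
}
```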

Access to the "raw" progress report information

This section deals with the progress() recognizer method, which allows access to the raw progress report information. This method is not needed for typical debugging and tracing situations. It is intended for applications which want to leverage Marpa's "situational awareness" in innovative ways.

progress()

my $report0 = $slr->progress(0);

my $latest_report = $slr->progress();

Given the G1 location (Earley set ID) as its argument, the progress() recognizer method returns a reference to an array of "report items". The G1 location may be given as a negative number. An argument of -X will be interpreted as G1 location N-X+1, where N is the latest Earley set. This means that an argument of -1 indicates the latest Earley set, an argument of -2 indicates the Earley set just before the latest one, etc.
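The treatment of negative arguments can be sketched as a small helper. This is not Marpa code, and resolve_g1_location is a hypothetical name. It simply mirrors the convention that -1 means the latest Earley set and -2 the one just before it.

```perl
use strict;
use warnings;

# Convert a progress() style location argument to an absolute G1 location.
# $latest is the ID of the latest Earley set.
sub resolve_g1_location {
    my ( $arg, $latest ) = @_;
    return $arg if $arg >= 0;
    return $latest + $arg + 1;    # -1 gives $latest, -2 gives $latest - 1
}

# With 5 as the latest Earley set, as in our example parse:
for my $arg ( 0, 3, -1, -2 ) {
    print "argument $arg is G1 location ", resolve_g1_location( $arg, 5 ), "\n";
}
```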

Each report item is a triple: an array of three elements. The three elements are, in order, rule ID, dot position, and origin. The data returned by the two displays above, as well as the data for the other G1 locations in our example, are shown below.

Dot position is -1 for completions, and 0 for predictions. Where the report item is not for a completion or a prediction, dot position is N, where N is the number of RHS symbols successfully recognized at the G1 location of the progress report.

Origin is the G1 location (Earley set ID) at which the rule application reported by the report item began. For a prediction, origin will always be the same as the G1 location of the progress report.
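As a rough illustration of how an application might interpret report items, the sketch below classifies each triple according to the conventions just described. The function name describe_report_item is hypothetical, and the sample items are made up for this illustration rather than taken from the example parse. In a real application the items would come from the progress() method.

```perl
use strict;
use warnings;

# Interpret one report item: [ rule ID, dot position, origin ].
# $location is the G1 location for which the report was requested.
sub describe_report_item {
    my ( $item, $location ) = @_;
    my ( $rule_id, $dot, $origin ) = @{$item};
    my $kind =
          $dot == -1 ? 'completion'
        : $dot == 0  ? 'prediction'
        :              'in progress';
    return {
        rule_id    => $rule_id,
        kind       => $kind,
        origin     => $origin,
        location   => $location,
        # Number of RHS symbols recognized, known from the item only
        # when it is neither a completion nor a prediction.
        recognized => ( $dot > 0 ? $dot : undef ),
    };
}

# Made-up report items for a hypothetical G1 location 4.
my @items = ( [ 7, -1, 0 ], [ 3, 0, 4 ], [ 5, 2, 2 ] );
for my $item (@items) {
    my $description = describe_report_item( $item, 4 );
    print "rule $description->{rule_id}: $description->{kind}, ",
        "origin $description->{origin}\n";
}
```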

Progress reports and efficiency

When progress reports are used for production parsing, instead of just for debugging and tracing, efficiency considerations become significant. Progress reports themselves are implemented in optimized C, and that logic is very fast. However, the use of progress reports usually implies considerable post-processing in Perl. It is almost always possible to use Marpa's named events instead of progress reports, and solutions using named events are usually better targeted, simpler and faster.

If you do decide to use progress reports in an application, you should be aware of the efficiency considerations when there are right recursions in the grammar. For most purposes, Marpa optimizes right recursions, so that they run in linear time. However, to create a progress report every potential right recursion must be fully unfolded, and at each G1 location the number of these grows linearly with the length of the recursion. If you are creating progress reports for more than a limited number of G1 locations, this means processing that can be quadratic in the length of the recursion. When a right recursion is lengthy, the impact on speed can be very serious.
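To see where the quadratic cost comes from, here is a back-of-the-envelope calculation. It assumes, as a simplification, that a right recursion of length n contributes exactly k report items at its k-th G1 location, and that a progress report is requested at every one of those locations. The function name total_report_items is hypothetical.

```perl
use strict;
use warnings;

# Total report items produced when a progress report is requested at each
# of the $n G1 locations of a right recursion, under the simplifying
# assumption that location k contributes k items.
sub total_report_items {
    my ($n) = @_;
    my $total = 0;
    $total += $_ for 1 .. $n;
    return $total;    # equals n * (n + 1) / 2
}

for my $n ( 10, 100, 1000 ) {
    print "recursion length $n: ", total_report_items($n), " report items\n";
}
```

Under this assumption, a tenfold increase in the length of the recursion means roughly a hundredfold increase in the number of report items to post-process.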

If lengthy right recursions are being expanded, this will be evident from the progress report itself, which will contain one report item for every completion in the right-recursive chain of completions. Note that the efficiency consideration just mentioned for following right recursions is never an issue for left recursions. Left recursions produce at most two report items per G1 location and are extremely fast to process. It is also not an issue for Marpa's sequence rules, because sequence rules are implemented internally as left recursions.

Appendix

Below are the code, the trace outputs and the progress report for the example used in this document.

Parse value at error location

Note that when there is a parse error, there will not always be a parse value. But sometimes the parse is "successful" enough, in a technical sense, to produce a value, and in those cases examining the value can be helpful in determining what the parser thinks it has seen so far.

show_rules() output

This is the G1 portion of the show_rules() output at verbosity level 3. In ordinary work, you'd use verbosity level 1 (the default), but the more verbose output is included here to illustrate the example.

progress() outputs

This section contains samples of the output of the progress() method -- the progress reports in their "raw" format. The output is shown in Data::Dumper format, with Data::Dumper::Indent set to 0 and Data::Dumper::Terse set to 1.

Copyright and License

Copyright 2014 Jeffrey Kegler
This file is part of Marpa::R2. Marpa::R2 is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
Marpa::R2 is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser
General Public License along with Marpa::R2. If not, see
http://www.gnu.org/licenses/.