
Introduction

The Spart library is an object-oriented recursive descent parser generator framework implemented in C#. In fact, it is a partial port of the excellent Spirit library, which is written in C++ and relies on template meta-programming.

The Spart framework enables a target grammar to be written exclusively in C#: an EBNF grammar can be closely matched using C# code. In contrast, conventional compiler-compilers or parser generators have to perform an additional translation step from the source EBNF code to C or C++ code.

I have taken the liberty of using the structure (and some sentences) of the Spirit documentation. Throughout the article, notes are added regarding issues raised by the port to C#: Spirit-to-Spart Notes (SSN).

As always, this article presents an overview of the library. For further details, please refer to the NDoc documentation. The library also comes with a battery of NUnit tests.

Quick Start

Spart is designed to bring parsing capabilities quickly and directly into your code. While it is not suited to creating parsers for entire languages like C or C++, it is very effective for building micro-grammars in your code.

When you need to build a new parser, there are existing solutions: a combination of ...Parse calls (like int.Parse), regular expressions (the Regex class), or a combination of both. However, these tools do not scale well when attempting to write more complex parsers: maintenance and readability become awkward.

So, as with Spirit, one of the main objectives of Spart is to let you build grammars easily in C#. To fix ideas, a few simple grammars illustrate Spart usage:

Trivial example #1:

Create a parser that will parse a digit:

Prims.Digit

(This is a trivial case: Char.IsDigit already does that.) Prims is a static class, and Digit is a property that creates a new parser for digits. In fact, Prims is a helper class that creates primitive parsers for you and hides implementation details.

SSN: This parser is equivalent to digit_p.
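Putting this parser to work takes one call to Parse with a scanner (both are introduced later in this article); a minimal usage sketch, with namespace imports omitted:

```csharp
// Hypothetical usage sketch: Parser.Parse and StringScanner are
// described later in this article.
Parser digit = Prims.Digit;
ParserMatch m = digit.Parse(new StringScanner("7"));

if (m.Success)
    Console.WriteLine("matched a digit");
```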

Trivial example #2:

Create a parser that will parse a sequence of two digits:

Ops.Seq( Prims.Digit, Prims.Digit )

Here you see the familiar Prims.Digit parser enclosed in an Ops.Seq call. Like Prims, Ops is a static helper class that creates combined parsers for you. The Seq method creates a parser that is a sequence of two parsers (>> in Spirit):

Ops.Seq(a,b) <=> match a and b in sequence

Note: when we combine parsers, we end up with a "bigger" parser, but it's still a parser. Parsers can get bigger and bigger, nesting more and more, but whenever you glue two parsers together, you end up with one bigger parser. This is an important concept.

SSN: In C#, the >> operator does not accept arbitrary operands: the right-hand operand must be of an integral type, which restricts its use.
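Since every combination is itself a Parser, intermediate results can be named and reused; a small sketch (the variable names are illustrative):

```csharp
// Each combination is itself a parser and can be combined further.
Parser twoDigits  = Ops.Seq(Prims.Digit, Prims.Digit); // a parser
Parser fourDigits = Ops.Seq(twoDigits, twoDigits);     // a bigger parser
// fourDigits matches exactly four consecutive digits, e.g. "2024"
```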

Trivial example #3

Create a parser that will accept an arbitrary number of digits. (Arbitrary means anything from zero to infinity.)

Ops.Star(Prims.Digit)

This is like the regular expression Kleene Star.

SSN: * cannot be overloaded as a unary operator in C#.

Less trivial example #4

Create a parser that parses a sequence of comma-separated digits and records them in a collection (note that this could easily be done using String.Split).
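Such a grammar, digit (',' digit)*, could be sketched as follows; Prims.Ch, for matching a single character, is an assumption, since this article only shows Prims.Digit:

```csharp
// digit (',' digit)* — a comma-separated list of digits.
// Prims.Ch(',') is an assumed helper for matching one character.
Parser list = Ops.Seq(
    Prims.Digit,
    Ops.Star(Ops.Seq(Prims.Ch(','), Prims.Digit)));

string s = "1,2,3,4";
ParserMatch m = list.Parse(new StringScanner(s));
```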

Parser is an abstract base class for all parsers. It contains the Parse method that can be used to parse some input.

The parser does not work directly on strings but rather on scanners, an abstraction over the input stream. Therefore, it is possible to parse directly from files or streams. StringScanner implements the scanner interface and wraps the string s.

ParserMatch is the parser result (see below)

Now that we have parsed the text, the ParserMatch object can help answer questions like: was the match successful, what was the matched value, and so on:

if (m.Success)
    Console.Write("successful match!");

Semantic Actions

Our parser above is nothing but a recognizer; it takes no action. It answers "did our data match the grammar?" but it does not record anything. Remember that we wanted to record the digits in a collection: whenever we parse a digit, we wish to store the parsed number after a successful match. We now wish to extract information from the parser. Semantic actions do this. Semantic actions may be attached to any point in the grammar specification. They are implemented through events and event handlers.

The Parser class has an event, Act, that is raised on a successful match. We need to add an event handler on the Prims.Digit parser that records the digit in a collection. First, we write an actor that will record the digits:
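A sketch of such an actor follows; the handler signature, the ActionEventArgs type, and the Prims.Ch helper are assumptions, as the article does not show the exact event contract:

```csharp
public class MyParser
{
    // The collection that receives the parsed digits.
    private ArrayList digits = new ArrayList();

    // Actor: records each successfully matched digit.
    // The (sender, args) signature and args.Value are assumptions.
    public void RecordDigit(object sender, ActionEventArgs args)
    {
        digits.Add(args.Value);
    }

    public Parser CreateParser()
    {
        Parser digit = Prims.Digit;
        digit.Act += new ActionHandler(RecordDigit); // attach the semantic action
        return Ops.Seq(digit,
                       Ops.Star(Ops.Seq(Prims.Ch(','), digit)));
    }
}
```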

This is the same parser as above, but now MyParser.RecordDigit is called on each successful digit match and, therefore, the collection is filled.

Basic concepts

Spart follows the concepts of Spirit. There are a few fundamental concepts that need to be understood well: 1) The Parser, 2) the Match, 3) The Scanner, and 4) Semantic Actions. These basic concepts interact with each other, and the functionalities of each interweave throughout the entire framework to make it one coherent whole.

I will go quickly over those concepts since they are very well explained in the Spirit documentation and I recommend you take a look there first.

The parser

Central to the framework is the parser. The parser does the actual work of recognizing a linear input stream of data read sequentially from start to end by the scanner. The parser attempts to match the input following a well-defined set of specifications in the form of grammar rules. The parser reports the success or failure to its client through a match object. When successful, the parser calls a client-supplied semantic action. Finally, the strategically-placed semantic action extracts structural information depending on the data passed to it by the parser and the hierarchical context of the parser it is attached to.

Parsers come in a lot of flavors and usually you don't need to write your own parser. Spart has a collection of built-in parsers that you can combine to create your grammars. The built-in parsers come in two (main) flavors:

Primitives

Primitive parsers can be used to match characters, strings, lower-case characters, digits, etc. The Prims class can be used to create such parsers.

Combination

Combination parsers can be used to combine parsers, like the sequence and star in the examples above. The Ops class can be used to create such parsers.

The match

The ParserMatch class describes the parser match.

The Scanner

Like the parser, the scanner is also an abstract concept, represented by the IScanner interface. The task of the scanner is to feed the sequential input data stream to the parser. The scanner consists of an input source and a cursor. The cursor is moved along by the parsers. Parsers extract data from the scanner and position the cursor appropriately through its members.

Semantic actions

A composite parser forms a hierarchy. Parsing proceeds from the topmost parent parser, which delegates and apportions the parsing task to its children, recursively to its children's children and so on, until a primitive is reached. By attaching semantic actions to various points in this hierarchy, we can in effect transform the flat, linear input stream into a structured representation. This is essentially what parsers do.

The Rule

The Rule class represents a non-terminal parser. Basically, it is a wrapper around another parser. This aspect will be illustrated in the example below.

The Classic Calculator Example

There is still a lot to say about Spirit and Spart, but I will cut to a final example. Better documentation should appear in the near future, as this library is brand new!

The favorite grammar example in the Spirit documentation is a calculator grammar:
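A sketch of that grammar in Spart, using the Rule class from the previous section; Ops.Alt (alternation), Prims.Ch, and the Rule.Parser property are assumptions not shown earlier in this article:

```csharp
Rule expr = new Rule(), term = new Rule(), factor = new Rule();

// integer ::= digit digit*
Parser integer = Ops.Seq(Prims.Digit, Ops.Star(Prims.Digit));

// factor  ::= integer | '(' expr ')'
// term    ::= factor (('*' factor) | ('/' factor))*
// expr    ::= term   (('+' term)   | ('-' term))*
factor.Parser = Ops.Alt(integer,
    Ops.Seq(Prims.Ch('('), Ops.Seq(expr, Prims.Ch(')'))));
term.Parser = Ops.Seq(factor,
    Ops.Star(Ops.Alt(Ops.Seq(Prims.Ch('*'), factor),
                     Ops.Seq(Prims.Ch('/'), factor))));
expr.Parser = Ops.Seq(term,
    Ops.Star(Ops.Alt(Ops.Seq(Prims.Ch('+'), term),
                     Ops.Seq(Prims.Ch('-'), term))));
```

Because Rule is a wrapper around another parser, expr can be referenced inside factor before its own definition is assigned, which is what makes the recursive grammar expressible.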


License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Jonathan de Halleux is a Civil Engineer in Applied Mathematics. He finished his PhD in 2004 in the rainy country of Belgium. After two years working on the Common Language Runtime (i.e., .NET), he is now working at Microsoft Research on Pex (http://research.microsoft.com/pex).

There is a feature that I could not find in Spart: abstemious (non-greedy) matching. I have to write an alternative match, and a default match for anything. But when I use Ops.Klenee(Prims.AnyChar), it is greedy, so it matches until the end of the text. Is there some way to make it abstemious? (like regex *?)

Spart clearly needs more examples. I am trying to find a parseable expression in a sequence of random symbols. I am not able to do it no matter how hard I try. This is my latest invention. I am using the version from the Palaso svn:

A newer version has been developed at http://code.wesay.org/Palaso/trunk/Palaso/Palaso/. It includes some minor changes and updates to take advantage of C# 2 (see also http://www.wesay.org/blogs/palaso/2007/09/30/10/).

I have recently become interested in writing a script engine for .NET conforming to similar, but not exact Lua-like syntax.

In doing so, I have found a grammar for Lua, and have been trying to implement it in Spart. I have started by modifying the basic variable declaration of Lua from using "local" to using "var", otherwise the remaining grammar has remained the same for this statement. So far I have tried to attach actions to various parsers within the statement, and the only action I can get to trigger is the "var" string parser action.

The test data I am using is a simple: var x = 123

The only action called from anything in the grammar seems to be for "var" as the value, and nothing else. I cannot get my identifierlist, or identifier independently, to trigger an action when parsed. As I understand it, I should be able to attach an action to any parser, and when that parser is matched it will execute the attached actions. However, if I create an action for identifiers and parse the above code, my action is not called and thus "x" is not printed for me.

It could be that my grammar is flawed, but it seems correct to me. I have never used Spirit or Spart before, and have only recently taken an interest in lex, yacc, and the descendant products of their influence. I would really appreciate an understanding of why some actions are called and not others. My goal is to provide a stack implementation and have the actions work against a stack to perform their duties. In this simple example, I should be able to act upon each token, pushing operations and variables onto the stack to be processed by the stack API. This example should allow me to push a declaration operator, followed by a list of identifiers, followed by an assignment operator, followed by a list of expressions which may be further evaluated, but in this case would only push a numeric value. The stack containing these operations would allow me to execute an order of operations such that the identifiers would be associated in a dictionary as keys with the values assigned (or null by default if the optional assignment is lacking).

Sorry for getting deep into this, but Spart looks like a viable option for my goal and I'd like to get it working before I digress to trying a combination of managed and unmanaged code using Spirit in MC++.

Thanks for your time, and thanks for your effort in porting a useful tool to the .NET world. Sure would have been nice to be able to implement all the operator overloads rather than falling back on static methods.

this looks extremely interesting to me, so i tried to build it but get (after converting from Visual Studio 1.1 to 2.0 i guess):
Error 1 Source file 'D:\3rd_party\Spart\Spart\Debug\Debugger.cs' could not be opened ('The system cannot find the file specified. ') Spart
Error 2 Source file 'D:\3rd_party\Spart\Spart\Debug\DebugContext.cs' could not be opened ('The system cannot find the file specified. ') Spart

are these intentionally missing? is there a convention i should know about?

But I think if you could add the functionality to specify the grammar in a separate text file, like lex/yacc, it would be very useful. I am using your code, but every time there is a change in a rule, I need to recompile. Is there any workaround?

This is obviously only a quick correction. Jonathan assumed the Count property would return the length of the match, which breaks when you combine operators like Ops.Klenee with a subsequent Ops.Seq, for example.

How is it possible to filter whitespace from the input text? This is absolutely necessary for any kind of scanning/parsing. As far as I have seen, the IFilter interface only has a method for converting the input to lower case. In comparison, with JFlex you have much more flexibility regarding scanning. Otherwise, with Spart, you have to include whitespace parsing in the grammar itself, and then not associate actions with it if you want it to be ignored, right? Which is quite cumbersome, provided you supply a grammar for a language with lots of operators and possible constructs.

Thank you for this work. It is very useful.
The calculator sample shows the parsing but does not compute the expression.
I enhanced the class with very little code to do this.
Look at the code below (I use VB.NET). It uses a .NET Stack object to push the numbers. When an arithmetic operation is parsed, it triggers an event to perform the operation. Since parsing operations are executed in the correct order with the stack, it works correctly!!

I have never seen Spirit, so I assume you have copied the design. I do think it's interesting that there is an abstract syntax tree hierarchy for the grammar; however, when it comes to parsing the tokens, nothing is really done with them when the parsing is successful. Hopefully I read the code correctly; two weeks off and Monday morning, it's difficult sometimes.

I do like the OOP style of parsing rather than the original way of one class. It always seemed to me that scalability was a factor, but the object approach of encapsulating a parse method away in its own class shows good design.

Top-down compilers, in my limited exposure to them, generally build an internal representation of the parse tree, pushing the concrete syntax up into the tree. The abstract syntax tree can then be visited multiple times using something known as the visitor pattern. The visitor pattern provides hooks into the AST which can be used to traverse the AST multiple times if required. This allows implementation of semantic checks, optimisation and code generation.

This also allows for the calculation of static FOLLOW, FIRST and LOOKAHEAD sets for the non-terminals. Using this method also means you can generate dynamic follow sets, allowing for follow-set error recovery.

Ben Coding Monkey wrote:however, when it comes to parsing the tokens nothing is really done with them when the parsing is successful.

That's where the actions come into the game. When a token/parser has a successful match, it triggers its action, if any.

Ben Coding Monkey wrote:The Abstract Syntax Tree then can be visited multiple times using something known as the visitor pattern.

Yes, I will be adding AST generation to Spart (at some indefinite point in the future). It already took me quite some work to get a graph library running in C# (see QuickGraph), but it will be a good tool for building ASTs.

In Spirit, you handle AST generation by specifying which rule should produce a new token, or start a new branch, etc...

Ben Coding Monkey wrote:This also allows for the calculation of static FOLLOW, FIRST and LOOKAHEAD sets for the Non terminals. Using this methods also means you can generate dynamic follow set allowing for follow set error recovery.

The follow set of a symbol is the set of symbols that can follow it: so, for each symbol in the right-hand sides of the non-terminals, what can come after it. That leaves us with the sets below; I have ignored the empty word and the termination symbol for clarity.

E = [ ")" ]
T = [ "x" ]
F = [ "+" ]

The lookahead sets are constructed from the first set of a non-terminal and the follow set of a non-terminal, just describing what could be expected from this non-terminal; I think.

These sets are static, though. We have derived them from grammar analysis, by looking at the grammar and evaluating them from the whole grammar. However, what if the sets were calculated on the fly, i.e. what is immediately expected, the immediate symbols to come? Could this then be used to recover from errors?

The answer is yes, but it's only an approximation. Error recovery can perform three operations: deletion, insertion or modification. Insertion is the easiest to explain. For example, we have the Pascal variable declaration line

var i : integer;

however our code reads

var i integer;

The parser has moved into the variable declaration method for parsing, has parsed the identifier 'i' and is expecting the concrete syntax of the colon ':'. It's not there: let's terminate, throw millions of exceptions and die; or we could just put one in and continue.

Modification can also be used; however, it is tricky, because this is where we are assuming something that could cause mistakes later, but here is an example.

var i . integer;

There is a full stop '.' where we expected a colon ':', so let's change it and continue. I am guessing you can see the problem with this, or the potential.

Deletion is the last resort, because why would we want to remove anything? Well, in fact this is why it's tricky. Basically, we have constructed an immediate set of what can follow, and the algorithm is to delete each token until a token is reached which matches one of the tokens in this set. There is a possibility of removing all tokens; however, an error has already been reached, so it's wrong anyway. This is why you sometimes get shedloads of errors when you have made one or two mistakes, because of the error recovery method; however, commercial compilers are generally bottom-up.
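The deletion strategy described above is small enough to sketch standalone (this is illustrative C#, not part of Spart):

```csharp
using System;
using System.Collections.Generic;

static class Recovery
{
    // Skip (delete) tokens until one is found that belongs to the
    // follow set, or the input is exhausted. Returns the index of
    // the first acceptable token.
    public static int SkipToFollowSet(IList<string> tokens, int pos,
                                      ICollection<string> followSet)
    {
        while (pos < tokens.Count && !followSet.Contains(tokens[pos]))
            pos++; // delete the unexpected token and keep scanning
        return pos;
    }
}
```

For the Pascal example above, skipping from the unexpected token with a follow set of { ";" } resumes parsing at the statement terminator.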

So for our example grammar we calculate the immediate follow sets, but this is for the right-hand side symbols, so even the terminals have follow sets:

E:
  T = [ "x" ]
  "x" = FirstSet(E)
T:
  F = [ "+" ]
  "+" = FirstSet(T)
F:
  "(" = FirstSet(E)
  E = [ ")" ]

Now that we know these, on accepting a token a check can be performed to see if the next token is in the expected set, and a recovery action is performed otherwise. However, when the sub-methods for non-terminals are called, the current immediate follow set must be passed down so that the immediate symbols of that set are not removed.

Sometimes, however, deletion error recovery should stop at specific tokens that are not contained in the immediate follow set; this is known as the stopping set, and again it is an additional set passed to the methods.

This method is known as follow-set error recovery and was devised by Niklaus Wirth in the book "Algorithms + Data Structures = Programs".

This kind of stuff is where multiple passes over the AST are required. In these passes, further information is added to the tree, so decoration of the AST is performed. The reason for this is that a future pass may require some of that information: semantic passes, for example, so that types can be distinguished.

An interesting alternative to follow-set error recovery is heuristics-based error recovery, based on the concepts of synchronization points and weak symbols. This method is used (for example) in Wirth's Oberon compiler and is integrated into the compiler generator Coco/R.

Information on heuristics-based error recovery and Coco/R for various platforms can be found at the University of Linz, just in case you'd like to have a look after the coffee.

PS -- A great and unorthodox approach to the parsing topic.
I am looking forward to more.