I am happy to announce initial ports of PetitParser to Java and Dart. The repositories contain complete and usable ports of the original Smalltalk implementation, unit tests and example grammars.

The three implementations are very similar in their API and the supported features. If possible, I tried to adapt common practices and features of the target language. In this blog post I like to point out some differences and design decisions I took.

Production actions

Due to the lack of closures the Java solution gets a bit bulky. Hopefully this issue will finally be fixed with Java 8. The type declarations are quite nice as documentation of input and output values of the production action:

Composite grammars

For the definition of complex grammars with possibly recursive rules you subclass CompositeParser in all three implementations:

The details of the Smalltalk implementation are explained in this blog post. Production rules are defined as methods and instance-variables of the same name. The parser constructor initializes all production rules to dummy parsers that are eventually replaced by the actual definitions using reflection. As a real-world example have a look at this JSON grammar.

The Java solution is very similar to the one in Smalltalk. To make things a bit more Java-ish, methods are annotated with @Production to mark them as a production. In the future, I plan to avoid the use of inner classes for production actions by introducing new annotation called @Action. Again there is a JSON grammar to look at.

Due to the (current) lack of reflection in Dart I had to come up with something else. For Dart I introduced a DSL that allows one to define productions, reference productions, redefine productions and assign production actions to existing productions. The implementation is simple and I like it much better than the others. The JSON grammar in Dart is pretty concise.

Performance

Comparing the performance of the implementations is interesting. For this preliminary evaluation I took the XML parser and parsed 2000 times a medium sized XML file. This is the ranking:

Java: 815ms (Java HotSpot 64-Bit 1.7)

Smalltalk: 4069ms (Pharo 1.3 on Cog)

Dart: 6156ms (Dart SDK version 13851, M1)

Dart: 7048ms (Dart SDK version 8638)

Dart: 14332ms (Dart SDK version 8344)

Interestingly Java is doing really well, even if I use auto-boxed characters everywhere and didn’t care to optimize the implementation yet.

The Smalltalk implementation is about 4 times slower. With this implementation I worked the longest and I doubt that it can be optimized further.

Dart lost this evaluation, but I am sure that the code can be further improved. I am suspecting that a problem is that Dart has no notion of characters and needs to represent them as integers or single character strings. Also, in Dart there is no efficient way to figure out if a character is a whitespace, a letter or a digit. To do this correctly I had to fall back to regular expressions, which likely kills an otherwise fast VM. The latest implementation works directly with character codes.

Availability

To sum up, the three implementations are open-source and available in their respective public repositories:

In a previous post I described the basic principles of PetitParser and gave some introductory examples. In this blog post I am going to present a way to define more complicated grammars. We continue where we left off the last time, with the expression grammar.

Writing parsers as a script as we did in the previous post can be cumbersome, especially if grammar productions that are mutually recursive and refer to each other in complicated ways. Furthermore a grammar specified in a single script makes it unnecessary hard to reuse specific parts of that grammar. Luckily there is PPCompositeParser to the rescue.

Defining the Grammar

As an example let’s create a composite parser using the same expression grammar we built in the last blog post:

Every production in ExpressionGrammar is specified as a method that returns its parser. Productions refer to each other by reading the respective instance variable of the same name. This is important to be able to create recursive grammars. The instance variables themselves are typically not written to as PetitParser takes care to initialize them for you automatically.

Next we define the productions term, prod, and prim. Contrary to our previous implementation we do not define the production actions yet; and we factor out the parts for addition (add), multiplication (mul), and parenthesis (parens) into separate productions. This will give us better reusability later on. We let Pharo automatically add the necessary instance variables as we refer to them for the first time.

ExpressionGrammar>>term^add/prod

ExpressionGrammar>>add^prod,$+asParsertrim,term

ExpressionGrammar>>prod^mul/prim

ExpressionGrammar>>mul^prim,$*asParsertrim,prod

ExpressionGrammar>>prim^parens/number

ExpressionGrammar>>parens^$(asParsertrim,term,$)asParsertrim

Last but not least we define the starting point of the expression grammar. This is done by overriding start in the ExpressionGrammar class:

ExpressionGrammar>>start^termend

Instantiating the ExpressionGrammar gives us an expression parser that returns a default abstract-syntax tree:

Domain-specific languages (DSLs) are increasingly used as embedded languages within general-purpose host languages. DSLs provide a compact, dedicated syntax for specifying parts of an application related to specialized domains. Unfortunately, such language extensions typically do not integrate well with existing development tools. Editors, compilers and debuggers are either unaware of the extensions, or must be adapted at a non-trivial cost. Furthermore, these embedded languages typically conflict with the grammar of the host language and make it difficult to write hybrid code; few mechanisms exist to control the scope and usage of multiple tightly interconnected embedded languages.

In this dissertation we present Helvetia, a novel approach to embed languages into an existing host language by leveraging the underlying representation of the host language used by these tools. We introduce Language Boxes, an approach that offers a simple, modular mechanism to encapsulate (i) compositional changes to the host language, (ii) transformations to address various concerns such as compilation and syntax highlighting, and (iii) scoping rules to control visibility of fine-grained language changes. We describe the design and implementation of Helvetia and Language Boxes, discuss the required infrastructure of a host language enabling language embedding, and validate our approach by case studies that demonstrate different ways to extend or adapt the host language syntax and semantics.