A quick intro to writing a parser with Treetop

Treetop is a Ruby library that lets you create parsers by describing them with a Parsing Expression Grammar (PEG). Writing a parser with Treetop is a fairly pain-free process, but getting started can be non-trivial, especially if you’re not familiar with PEGs, so this is going to be a fairly short ‘getting started’ guide. To keep things simple, our example parser is going to process a subset of S-Expression syntax.

Getting Started

The first thing we need to do is create skeletons for the three files involved:

The Ruby class Parser, which will contain the API for interacting with the parser

The Treetop grammar file, in which we will create all the rules that define how our parser will behave

A series of simple classes that subclass Treetop::Runtime::SyntaxNode; these will allow us to easily describe our parsed structure by having each type of entity map directly to a custom syntax node class

    # In file parser.rb
    require 'treetop'

    # Find out what our base path is. This is a constant rather than a
    # local variable because a local defined here would not be visible
    # inside the body of the Parser class below.
    BASE_PATH = File.expand_path(File.dirname(__FILE__))

    # Load our custom syntax node classes so the parser can use them
    require File.join(BASE_PATH, 'node_extensions.rb')

    class Parser

      # Load the Treetop grammar from the 'sexp_parser' file, and
      # create a new instance of that parser as a class variable
      # so we don't have to re-create it every time we need to
      # parse a string
      Treetop.load(File.join(BASE_PATH, 'sexp_parser.treetop'))
      @@parser = SexpParser.new

    end

    # In file sexp_parser.treetop
    grammar Sexp

    end

    # In file node_extensions.rb
    module Sexp

    end

Basic Parser API & Error Reporting

Our API is going to be very simple: the Parser class will have only one public method, parse. Error reporting will also be minimal for the moment. However, if you are making a parser that will be used widely, this is definitely not an area you should skimp on. Inscrutable error messages can make an otherwise well-written parser very frustrating to deal with.

    # In file parser.rb
    class Parser

      ...

      def self.parse(data)
        # Pass the data over to the parser instance
        tree = @@parser.parse(data)

        # If the AST is nil then there was an error during parsing
        # we need to report a simple error message to help the user
        if(tree.nil?)
          raise Exception, "Parse error at offset: #{@@parser.index}"
        end

        return tree
      end
    end

Abstract Syntax Tree Nodes

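In order to build a meaningful tree structure from our input we need to be able to distinguish between different types of nodes, for example integer literals, string literals and identifiers. We do this by defining a custom subclass of Treetop::Runtime::SyntaxNode for each type of entity in node_extensions.rb.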

The Parser Rules

Now that we have all of our other components in place we can start actually defining our PEG rules. We’ll start with the smallest parts: Identifier, FloatLiteral, StringLiteral, space and IntegerLiteral.
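As a rough sketch, rules for these primitives might look something like the following in sexp_parser.treetop. The rule names (integer, float, string, identifier) and the exact character classes are assumptions chosen for illustration; only the node class names come from the text, and the string rule deliberately ignores escape sequences.

```treetop
# In file sexp_parser.treetop (illustrative sketch, not a definitive grammar)
grammar Sexp

  # An optional sign followed by one or more digits
  rule integer
    ('+' / '-')? [0-9]+ <IntegerLiteral>
  end

  # An integer part followed by a mandatory fractional part
  rule float
    ('+' / '-')? [0-9]+ '.' [0-9]+ <FloatLiteral>
  end

  # Anything between double quotes (escape sequences are not handled here)
  rule string
    '"' [^"]* '"' <StringLiteral>
  end

  # A letter followed by letters, digits or underscores
  rule identifier
    [a-zA-Z] [a-zA-Z0-9_]* <Identifier>
  end

  # One or more whitespace characters; no custom node class is attached
  rule space
    [\s]+
  end

end
```

Note that Treetop treats the first rule in a grammar as the root rule, so a complete grammar would normally put its top-level rule before these.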

Each rule above comes in two parts: first the regular expression that defines what this entity looks like, and second what type of syntax node it should be turned into. Treetop looks inside the <...> and matches the name up with the custom syntax nodes that we defined earlier.

Rules in Treetop are written using what amounts to a superset of regular expressions; more information can be found at the Treetop website.

You might be wondering why we have a PEG rule for space. It is there so that we can allow whitespace between entities in the input string. Without it, we would have to be very strict about how source files were laid out.

Next, let’s define the rules for expressions and bodies (the inner part of an expression). These build upon the rules we have already defined and are an example of what makes Treetop so great: it allows you to write a larger rule by composing sets of smaller rules. This re-use makes PEG-based parsers like those built with Treetop very easy to maintain and expand.
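A sketch of how these two composite rules might be written is shown here. The rule names expression and body, and the names of the primitive rules they refer to, are assumptions for illustration; the node classes Expression and Body are the ones mentioned in this article.

```treetop
# In file sexp_parser.treetop (illustrative sketch)
grammar Sexp

  # The first rule in the grammar is the root rule:
  # optional whitespace, an opening paren, a body, a closing paren
  rule expression
    space? '(' body ')' space? <Expression>
  end

  # Zero or more nested expressions, primitives or runs of whitespace.
  # PEG choice is ordered, so float is listed before integer; otherwise
  # the "3" in "3.08" would be consumed as an integer and the parse would fail.
  rule body
    (expression / identifier / float / integer / string / space)* <Body>
  end

  # ... the primitive rules (identifier, float, integer, string, space) go here ...

end
```

Because body refers back to expression, nested S-Expressions come for free from the composition of the two rules.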

This method is going to traverse the entire tree and strip out any nodes that are not one of our custom classes, so that we are left only with the nodes we care about. Let’s try our example again and see what we get:
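The traversal described above can be sketched in plain Ruby. To keep the example runnable without the gem, it uses a stand-in class, GenericNode, in place of Treetop::Runtime::SyntaxNode; the helper name clean_tree and its generic_class parameter are hypothetical, not part of Treetop.

```ruby
module Sexp
  # Stand-in for Treetop::Runtime::SyntaxNode so this sketch runs without the gem.
  # As in Treetop, `elements` is nil for a terminal (leaf) node.
  class GenericNode
    attr_reader :text_value, :elements

    def initialize(text_value, elements = nil)
      @text_value = text_value
      @elements = elements
    end
  end

  # Custom node classes, mirroring the names used in this article
  class Expression < GenericNode; end
  class Body < GenericNode; end
  class Identifier < GenericNode; end
  class IntegerLiteral < GenericNode; end
end

# Hypothetical helper: recursively delete every child whose class is exactly
# `generic_class` (subclasses are kept), then return the cleaned node.
def clean_tree(node, generic_class = Sexp::GenericNode)
  return node if node.elements.nil?

  node.elements.delete_if { |child| child.instance_of?(generic_class) }
  node.elements.each { |child| clean_tree(child, generic_class) }
  node
end

# A hand-built tree shaped roughly like a parse of "(add 1)"
tree = Sexp::Expression.new('(add 1)', [
  Sexp::GenericNode.new('('),
  Sexp::Body.new('add 1', [
    Sexp::Identifier.new('add'),
    Sexp::GenericNode.new(' '),
    Sexp::IntegerLiteral.new('1')
  ]),
  Sexp::GenericNode.new(')')
])

clean_tree(tree)
p tree.elements.map(&:class)                # only the Body node is left
p tree.elements.first.elements.map(&:class) # Identifier and IntegerLiteral remain
```

With a real Treetop tree the same idea would apply with Treetop::Runtime::SyntaxNode passed as generic_class. One caveat of this approach is that a generic node is removed together with everything beneath it, so any grouping that contains custom nodes needs a custom class of its own.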

Much better! Now we have a nice clean representation of our input. To be honest, though, for most cases this is still more information than we need. Let’s alter our custom syntax node classes so they can output a simpler representation of themselves using nested arrays and native Ruby data types. We can achieve this by giving each node a to_array method and using the delegation pattern:
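One way the to_array delegation could look is sketched below. So that it runs without the gem, the classes inherit from a stand-in Node class rather than Treetop::Runtime::SyntaxNode, and the tree is built by hand in its already-cleaned form; the choice to turn identifiers into symbols and to strip the quotes from strings is an assumption made for this illustration.

```ruby
module Sexp
  # Stand-in for Treetop::Runtime::SyntaxNode so this sketch runs without the gem
  class Node
    attr_reader :text_value, :elements

    def initialize(text_value, elements = [])
      @text_value = text_value
      @elements = elements
    end
  end

  # Primitive nodes know how to return a simple version of themselves
  class IntegerLiteral < Node
    def to_array
      text_value.to_i
    end
  end

  class FloatLiteral < Node
    def to_array
      text_value.to_f
    end
  end

  class StringLiteral < Node
    # Drop the surrounding double quotes
    def to_array
      text_value[1..-2]
    end
  end

  class Identifier < Node
    def to_array
      text_value.to_sym
    end
  end

  # Complex nodes delegate to their children
  class Body < Node
    def to_array
      elements.map(&:to_array)
    end
  end

  class Expression < Node
    # Assumes the tree has been cleaned, so Body is the only child
    def to_array
      elements.first.to_array
    end
  end
end

# Hand-built, already-cleaned tree for: (show "hi" (add 1 2.5))
inner = Sexp::Expression.new('(add 1 2.5)', [
  Sexp::Body.new('add 1 2.5', [
    Sexp::Identifier.new('add'),
    Sexp::IntegerLiteral.new('1'),
    Sexp::FloatLiteral.new('2.5')
  ])
])
tree = Sexp::Expression.new('(show "hi" (add 1 2.5))', [
  Sexp::Body.new('show "hi" (add 1 2.5)', [
    Sexp::Identifier.new('show'),
    Sexp::StringLiteral.new('"hi"'),
    inner
  ])
])

result = tree.to_array
p result # prints [:show, "hi", [:add, 1, 2.5]]
```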

You can see here that each complex node (Expression, Body) delegates the job of producing the array down to its children, and each primitive child node (StringLiteral, FloatLiteral, Identifier) knows how to return a simple version of itself. Let’s add some code to the Parser class to use this and then try it out:

While the example here is fairly simplistic, it touches all the parts needed to make a more complex parser using Treetop. You can have a look at the completed code from this example at GitHub. If you want a more involved example, you can check out the reference parser for the Koi programming language, which is also written using Treetop.