A handrolled language parser

In my previous post about building a custom lexer I mentioned that, for educational purposes, I created a simple toy programming language (still unnamed). There, I talked about building a tokenizer and lexer from scratch. In this post I’ll discuss building a parser that is responsible for generating an abstract syntax tree (AST) for my language. This syntax tree can then be passed to other language components such as a scope and type resolver, and finally an interpreter.

The parser I made is a recursive descent packrat parser that uses backtracking. Short of memoizing found ASTs, there aren’t any other real optimizations. The goal was to create a working parser, not a production parser to distribute or use (or reuse) in any professional sense. Like the lexer, this is an academic exercise to try and hit on some of the points covered by Terence Parr’s Language Implementation Patterns book, which I recently finished reading.

I’m not going to cover much language theory because I want to jump into what I did and how it’s implemented. There are a lot of resources on context-free grammars, LL parsers, left recursion, ambiguous vs unambiguous grammars, PEG (parsing expression grammars), shift-reduce parsing, parse tables, and other subjects related to parsing and language implementation on the internet. I’ll leave explanations of those topics for someone else who is more qualified than me. But, if you are interested and are new to it (like myself), starting with Parr’s book is a great first step and helps clear up some of the theoretical haze that surrounds a lot of programming language theory.

Syntax definitions

Every language is defined by a grammar, where the syntax represents the rules of the grammar. The grammar of my language grew organically; I didn’t go into it with any specific syntax in mind. I knew I wanted to have lambdas, the var keyword, and simple stuff like if, else, while, for, function declarations, and class declarations. When I wrote the parser I just started with some basic syntax and added to it as I wanted more functionality. I think most people give their languages a bit more thought than I did, but all’s well that ends well.

In general, you can represent your grammar using BNF. Let’s define a really simple language:

word. This is any group of one or more characters with no spaces. foo, for example, would be a word

number. This is any number of digits (no decimals). Something like 1234

operator. This is a plus sign or a minus sign

token. Either a word or a number

expression. This is either a token or a token with an operator followed by an expression. This example here is what is called right recursive, since the expression references itself on the right hand side of the operation. Recursive descent parsers can’t handle left recursion since it leads to infinite loops.

variable declaration. This is the keyword var followed by a word followed by an = followed by an expression

statement. Either a variable declaration or some expression

if statement. The keyword if followed by a ( followed by an expression, followed by a ) followed by a { followed by one or more statements, followed by a }
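Written out in a loose BNF-style notation (my own spelling of those bullet points, not any formal standard), the grammar might look like this:

```
word         ::= character+
number       ::= digit+
operator     ::= "+" | "-"
token        ::= word | number
expression   ::= token operator expression | token
var-declr    ::= "var" word "=" expression
statement    ::= var-declr | expression
if-statement ::= "if" "(" expression ")" "{" statement+ "}"
```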

If you wanted to, you could feed this general grammar (modified syntactically) to libraries that auto-generate parsers for you, but that’s no fun and feels like cheating when you are learning.

The Parser’s Job

The goal of the parser is to take a strongly typed token stream from the lexer, and to create a syntax tree that we can use. A simple way to think about it is that each bullet point in our grammar can be a class. Imagine we have a class called IfStatement. It might look something like this:
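The real class is C#, but a minimal sketch of it (in Python, with field names of my own choosing) could be:

```python
class IfStatement:
    """Syntax tree node: a predicate plus the statements to run when it's true."""
    def __init__(self, predicate, statements):
        self.predicate = predicate      # the expression between the parentheses
        self.statements = statements    # everything between the braces

# build a node for: if (x > 0) { print x }
node = IfStatement(predicate="x > 0", statements=["print x"])
```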

We don’t really care about the keyword if or any of the other special characters like (, ), {, and }, since the important part is that we now have a class that describes what the if statement meant: a predicate, and a list of statements to execute if the predicate is true. This type of class is what is known as a syntax tree node. It’s a tree because it references other syntax nodes. Here is an image showing what a syntax tree could look like (image taken from Wikipedia)

When you’re done parsing, you will have a root node that references the entire structure of your program. Later parts of the language chain (building out scope, types, and interpreting the code) will go through these definitions and actually evaluate what they mean. Those later passes might also add extra metadata to each node, so the AST is like our master repo of program-related metadata. For now the goal is to create these classes.

Capturing Information

Since the parser’s goal is to create these classes, it needs to be able to work on an underlying token stream and create meaningful representations from that stream. The parser knows what kinds of syntactical patterns it expects. For example, take the following variable declaration and assignment:

int x = 5;

We can tell it’s a variable declaration and assignment because it matches the pattern of a type keyword, followed by a word, an equals sign, an expression, and a semicolon.

The parser’s job is to take those tokens in meaningful orders and create an AST from them. When I say “take”, I mean you remove the current token from the head of the stream (or advance the token stream’s index pointer). Let’s say we are parsing that variable declaration above; the parser might see a token stream that looks like this:
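For `int x = 5;`, the stream out of the lexer might look something like this (the token type names here are illustrative, not my actual enum values):

```python
# (token type, raw text) pairs as they come out of the lexer
token_stream = [
    ("Int", "int"),        # a keyword that is also a valid variable type
    ("Word", "x"),
    ("Equals", "="),
    ("Number", "5"),
    ("SemiColon", ";"),
]
```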

We see that the head of the stream is a keyword that can be a valid variable type (int), so we can take it off the list and store it. Then we expect the pattern “word”, “equals”, “expression”, “semicolon”. We take them one at a time and, while the pattern matches, keep going. Certain items, like the semicolon, you can trash. It’s there to tell the parser when to stop.
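A hedged sketch of that take-and-match loop, assuming a simple list-backed stream and made-up token names (a real parser would call out to a full expression rule instead of expecting a single number):

```python
def parse_var_declr(tokens):
    """Consume: type keyword, word, equals, expression, semicolon."""
    stream = list(tokens)  # copy so we can pop from the head

    def take(expected):
        token_type, text = stream.pop(0)
        if token_type != expected:
            raise SyntaxError(f"expected {expected}, got {token_type}")
        return text

    var_type = take("Int")
    name = take("Word")
    take("Equals")           # structural only, discarded
    value = take("Number")   # simplification: a full parser parses an expression
    take("SemiColon")        # tells us when to stop; also discarded
    return (var_type, name, value)

result = parse_var_declr([("Int", "int"), ("Word", "x"), ("Equals", "="),
                          ("Number", "5"), ("SemiColon", ";")])
```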

Alternatives

Sometimes, however, you can’t determine what an expression will be just by looking at the current token. For example, what does this mean if you only look at the first element?

1 + 1

Is it a token of value 1? Or is it an expression of 1 + 1? Obviously it’s 1 + 1, but the parser can’t always tell. Careful ordering of your parser can avoid most of these ambiguities, but when it can’t, you can either peek into the stream (so you see that the next token is a + so that means expression), or simply try alternatives. The first alternative to match wins!

For alternatives, you first try an expression. If that fails, then you try a token. If that fails too, it’s invalid syntax. Remember that I mentioned my parser is a packrat parser? All that means is that while it’s trying alternatives it caches what it finds. Later, when I actually go and try to take a certain branch, I can retrieve the already parsed AST from the cache. This cuts down on a lot of extra work.
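Trying alternatives with backtracking can be sketched like this (the helper names and the tuple AST are my own stand-ins, and the caching part is left out here):

```python
class Stream:
    """Tiny token stream with an index we can snapshot and restore."""
    def __init__(self, tokens):
        self.tokens = tokens
        self.index = 0

def parse_alternatives(stream, alternatives):
    """Try each alternative in order; the first one to match wins."""
    for alternative in alternatives:
        saved_index = stream.index          # snapshot before trying
        result = alternative(stream)
        if result is not None:
            return result                   # matched: keep the stream position
        stream.index = saved_index          # failed: backtrack, try the next one
    raise SyntaxError("no alternative matched")

def try_expression(stream):
    """Match 'token operator token'; returning None means the route failed."""
    if len(stream.tokens) - stream.index >= 3 \
            and stream.tokens[stream.index + 1] in ("+", "-"):
        left, op, right = stream.tokens[stream.index:stream.index + 3]
        stream.index += 3
        return (op, left, right)
    return None

def try_token(stream):
    token = stream.tokens[stream.index]
    stream.index += 1
    return token

result = parse_alternatives(Stream(["1", "+", "1"]), [try_expression, try_token])
# result is ("+", "1", "1")
```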

The Token Stream

In the last post about the lexer, I created a TokenizableStreamBase base class that handles basic snapshot/commit/rollback/consume functionality on an input stream. Here I’m going to re-use it and pass it a stream of tokens instead of a stream of characters. The parser will instantiate this ParseableTokenStream class (which subclasses the tokenizable stream base) and use it as its token stream.

It takes the lexer, lexes the tokens, and creates an underlying token stream that we can do snapshots on. We also have methods to test if the current item on the stream is a specific token type (defined by a known enum):
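The real class is C# and lives in the repo; the shape is roughly this (a Python sketch where the method names and the fake lexer are my stand-ins):

```python
class FakeLexer:
    """Stand-in for the real lexer from the previous post."""
    def lex(self):
        return [("Word", "foo"), ("SemiColon", ";")]

class ParseableTokenStream:
    """Wraps the lexed tokens with index-based consumption plus snapshots."""
    def __init__(self, lexer):
        self.tokens = lexer.lex()   # lex everything up front
        self.index = 0
        self.snapshots = []

    def current(self):
        return self.tokens[self.index]

    def is_match(self, token_type):
        """Is the current item on the stream a specific token type?"""
        return self.current()[0] == token_type

    def take(self, token_type):
        """Validate the current token's type, return it, and advance."""
        if not self.is_match(token_type):
            raise SyntaxError(f"expected {token_type}, got {self.current()[0]}")
        token = self.current()
        self.index += 1
        return token

    def take_snapshot(self):
        self.snapshots.append(self.index)

    def rollback(self):
        self.index = self.snapshots.pop()

    def commit(self):
        self.snapshots.pop()

stream = ParseableTokenStream(FakeLexer())
```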

The parsing stream base also lets us “take” a specific token. If you remember from the last post, all that consuming a lexable item does is advance the internal array index. The important part is that after we Consume, we’ve advanced to the next token in the token stream.

You’ll see in my parser that sometimes I use the Take return value, and sometimes it’s discarded. This is intentional. Even if you don’t intend to use a token in the syntax tree (like a semicolon) you still have to acknowledge that it was part of the expected pattern and advance the token stream.

We can also try an alternate route. If the route function returns a non-null syntax tree we’ll assume the route succeeded and cache it. Later requests for getting syntax trees at that current index will first check the cache before trying to re-build the tree (if it needs to):
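A sketch of how that caching might work. I’m assuming a cache keyed by rule name and start index that stores both the tree and where it ended; the real C# details are in the repo:

```python
class MemoizingStream:
    """Token stream with a packrat-style cache of parse results."""
    def __init__(self, tokens):
        self.tokens = tokens
        self.index = 0
        self.cache = {}  # (rule name, start index) -> (ast, end index) or None

    def get_or(self, rule_name, rule):
        """Run a rule at the current index, memoizing its result."""
        key = (rule_name, self.index)
        if key in self.cache:
            cached = self.cache[key]
            if cached is None:
                return None              # we already know this rule fails here
            ast, end_index = cached
            self.index = end_index       # seek directly past the cached span
            return ast
        start = self.index
        ast = rule(self)
        if ast is None:
            self.cache[key] = None
            self.index = start           # rollback on failure
        else:
            self.cache[key] = (ast, self.index)
        return ast
```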

The underlying stream in my parser is an array, so the inherited Index property keeps track of where we are in the stream. When we return a memoized syntax tree, we can seek the stream to the index directly after the last memoized token. This means we can easily jump around in our parser stream. Hopefully this makes sense, because if we returned a cached syntax tree that spanned token items 1 through 15, we should jump immediately to token index 16 and continue parsing from there.

For small parsing this works well, but obviously wouldn’t scale with large programs. Still, to change it so that we work on a buffered section of an infinite stream wouldn’t be that much work, and in the end wouldn’t modify how the actual parser behaves. This is all hidden in the shared base class (so the lexer would also improve).

Finally, Parsing

First, to tie in the section above, here is the constructor of the parser:
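Roughly this (a Python stand-in for the C# original; the stream class here is stubbed down to its constructor):

```python
class ParseableTokenStream:
    """Minimal stand-in: lexes everything up front and tracks an index."""
    def __init__(self, lexer):
        self.tokens = lexer.lex()
        self.index = 0

class Parser:
    def __init__(self, lexer):
        # the parser owns a seekable token stream built from the lexer
        self.token_stream = ParseableTokenStream(lexer)

class FakeLexer:
    def lex(self):
        return [("Word", "foo")]

parser = Parser(FakeLexer())
```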

Next, I’ve defined a few syntax tree classes that the parser will use:
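A rough Python sketch of that hierarchy (the real C# classes are on GitHub; the subclass and field names below are illustrative, only Ast comes from the actual code):

```python
class Ast:
    """Base class for every syntax tree node; shared metadata lives here."""
    def __init__(self, token=None):
        self.token = token

class Expr(Ast):
    """A binary expression: left operand, operator, right operand."""
    def __init__(self, left, operator, right):
        super().__init__()
        self.left, self.operator, self.right = left, operator, right

class VarDeclr(Ast):
    """A variable declaration with an optional initial value."""
    def __init__(self, name, value=None):
        super().__init__()
        self.name, self.value = name, value

class Conditional(Ast):
    """An if statement: predicate plus body."""
    def __init__(self, predicate, body):
        super().__init__()
        self.predicate, self.body = predicate, body
```

Because everything derives from Ast, the parser's methods can all return the base type and callers don't need to care which node they got back.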

All of the syntax tree containers inherit from the base class Ast. This makes working with syntax trees in the parser easy because everything can be passed around as the base, and it means we can extend the metadata that syntax trees have just by adding to the base class. Hopefully most of the classes (if statement, while loop, method declaration, class dereference) are self-explanatory just by name. If you’re interested in class details you can go to the GitHub repo and check them out. Suffice it to say that they look kind of like the if statement class I pseudocoded earlier.

As an example, let me show one that I reused a lot. The ScopeDeclr AST gets created anytime the parser encounters a { followed by some statements, terminated by }.
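In sketch form (the original is C#; the ScopeDeclr and ScopedStatements names come from the actual code, the rest is my stand-in):

```python
class Ast:
    """Base syntax tree node."""

class ScopeDeclr(Ast):
    """Created whenever the parser encounters '{ statements }'."""
    def __init__(self, statements):
        self.ScopedStatements = statements  # statements found in the block

scope = ScopeDeclr(["statement one", "statement two"])
```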

ScopedStatements is a list of statements that are found in the scoped block.

I used the ScopeDeclr syntax tree to hold the root node of the entire application. This is because I considered the global scope (starting at the root) to be, well, a scope. The ScopeDeclr also turned out to be extremely useful when building out partial curried functions, a subject I’ll cover in the next post about scope and type definitions.

A statement can either be a class, a method declaration, or an inner statement. I didn’t want to have to put semicolons after class and method definitions, so I don’t test for a semicolon there. I also wanted semicolons to be optional: if we can unambiguously determine the grammar without a semicolon, great; otherwise we use one to terminate a statement if it’s there. Though in reality you need to put in semicolons or the parser will barf. Call it a language quirk.

Here is an inner statement. These are statements I considered valid within scopes such as method declarations, global scope, or inside of classes.
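One of those inner statements is new. A hedged sketch of what parsing it might look like (the stream helpers and token names are stand-ins, and real arguments would be full expressions rather than bare words):

```python
class Stream:
    def __init__(self, tokens):
        self.tokens, self.index = tokens, 0
    def is_match(self, token_type):
        return (self.index < len(self.tokens)
                and self.tokens[self.index][0] == token_type)
    def take(self, token_type):
        assert self.is_match(token_type)
        token = self.tokens[self.index]
        self.index += 1
        return token

def parse_new(stream):
    """Parse 'new <word>(<args>)' into a tuple standing in for a NewAst node."""
    if not stream.is_match("New"):
        return None
    stream.take("New")                        # consume the 'new' keyword
    name = stream.take("Word")[1]             # the thing being constructed
    stream.take("OpenParen")
    args = []
    while not stream.is_match("CloseParen"):  # comma-delimited argument list
        args.append(stream.take("Word")[1])   # simplification: words only
        if stream.is_match("Comma"):
            stream.take("Comma")
    stream.take("CloseParen")
    return ("new", name, args)

result = parse_new(Stream([("New", "new"), ("Word", "Foo"), ("OpenParen", "("),
                           ("Word", "a"), ("Comma", ","), ("Word", "b"),
                           ("CloseParen", ")")]))
```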

We test if the current token is of type TokenType.New and, if so, consume it. Then we expect an expression (the word thing), and then a comma-delimited list of arguments. There’s no semicolon because this new statement is part of a larger sequence of statements which will contain a reference to this new on the tree. We don’t really know, or care, if the statement is part of a variable declaration, a print statement, a function call, or whatever, as long as it’s valid in the grammar.

Hopefully you can see now how this all continues on. GetStatementsInScope pulls all the semicolon-delimited statements between a left brace and a right brace and returns a scope declaration block with them inside.
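In sketch form (again with stand-in helpers; the statement rule is injected so the mutual recursion is visible):

```python
class Stream:
    def __init__(self, tokens):
        self.tokens, self.index = tokens, 0
    def is_match(self, token_type):
        return (self.index < len(self.tokens)
                and self.tokens[self.index][0] == token_type)
    def take(self, token_type):
        assert self.is_match(token_type)
        token = self.tokens[self.index]
        self.index += 1
        return token

def get_statements_in_scope(stream, parse_statement):
    """Pull all statements between '{' and '}' into a scope block."""
    stream.take("OpenBrace")                  # the braces themselves are trashed
    statements = []
    while not stream.is_match("CloseBrace"):
        statements.append(parse_statement(stream))
    stream.take("CloseBrace")
    return ("scope", statements)              # stand-in for a ScopeDeclr node

def parse_word_statement(stream):
    """Toy statement rule: a word terminated by a semicolon."""
    word = stream.take("Word")[1]
    stream.take("SemiColon")
    return word

scope = get_statements_in_scope(
    Stream([("OpenBrace", "{"), ("Word", "a"), ("SemiColon", ";"),
            ("Word", "b"), ("SemiColon", ";"), ("CloseBrace", "}")]),
    parse_word_statement)
```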

Expressions

I wanted to dedicate a specific section to parsing expressions because I struggled with this. These are expressions like (1 + 1) + 2 + 3 or (1 + 1) + (1 - a).

I’ll be truthful here: I didn’t think expressions through thoroughly before I started. For every pattern I was able to match, I exposed one that I couldn’t. At one point I ran into a bunch of left-recursion issues. In the end, expressions, as I’ve “defined” them, look like this:
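Reconstructed in the same loose notation as earlier (this is my after-the-fact summary of the code, so treat it as a sketch):

```
expression ::= "(" expression ")" operator expression
             | "(" expression ")"
             | terminal operator expression
             | terminal

terminal   ::= new statement | function call | single operand
             | operand operator operand | ...
```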

This was mostly figured out through trial and error, some pen and paper diagrams, extensive unit tests, and a lot of head scratching. Honestly, out of the whole parser this is what took the longest to get right (at least right enough).

What I did to avoid left recursion, I later realized, looks similar to what Wikipedia suggests, which is to create a new intermediate nonterminal. This is the subset of specific terminals I called terminal in the BNF above. So I’m not just matching on expression operator expression, since that would recurse endlessly (assuming tail call optimization) or, more likely, just blow up my stack and crash.

The expression parsing code, in the end, matches expressions of the following formats (for example). I made all the examples use a plus sign because I was lazy – any available operator works in any ordering (these cases are from my expression-testing unit tests):
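They cover shapes along these lines (reconstructed illustrations in the spirit of those tests, not the literal cases):

```
1
1 + 2
1 + 2 + 3 + 4
(1 + 1)
(1 + 1) + 2 + 3
(1 + 1) + (1 - a)
```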

What we’re doing here is splitting up the operation into 3 different sections

Terminal. This is the first case. If it’s a terminal, parse and return. Terminals aren’t just single tokens; they are terminal expression types (like new, function calls, single operands, simple operations like operand operator operand, etc.)

Expressions inside of parenthesis. If we have something like (1 + 1), take the parenthesis out and parse the expression.

Expressions inside of parenthesis, followed by an operator, followed by an expression. If we have (1 + 1) + (1 - a), or (1 + 1) + 2 + 3, take the first section, then the operator, then try the next section

If we have a valid left operand we can parse a basic expression that is of the form

terminal | terminal operator expression

This is right recursive! Sweet, no recursion issues. If you didn’t catch why earlier, check out this stack overflow question. The ordering of parsing here matters: I am parsing from most terms to least terms. If I switched the order (terminal first, then expression), the parser would break, since we’d run into the alternative issue I mentioned in a section above.
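To tie the three sections together, here’s a heavily simplified sketch (operands are single tokens, only + and - exist, and it returns nested tuples instead of real AST classes):

```python
def parse_expression(tokens, pos=0):
    """expression ::= (group | terminal) [operator expression]
    Returns (ast, next position)."""
    if tokens[pos] == "(":
        # a parenthesized sub-expression: strip the parens and recurse
        left, pos = parse_expression(tokens, pos + 1)
        assert tokens[pos] == ")", "expected closing paren"
        pos += 1
    else:
        # terminal case: here, just a single operand token
        left, pos = tokens[pos], pos + 1

    # operator followed by the rest of the expression: right recursion
    if pos < len(tokens) and tokens[pos] in ("+", "-"):
        operator = tokens[pos]
        right, pos = parse_expression(tokens, pos + 1)
        return (operator, left, right), pos

    return left, pos

ast, _ = parse_expression(["(", "1", "+", "1", ")", "+", "2"])
# ast == ("+", ("+", "1", "1"), "2")
```

Because the recursion happens on the right, every call consumes at least one token before recursing, which is exactly why left recursion is off the table.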

Conclusion

One thing I didn’t like about how my parser turned out is that the entire parsing code ended up in one class. I couldn’t think of a clean way to separate it all out, since we have lots of mutual recursion going on. Each function needed to know about the others. I thought about passing around the lexer as state (functional style), but when I tried that it ended up messier than I had hoped.

Anyways, it’s a lot to take in, and I showed a bunch of code, some of which is out of context. Make sure to go to the GitHub repo if you want to poke around some more. After doing the parser, I wouldn’t do one again by hand. It’s pretty tedious, and it took me two frustrating weeks to get it all right (thanks for bearing with me, Tracy!). This is why you would absolutely use a pre-rolled lexing/parsing solution to get you to your abstract syntax trees. Here is a list of parser generators, and I’m sure there are tons not listed. Lots of smart people have built these, and they will save you plenty of time in the long run.

Like the exercise with the lexer, I’m glad I did it and I have a much better appreciation for libraries that do this all for you.

The next step (and post) will be to determine proper scoping rules, static typing (and type validation), and add some extra neat features like partial functions to our language. This is going to be done using a scope builder that will use the visitor pattern to iterate over our syntax tree. Having the syntax tree means we can finally start doing interesting stuff and making language decisions.

Disclaimer

If I got something wrong in the post please let me know! Like I’ve mentioned before I’m new at this and just sharing my findings.
