About Me

I'm a software engineer in New York City. I created the the Gypsum programming language, and I have worked on optimizing the V8 JavaScript engine for mobile phones. Most of my writing focuses on compilers and programming languages. I'm also interested in 3D graphics, high performance computing, cryptography, and security.

A simple interpreter from scratch in Python (part 3)

So far, we've written a lexer and a parser combinator library for our interpreter. In this part, we'll create the abstract syntax tree (AST) data structures, and we'll write a parser using our combinator library which will convert the list of tokens returned by the lexer into an AST. Once we have parsed the AST, executing the program will be very easy.

Defining the AST

Before we can actually start writing our parsers, we need to define the data structures that will be the output of our parser. We will do this with classes. Every syntactic element of IMP will have a corresponding class. Objects of these classes will represent nodes in the AST.

There are three kinds of structures in IMP: arithmetic expressions (used to compute numbers), Boolean expressions (used to compute conditions for if- and while-statements), and statements. We will start with arithmetic expressions, since the other two elements depend on them.

An arithmetic expression can take one of three forms:

Literal integer constants, such as 42

Variables, such as x

Binary operations, such as x + 42. These are made out of other arithmetic expressions.

We can also group expressions together with parenthesis (like (x + 2) * 3. This isn't a different kind of expression so much as a different way to parse the expression.

We will define three classes for these forms, plus a base class for arithmetic expressions in general. For now, the classes won't do much except contain data. We will include a __repr__ method, so we can print out the AST for debugging. All AST classes will subclass Equality so we can check if two AST objects are the same. This helps with testing.

Boolean expressions are a little more complicated. There are four kinds of Boolean expressions.

Relational expressions (such as x < 10)

And expressions (such as x < 10 and y > 20)

Or expressions

Not expressions

The left and right sides of a relational expression are arithmetic expressions. The left and right sides of an "and", "or", or "not" expression are Boolean expressions. Restricting the type like this helps us avoid nonsensical expressions like "x < 10 and 30".

Primitives

Now that we have our AST classes and a convenient set of parser combinators, it's time to write our parser. When writing a parser, it's usually easiest to start with the most basic structures of the language and work our way up.

The first parser we'll look at is the keyword parser. This is just a specialized version of the Reserved combinator using the RESERVED tag that all keyword tokens are tagged with. Remember, Reserved matches a single token where both the text and tag are the same as the ones given.

def keyword(kw):
return Reserved(kw, RESERVED)

keyword is actually a combinator because it is a function that returns a parser. We will use it directly in other parsers though.

The id parser is used to match variable names. It uses the Tag combinator, which matches a token with the specified tag.

id = Tag(ID)

The num parser is used to match integers. It works just like id, except we use the Process combinator (actually the ^ operator, which calls Process) to convert the token into an actual integer value.

num = Tag(INT) ^ (lambda i: int(i))

Parsing arithmetic expressions

Parsing arithmetic expressions is not particularly simple, but since we need to parse arithmetic expressions in order to parse Boolean expressions or statements, we will start there.

We'll first define the aexp_value parser, which will convert the values returned by num and id into actual expressions.

We're using the | operator here, which is a shorthand for the Alternate combinator. So this will attempt to parse an integer expression first. If that fails, it will try to parse a variable expression.

You'll notice that we defined aexp_value as a zero-argument function instead of a global value, like we did with id and num. We'll do the same thing with all the other parsers. The reason is that we don't want the code for each parser to be evaluated right away. If we defined every parser as a global, each parser would not be able to reference parsers that come after it in the same source file, since they would not be defined yet. This would make recursive parsers impossible to define, and the source code would be awkward to read.

Next, we'd like to support grouping with parenthesis in our arithmetic expressions. Although grouped expressions don't require their own AST class, they do require another parser to handle them.

process_group is a function we use with the Process combinator (^ operator). It just discards the parenthesis tokens and returns the expression in between. aexp_group is the actual parser. Remember, the + operator is shorthand for the Concat combinator. So this will parse '(', followed by an arithmetic expression (parsed by aexp, which we will define soon), followed by ')'. We need to avoid calling aexp directly since aexp will call aexp_group, which would result in infinite recursion. To do this, we use the Lazy combinator, which defers the call to aexp until the parser is actually applied to some input.

Next, we combine aexp_value and aexp_group using aexp_term. An aexp_term expression is any basic, self-contained expression where we don't have to worry about operator precedence with respect to other expressions.

def aexp_term():
return aexp_value() | aexp_group()

Now we come to the tricky part: operators and precedence. It would be easy for us just to define another kind of aexp parser and throw it together with aexp_term. This would result in a simple expression like:

1 + 2 * 3

being parsed incorrectly as:

(1 + 2) * 3

The parser needs to be aware of operator precedence, and it needs to group together operations with higher precedence.

We will define a few helper functions in order to make this work.

def process_binop(op):
return lambda l, r: BinopAexp(op, l, r)

process_binop is what actually constructs BinopAexp objects. It takes any arithmetic operator and returns a function that combines a pair of expressions using that operator.

process_binop will be used with the Exp combinator (* operator). Exp parses a list of expressions with a separator between each pair of expressions. The left operand of Exp is a parser that will match individual elements of the list (arithmetic expressions in our case). The right operand is a parser that will match the separators (operators). No matter what separator is matched, the right parser will return a function which, given the matched separator, returns a combining function. The combining function takes the parsed expressions to the left and right of the separator and returns a single, combined expression. Confused yet? We'll go through the usage of Exp shortly. process_binop is actually what will be returned by the right parser.

Next, we'll define our precedence levels and a combinator to help us deal with them.

any_operator_in_list takes a list of keyword strings and returns a parser that will match any of them. We will call this on aexp_precedence_levels, which contains a list of operators for each precedence level (highest precedence first).

precedence is the real meat of the operation. Its first argument, value_parser is a parser than can read the basic parts of an expression: numbers, variables, and groups. That will be aexp_term. precedence_levels is a list of lists of operators, one list for each level. We'll use aexp_precedence_levels for this. combine will take a function which, given an operator, returns a function to build a larger expression out of two smaller expressions. That's process_binop.

Inside precedence, we first define op_parser which, for a given precedence level, reads any operator in that level and returns a function which combines two expressions. op_parser can be used as the right-hand argument of Exp. We start by calling Exp with op_parser for the highest precedence level, since these operations need to be grouped together first. We then use the resulting parser as the element parser (Exp's left argument) at the next level. After the loop finishes, the resulting parser will be able to correctly parse any arithmetic expression.

E0 is the same as value_parser. It can parse numbers, variables, and groups, but no operators. E1 can parse expressions containing anything E0 can match, separated by operators in the first precedence level. So E1 could match a * b / c, but it would raise an error as soon as it encountered a + operator. E2 can match expressions E2 can match, separated by operators in the next precedence level. Since we only have two precedence levels, E2 can match any arithmetic expression we support.

Let's do an example. We'll take a complicated arithmetic expression, and gradually replace each part by what matches it.

We probably could have defined precedence in a less abstract way, but its strength is that it applies to any situation where operator precedence is a problem. We will use it again to parse Boolean expressions.

Parsing Boolean expressions

With arithmetic expressions out of the way, we can move on to Boolean expressions. Boolean expressions are actually simpler than arithmetic expressions, and we won't need any new tools to parse them. We'll start with the most basic Boolean expression, the relation:

process_relop is just a function that we use with the Process combinator. It just takes three concatenated values and creates a RelopBexp out of them. In bexp_relop, we parse two arithmetic expressions (aexp), separated by a relational operator. We use our old friend any_operator_in_list so we don't have to write a case for every operator. There is no need to use combinators like Exp or precedence since relational expressions can't be chained together in IMP like they can in other languages.

Next, we define the not expression. not is a unary operation with high precedence. This makes it much easier to parse than and and or expressions.

Here, we just concatenate the keyword not with a Boolean expression term (which we will define next). Since bexp_not will be used to define bexp_term, we need to use the Lazy combinator to avoid infinite recursion.

We define bexp_group and bexp_term pretty much the same way we did for the arithmetic equivalents. Nothing new here.

Next, we need to define expressions which include the and and or operators. These operators actually work the same way arithmetic operators do; both are evaluated left to right, and and has higher precedence.

Just like process_binop, process_logic is intended to be used with the Exp combinator. It takes an operator and returns a function which combines two sub-expressions into an expression using that operator. We pass this along to precedence, just like we did with aexp. Writing generic code pays off here, since we don't have to repeat the complicated expression processing code.

Parsing statements

With aexp and bexp finished, we can start parsing IMP statements. We'll start with the humble assignment statement.

You'll notice that both the if and while statements use stmt_list for their bodies rather than stmt. stmt_list is actually our top level definition. We can't have stmt depend directly on stmt_list, since this would result in a left recursive parser. And since we want if and while statements to be able to have compound statements for their bodies, we use stmt_list directly.

Putting it all together

We now have a parser for every part of the language. We just need to make a couple top level definitions:

def parser():
return Phrase(stmt_list())

parser will parse an entire program. A program is just a list of statements, but the Phrase combinator ensures that we use every token in the file, rather than ending prematurely if there is are garbage tokens at the end.

def imp_parse(tokens):
ast = parser()(tokens, 0)
return ast

imp_parse is the function that clients will call to parse a program. It takes the full list of tokens, calls parser, starting with the first token, and returns the resulting AST.