Introduction

In this post I will demonstrate how to parse and calculate an arithmetic expression using a general-purpose parser. When we're done, we'll be able to handle expressions such as 1 + 2 * -(-3+2)/5.6 + 3, and hopefully you'll have gained the tools to handle much more.

My motivation is to provide a simple and fun lesson in parsing and formal grammars, as well as to showcase PlyPlus, a parser interface I've been working on, on and off, for the past few years. As a bonus, the result is a safe arithmetic alternative to eval().

If you want to follow the examples on your computer, you can install PlyPlus with pip install plyplus.

Working knowledge of Python is required for the implementation section.

Grammars

For those of you who don't know how parsing and formal grammars work, here is a quick overview: Formal grammars are a hierarchy of rules for parsing text. Each rule matches some portion of the input text, by describing the rules that it's made of.

Take a grammar for addition with a single rule, add, that matches either add + number or number + number. Each pass of the parser will look for one of these two patterns, and if it finds one, convert it to add. Basically, every parser aims to climb the hierarchy as much as possible.

Here are the steps the parser will take:

number + number + number + number    (first pass turns all numbers into a 'number' rule)

[number + number] + number + number    (the parser found its first pattern!)

[add + number] + number    (after converting the pattern, it finds the next one)

[add + number]

add

The sequence of symbols has turned into a hierarchy of two simple rules: number+number and add+number, and if we tell the computer how to solve each of them, it can solve the entire expression for us. In fact, it can solve any sequence of additions, no matter how long! That is the strength of formal grammars.
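The passes above can be imitated in a few lines of plain Python. This is only a sketch of the idea, not how PlyPlus (or any real parser) works internally; the function name reduce_additions and the list-of-strings representation are made up for illustration.

```python
def reduce_additions(symbols):
    """Repeatedly replace a leading `add + number` or `number + number` with `add`."""
    symbols = list(symbols)
    steps = [" ".join(symbols)]
    # Keep reducing while the first three symbols match one of the two patterns.
    while symbols[:2] in (["add", "+"], ["number", "+"]) and symbols[2:3] == ["number"]:
        symbols[0:3] = ["add"]
        steps.append(" ".join(symbols))
    return steps


steps = reduce_additions(["number", "+", "number", "+", "number", "+", "number"])
for step in steps:
    print(step)
```

However long the chain of additions is, the same two patterns are enough to collapse it into a single add.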

Operator Precedence

Arithmetic expressions are not just a linear progression of symbols. Their operators create an implicit hierarchy, which makes them the perfect target for formal grammars:

Suppose add matches add + mul or mul + mul, and mul matches mul * number or number * number. By telling add that it operates on mul, and not on numbers, we are giving multiplications the precedence.
Let's pretend-run this grammar on 1 + 2 * 3 * 4 with our magical parser that is in my head:

number + number * number * number

number + [number * number] * number    (the parser doesn't know what a number+number is, so this is its next pick)

number + [mul * number]

number + mul

???

Now we are in a bit of a pickle! The parser doesn't know what to do with number+mul. We can tell it, but if we keep looking we'll find that there are many possibilities we didn't cover, such as mul+number, add+number, add+add, etc.

So what do we do?

Luckily, we have a trick up our sleeve: We can say that a number by itself is a multiplication, and a multiplication by itself is an addition!

But if mul can become add, and number can become mul, we have extra lines that do nothing. Removing them, we get:

add: add '+' mul
| mul
;
mul: mul '*' number
| number
;

Let's pretend-run on 1 + 2 * 3 * 4 again with this new grammar:

number + number * number * number    (there's no rule for number*number now, but the parser can "get creative")

number + [number] * number * number

number + [mul * number] * number

number + [mul * number]

[number] + mul

[mul] + mul

[add + mul]

add

Success!!!

If this looks like magic to you, try pretend-running on different arithmetic expressions, and see how the expression resolves itself in the correct order every time. Or wait for the next section and let the computer run it for you!
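If you'd rather see the precedence fall out of real code before we get to PlyPlus, here is a small hand-written sketch of the two-rule grammar above. It is an assumption-laden stand-in: the function names (tokenize, parse_mul, parse_add) and the nested-tuple tree are invented for this example, and the left-recursive rules are rewritten as loops, because a hand-written parser like this one can't call itself on the left without looping forever.

```python
import re


def tokenize(text):
    # Numbers, '+' and '*' only; whitespace is simply never matched.
    return re.findall(r"\d+(?:\.\d+)?|[+*]", text)


def parse_mul(tokens, pos):
    # mul: mul '*' number | number
    node, pos = float(tokens[pos]), pos + 1
    while pos < len(tokens) and tokens[pos] == "*":
        node = ("mul", node, float(tokens[pos + 1]))
        pos += 2
    return node, pos


def parse_add(tokens, pos=0):
    # add: add '+' mul | mul
    node, pos = parse_mul(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_mul(tokens, pos + 1)
        node = ("add", node, right)
    return node, pos


tree, _ = parse_add(tokenize("1 + 2 * 3 * 4"))
print(tree)
```

Because add only ever asks for mul operands, the multiplications end up nested inside the addition, which is exactly the hierarchy we got by pretend-running the grammar.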

Running the parser

By now we have a fairly good idea of how we want our grammar to work. Let's apply it and write an actual grammar:

If you want to play with the parser, and feed it expressions by yourself, you can! All you need is Python. Run pip install plyplus and paste the above commands inside python (make sure to put the actual grammar instead of '...' 😉 ).

Shaping the tree

Plyplus automagically creates a tree, but it's not optimal. While putting number inside mul and mul inside add was useful for creating a hierarchy, now that we already have a hierarchy they are just a burden. We can tell Plyplus to "expand" (i.e. remove) rules by prefixing their names with a modifier. A @ will always expand a rule, a # will flatten it, and a ? will expand it if and only if it has one child. In this case, ? is what we want.
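To make the effect of ? concrete, here is a plain-Python sketch of "expand a rule if and only if it has one child". It does not use PlyPlus; the (head, children) tuples and the function expand_single_child are assumptions made up for this illustration, and the sample tree is hand-built to resemble what a parse of 2*3 might look like.

```python
def expand_single_child(node, rules=("add", "mul")):
    """Replace any node in `rules` that has exactly one child with that child."""
    if not isinstance(node, tuple):
        return node  # a leaf token (plain string)
    head, children = node
    children = [expand_single_child(child, rules) for child in children]
    if head in rules and len(children) == 1:
        return children[0]
    return (head, children)


# Hand-built tree for "2*3", with the redundant add and inner mul wrappers.
raw = ("start", [
    ("add", [
        ("mul", [
            ("mul", [("number", ["2"])]),
            ("mul_symbol", ["*"]),
            ("number", ["3"]),
        ]),
    ]),
])

shaped = expand_single_child(raw)
print(shaped)
```

The add wrapper and the inner mul disappear because each held a single child, while the real multiplication, with its three children, stays.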

Parentheses and Other Features

We are missing some obvious features: parentheses, unary operators (-(1+2)), and the ability to put spaces inside the expression. These are all so easy to add at this point that it would be a shame not to.

The important concept is to add a new rule, which we'll call atom. Everything inside the atom (namely parentheses and unary operations) happens before any additions or multiplications (aka binary operations). Since the atom is only a hierarchical construct with no semantic significance, we'll make sure it's always expanded, by adding @ to it.

The obvious way to allow spaces is with something like add: add SPACE add_symbol SPACE mul | mul;, but that's tedious and unreadable. Instead, we will tell Plyplus to always ignore whitespace.
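The same ideas (an atom level that binds tightest, and whitespace that is skipped rather than spelled out in every rule) can be sketched in plain Python. This is a hand-written illustration, not PlyPlus and not its grammar syntax; every name here (TOKEN, tokenize, parse_atom, parse_mul, parse_add, parse) is invented for the example, and error handling is kept minimal.

```python
import re

# The leading \s* skips whitespace before each token, so no rule mentions spaces.
TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|[-+*/()])")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_atom(tokens, pos):
    # atom: '-' atom | '(' add ')' | number
    token = tokens[pos]
    if token == "-":
        operand, pos = parse_atom(tokens, pos + 1)
        return ("neg", operand), pos
    if token == "(":
        inner, pos = parse_add(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("missing closing parenthesis")
        return inner, pos + 1  # the atom itself leaves no node behind
    return float(token), pos + 1


def parse_mul(tokens, pos):
    node, pos = parse_atom(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        symbol = tokens[pos]
        right, pos = parse_atom(tokens, pos + 1)
        node = ("mul", node, symbol, right)
    return node, pos


def parse_add(tokens, pos):
    node, pos = parse_mul(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        symbol = tokens[pos]
        right, pos = parse_mul(tokens, pos + 1)
        node = ("add", node, symbol, right)
    return node, pos


def parse(text):
    tokens = tokenize(text)
    tree, pos = parse_add(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("-(1 + 2) * 3"))
```

Note that the parentheses leave no trace in the tree: they only decide where the inner add ends up, which is the sense in which an atom has no semantic significance of its own.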

Make sure you understand it, so we can proceed to the next step: Calculating!

Calculating!

We can already turn an expression into a hierarchical tree. To calculate it, all we need is to collapse the tree into a number. The way to do it is branch by branch.

This is the part where we start writing code, so I need to explain two things about the tree.

Each branch is an instance with two attributes: head, which is the name of the rule (say, add or number), and tail, which is the list of sub-rules that it matched.

By default, Plyplus removes unnecessary tokens. In our example, the '(' and ')' will already be removed from the tree, as well as neg's '-'. Those of add and mul won't be removed, because they have their own rule, so Plyplus knows they're important. This feature can be turned off to keep all tokens in the tree, but in my experience it's always more elegant to leave it on and change the grammar accordingly.

Okay, we are ready to write some code! We will collapse the tree using a transformer, which is very simple. It traverses the tree, starting with the outermost branches, until it reaches the root. It's your job to tell it how to collapse each branch. If you do it right, you will always run on an outermost branch, riding the wave of its collapse. Dramatic! Let's see how it's done.

Each method corresponds to a rule name. If a method doesn't exist, __default__ is called. In our implementation, we left out start, add_symbol, and mul_symbol, all of which should do nothing but return their only sub-branch.

I use float() to parse numbers, because I'm lazy, but I could have implemented it using the parser as well.

I use the operator module for syntactic beauty. operator.add is basically 'lambda x,y: x+y', etc.
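To show the collapsing idea without depending on any library, here is a self-contained sketch of a bottom-up transformer. It is a stand-in, not PlyPlus's own transformer class: Tree, Calc, transform and default are hypothetical names for this example, the tree is built by hand, and operator.truediv is used for division on the assumption that you are running Python 3.

```python
import operator


class Tree:
    """A branch with a rule name (head) and the list of things it matched (tail)."""

    def __init__(self, head, tail):
        self.head = head
        self.tail = tail


class Calc:
    def transform(self, tree):
        if not isinstance(tree, Tree):
            return tree  # a plain token such as "2" or "+"
        # Collapse the children first, then hand the branch to its method.
        collapsed = Tree(tree.head, [self.transform(child) for child in tree.tail])
        handler = getattr(self, tree.head, self.default)
        return handler(collapsed)

    def _bin_operator(self, exp):
        arg1, symbol, arg2 = exp.tail
        func = {
            "+": operator.add,
            "-": operator.sub,
            "*": operator.mul,
            "/": operator.truediv,
        }[symbol]
        return func(arg1, arg2)

    add = _bin_operator
    mul = _bin_operator

    def number(self, exp):
        return float(exp.tail[0])

    def neg(self, exp):
        return operator.neg(exp.tail[0])

    def default(self, exp):
        # Rules without a method (start, add_symbol, mul_symbol) pass their child through.
        return exp.tail[0]


# Hand-built tree for "1 + 2 * 3".
tree = Tree("start", [
    Tree("add", [
        Tree("number", ["1"]),
        Tree("add_symbol", ["+"]),
        Tree("mul", [
            Tree("number", ["2"]),
            Tree("mul_symbol", ["*"]),
            Tree("number", ["3"]),
        ]),
    ]),
])

result = Calc().transform(tree)
print(result)
```

By the time a method such as add runs, its tail already holds plain numbers and an operator symbol, so each method only has to deal with one branch.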

with the new grammar. As far as I can see, parsing is done from left to right. So, what makes our grammar skip the first "number" while searching for a pattern? Also, after creating the first "mul", what makes the grammar skip transforming "mul -> add" and check for "mul + number"?

Parsers will only match a rule if it leads to a full solution of the text. An error occurs if there are no solutions, or if there is more than one solution (usually). How do they do it? That's complex parser theory, but in simple terms, an LR parser keeps a state, and it will only match a rule if it maintains a 'good' state. A pattern that leads it into a 'bad' state is ignored.