At this point we've got a pretty solid recursive descent parser which makes it easy to describe parsers for data which are presented largely sequentially: where what we parse earlier in the stream dictates how we parse what comes later. Formally, our parser is good for LL(k) grammars and, as we saw in the previous post, it's pretty easy to build a parser for JSON, which is LL(1).

Infix notation

We humans tend to read and write sequentially; hence, many representations of data have this sequential characteristic. One important exception to this is mathematical notation. Consider a calculator application in which we want to parse the input:

+ 3 * 4 2

This is prefix notation. Like S-expressions and JSON our parser is very happy to process this type of input because it's sequential: we first read the + which tells us to "recursively parse two more expressions and add them". The operators * and + come before their operands. Sadly, no one wants to write a formula like this. We'd far rather use infix notation in which the operators come between the operands:

3 + 4 * 2

Here we have to read the 3 and put it in a safe place before we get to the +. Now at least the parser knows that it needs to read one more expression. But should it read just the 4, or the complete remainder of the string 4 * 2? At school I was taught BIDMAS, which means that I know that the multiplication should happen before the addition. That's the hard bit about infix notation: our parser must take precedence (and associativity1) into account. These are properties of the operators which dictate how terms are grouped; they provide implicit parentheses in our string. BIDMAS tells us that * has higher precedence than +, so our example should be grouped as

3 + 4 * 2 = 3 + (4 * 2)

Now we know that after parsing 3 and + we should parse the rest of the string. But the parser won't know this until it has parsed the 4 (also stashing that somewhere safe) and got to the *.

If we want this parser framework to be a serious tool with which a real programming language could be parsed, we'll need to be able to handle infix notation. It is possible to build an infix parser with our current parsing tools with precedence climbing. But this would be painful to implement, hard to debug, and inefficient. It would also be more static than I want, in that it would only offer a fixed number of precedence levels. I would like to make it relatively easy for new operators to be added dynamically to the parser. After all, Swift has this feature, and I have benefitted from it greatly in making this library.

Pratt parser

The recursive descent approach does almost everything we want from a parser succinctly and cleanly. I therefore want to make only small changes to the framework to accommodate operator precedence. To our rescue comes the Pratt parser (Wikipedia article). The introduction to this paper does a nice job of summing up the raison d'être of this little project

...we find some language implementers willing to incorporate as much syntax as possible provided they do not have to work hard at it...

I won't give a tutorial of how the Pratt parser works. You can either read the paper, or check out some other online resources: Wikipedia, blog, blog. Instead I'll sketch out what you need to know to drive this part of framework.

Calculator demo

To demonstrate how to handle infix (and prefix, and postfix) notation, let's build a simple calculator demo. There are lots of static parts to this so, like the JSON parser, I will house them as static members of a struct:

I sure do love me some strong typing. Swift occupies a middle ground between more functional languages (F#, ML, Haskell), and more C-like languages (C, C++, C#). I think that, on the whole, Swift does a grand job of bridging these worlds; but there are a few warts, generic signatures like this being one of them. As you can see, OperatorPrecedence is constructed with two sub-parsers. opFormat is used to parse Strings which represent operator symbols. The primary parser is used to parse primary expressions: those things which the operators stick together. For example, in the expression:

3 + y * 5

+ and * are the operators; 3, y, and 5 are the primary expressions. Primary elements are the most tightly-bound in an expression and have higher precedence than any other operator we might add.

For this simple tool, the opFormat looks for one of the single-character symbols +, *, %, \, or ^. These have their usual meanings for anyone familiar with C, with the exception of ^, which I'm using to represent exponentiation: i.e., we represent 23 as 2^3.

The primary parser is more complicated. Whenever LateBound appears we know there's going to be some recursion. In this case it's because we want our calculator to be able to handle expressions containing parentheses. We do this by treating parenthesized subexpressions as primary, capturing their maximum precedence2.

I am pleased how simple it is to define the function call parsers. The sin, cos, etc. at the end of each line are the actual runtime functions provided by Swift, and we're using the pipe operator |> to incorporate them into a parser.

So much for static components. As I mentioned above, one of my desiderata for the parser was that it would be easy to add operators dynamically. This is done by adding OperatorHandlers. OperatorHandler is an enum:

where each case holds a "builder" method which implements the operation, and an integer precedence. LeftInfix and RightInfix are left- and right-associative infix operators, which take two arguments. Prefix and Postfix operators are unary, and can be thought of as infix operators which only have a right or left hand side, respectively.

In our calculator there's now a one-time initialization (no static constructors in Swift) to setup the things I couldn't fit into declarative statements. This includes adding the operators and their handlers:

A higher precedence value indicates that an operator binds tighter. So with the precedences and associativities I've used here, we'll get

3 + 4 ^ 6 * 8 + 2 = (3 + ((4 ^ 6) * 8)) + 2

Because it's common for some symbols to be used both as infix and prefix, with different meanings, I've made sure OperatorPrecedence can handle this. The most common examples of this are unary + and -. In the expression

9 * -(4 - 2)

the first - character is a prefix operator indicating unary negation, whereas the second - is the infix symbol for binary subtraction.

And—modulo a bit of plumbing to honor the Parser protocol—that's pretty much it (see the full structhere). As always, feel free to pull the code from GitHub and have a go!

Next

As listed in the previous post, the last piece of this puzzle is high-quality error handling. Actually, any error handling would be an improvement.

Associativity tells us how operators of like precedence are grouped. If we say that + is left associative, and * is right associative we get the groupings:

7 + 9 + 3 = (7 + 9) + 3
7 * 9 * 3 = 7 * (9 * 3)

I wasn't taught about this at school because we only dealt with associative algebras where it didn't matter... at least, I'm pretty sure I didn't do the cross product until University.↩

This could also be achieved by making the ( and ) symbols be pre- and postfix operators with appropriate precedences and semantics. I think this way is clearer.↩

Posts About Swift

Swift is a new language from Apple. At first glance there's a lot to like about this, so I thought I'd spend some time learning it. These are some blog posts about things that come up as I figure things out!