Parsing the untyped \(\lambda\)-calculus with Parsec

Or, "Parsing combinators with parser combinators"

Posted on June 24, 2015

The book Types and Programming Languages (briefly, TAPL) is a popular introduction to type systems and programming language theory. Starting with the untyped \(\lambda\)-calculus, TAPL walks the reader through the construction of a simple expression-based language, focusing on type-checking and evaluation. One of the first exercises is an evaluator for the untyped \(\lambda\)-calculus, in OCaml.

I’ve been working through the book in Haskell, which involves a pretty straightforward transcription from OCaml to Haskell. While the book gives an implementation of the evaluator, it doesn’t include any discussion of parsing \(\lambda\)-expressions such as \(\lambda x.\lambda y.x\;y\). Instead, to play around with the evaluator you must pass it an encoding of the term. That’s a real hassle, so let’s build a parser for such expressions.

The heavy lifting for this parser comes courtesy of the Haskell library Parsec. Parsec provides a monadic parsing system, which along with do notation provides a nice DSL for parsing. First, let’s import what we need:
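As a rough reconstruction rather than the original listing, the imports and the term data type might look like the sketch below. It assumes Parsec's top-level Text.Parsec module and elemIndex from Data.List, and follows TAPL's term representation; the exact shape of Info and the derived instances are assumptions.

```haskell
import Data.List (elemIndex)
import Text.Parsec

-- | Source location of a term: row, then column (assumed layout).
data Info = Info Int Int
  deriving (Show, Eq)

-- | Terms of the untyped lambda-calculus, after TAPL.
data Term
  = TmVar Info Int Int      -- de Bruijn index, then size of the enclosing context
  | TmAbs Info String Term  -- name of the bound variable, then the body
  | TmApp Info Term Term    -- function term, then argument term
  deriving (Show, Eq)
```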

Info is used to hold row and column information about the terms as they are parsed, in case such information is necessary for error messages later.

The data constructors TmAbs and TmApp take predictable arguments - for an abstraction we track the name of the variable as a string, and an application stores the two terms involved. But why are there numbers stored in each TmVar?


The first number is the de Bruijn index[1], which cleverly encodes variables in a nameless representation by storing “how far” the variable is from its binding \(\lambda\). More precisely, the number counts how many other \(\lambda\)-abstractions (which can simply be called “binders”) sit between the variable occurrence and the \(\lambda\) that binds it. So, for example, the identity term \(\lambda x.x\) can be written as \(\lambda.0\), and our friend \(\lambda x.\lambda y.x\;y\) from before becomes \(\lambda.\lambda.1\;0\). This nameless representation does away with any issues caused by name collisions; more information about its advantages can be found in the Wikipedia article on de Bruijn indices.

In order to calculate a variable’s de Bruijn index, we will need to keep track of a list of bound variables. Hence we will use the following type alias:

type BoundContext = [String]

The second number in the TmVar data constructor stores how many bound variables are in the variable’s scope, and is used as a sanity check in TAPL’s evaluator.
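To make the two numbers concrete, here is a minimal sketch of how both can be read off a context list. The helper lookupVar and the example names are hypothetical and are not part of the parser itself; the context is assumed to list the innermost binder first.

```haskell
import Data.List (elemIndex)

type BoundContext = [String]

-- | Hypothetical helper: the de Bruijn index of a name together with the
-- context length, or Nothing when the name is free.
lookupVar :: String -> BoundContext -> Maybe (Int, Int)
lookupVar name ctx = do
  idx <- elemIndex name ctx
  return (idx, length ctx)

-- Inside the body of \x.\y. the innermost binder comes first: ["y", "x"].
exampleX, exampleY, exampleZ :: Maybe (Int, Int)
exampleX = lookupVar "x" ["y", "x"]  -- Just (1,2)
exampleY = lookupVar "y" ["y", "x"]  -- Just (0,2)
exampleZ = lookupVar "z" ["y", "x"]  -- Nothing: z is free
```

These values agree with the nameless form \(\lambda.\lambda.1\;0\) of \(\lambda x.\lambda y.x\;y\).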

Munging info

Before we start writing the parser, we’ll need a convenience function which will produce the Info we need during parsing. Parsec tracks its position within the source as it parses with the SourcePos type. We will use this to grab the row and column position:
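A sketch of such a helper is below, assuming Info simply wraps a row and a column (that layout is an assumption). sourceLine and sourceColumn are Parsec's accessors for SourcePos; startInfo is a hypothetical usage example.

```haskell
import Text.Parsec

-- | Row and column of a term (assumed layout).
data Info = Info Int Int
  deriving (Show, Eq)

-- | Extract the row and column from Parsec's position type.
infoFrom :: SourcePos -> Info
infoFrom pos = Info (sourceLine pos) (sourceColumn pos)

-- Hypothetical usage: ask Parsec where it is before consuming any input.
-- Parsec starts counting at line 1, column 1.
startInfo :: Either ParseError Info
startInfo = runParser (infoFrom <$> getPosition) () "example" ""
```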

In order to use this function, we of course need a SourcePos to call it on. To get one of these, we first need to know how building parsers in Parsec works.

Parser combinators

Parsec parsers are built up by composing a variety of parser combinators. A combinator is technically a function with no free variables, i.e. one depending only on its arguments; some common examples are the identity \(I \equiv \lambda x.x\), or the constant function \(K \equiv \lambda x.\lambda y. x\). In the world of functional programming, however, our mental model of a combinator is not necessarily this definition. Instead, we think of combinators as simple, self-contained building blocks with which we can construct more complicated functions. For example, the “SKI combinator calculus” is a system which only allows us to work with the combinators \(K\) and \(I\) above, as well as the substitution combinator \(S \equiv \lambda x.\lambda y.\lambda z.(x\;z)\;(y\;z)\). We can apply them to each other; for example, \(I S = S\). From these simple combinators we can build much more complex ones; an interesting example is \(S I I\), which takes some input and applies it to itself. In fact, any expression in the untyped \(\lambda\)-calculus can be written as a combination of the \(S\), \(K\), and \(I\) combinators!
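For a feel of how these combinators compose, here they are as ordinary Haskell functions. This is only an illustration: the lowercase names are chosen for this sketch, and \(S I I\) is left out because self-application does not type-check in Haskell. Instead, the sketch shows that \(S K K\) behaves like \(I\).

```haskell
-- The three combinators as ordinary Haskell functions.
s :: (a -> b -> c) -> (a -> b) -> a -> c
s x y z = x z (y z)

k :: a -> b -> a
k x _ = x

i :: a -> a
i x = x

-- S K K behaves like I, since s k k x = k x (k x) = x.
skk :: a -> a
skk = s k k
```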

This same spirit of complexity via composition drives Parsec. The library provides some simple parsers, like letter, which matches a single letter, or char c, which matches whatever character c is. Parsers have the type Parsec s u a, which we can break down like so:

s is the type of the input, such as String

u is the type of the “user state”, i.e. whatever data you want to carry around as you parse

a is the type of the parser’s output

In our case, we will be parsing Strings into Terms, and we will need to carry around a context storing which \(\lambda\)-abstractions we’ve seen in order to convert to de Bruijn notation, which will be a list of Strings as we mentioned earlier. So our final parser will have type Parsec String BoundContext Term. That’s a bit of a mouthful, so let’s use a type alias:

type LCParser = Parsec String BoundContext Term

These basic parsers can be combined into more complex beasts with a number of provided functions. One of the usual suspects is the infix function <|> (which you may recognize from the Alternative typeclass). If p and q are two parsers, then p <|> q is a parser which tries parsing with p, and if that fails, parsing with q. So letter <|> char '\'' matches either a letter, or a “prime” ’.

In fact, this is part of the first building block we will need. We will allow variables which are strings consisting of letters or primes, such as “x”, “y”, “x’”, or “lol”. The parser for this is

parseVarName :: Parsec String u String
parseVarName = many1 $ letter <|> char '\''

The stranger here is many1, which is a rather predictable function. Given a parser p, many1 p will match 1 or more of the things p parses. In our case, this means 1 or more letters or primes, i.e. a string like those described above. Note that the type of the state is left as a variable.

In order to use a parser, we need to run it. Let’s give ourselves a helper function for running the parsers we make as we go:
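A sketch of such a helper, built on Parsec's runParser, is below. The exact type signature is an assumption; the empty initial state and the source name follow the description in the text.

```haskell
import Text.Parsec

type BoundContext = [String]

-- | Run a parser on a string, starting from an empty context.
parseWith :: Parsec String BoundContext a -> String -> Either ParseError a
parseWith p = runParser p [] "untyped lambda-calculus"
```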

As the type signature suggests, parseWith takes a parser and a string and gives you either a parsing error or whatever the output of the parser is. The empty list we hand it will be used later as the initial state for our parser (an empty context). The string “untyped lambda-calculus” is used as the source name when Parsec prints errors.


Here are a few examples of using the variable name parser. Notice what it accepts and rejects[2]:
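The inputs below are illustrative choices, not necessarily the ones from the original session, and parseWith is restated with an assumed signature so the block stands alone.

```haskell
import Text.Parsec

type BoundContext = [String]

parseVarName :: Parsec String u String
parseVarName = many1 $ letter <|> char '\''

parseWith :: Parsec String BoundContext a -> String -> Either ParseError a
parseWith p = runParser p [] "untyped lambda-calculus"

accepted1, accepted2, partial, rejected :: Either ParseError String
accepted1 = parseWith parseVarName "x"      -- Right "x"
accepted2 = parseWith parseVarName "x'"     -- Right "x'"
partial   = parseWith parseVarName "lol42"  -- Right "lol": stops at the first digit
rejected  = parseWith parseVarName "42lol"  -- Left ...: no leading letter or prime
```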

Notice that when the parser hits an invalid character right off the bat, it fails, because we wanted 1 or more characters. But if it has some valid characters and hits an invalid one, it stops parsing and returns the good stuff. Then it can continue trying another parser on the invalid part in more complex parsers.

Monadic parsing

The type Parsec s u, with the a dropped, has kind * -> *, i.e. it is a type constructor, like Maybe or Either a. Once we fix a type for the input and the user state, Parsec s u is a monad. Recall that to make a monad out of a type constructor m, one must provide implementations of the functions return :: a -> m a and (>>=) :: m a -> (a -> m b) -> m b. For Parsec parsers, these functions work like so:

return

return x creates a parser which reads no input, and outputs x. For example:

parseWith (return "output1") ""
parseWith (return "output2") "This is not read."

Code output:

Right "output1"
Right "output2"

Bind, i.e. (>>=)

p >>= f runs p, then passes the output of parsing with p to f. Recall the type signature for (>>=): in this case, Parsec s u a -> (a -> Parsec s u b) -> Parsec s u b. So passing the output of p to f gives us a new parser, which is then run on the remaining input. Here is a particularly contrived example:

announceLetter c = return $ "The first letter is " ++ [c]
parseWith (letter >>= announceLetter) "abc"

Code output:

Right "The first letter is a"
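Since do notation is just sugar for (>>=), the same parser can be written both ways. The sketch below restates announceLetter with an explicit type signature (an addition for this example); withBind and withDo are hypothetical names.

```haskell
import Text.Parsec

announceLetter :: Char -> Parsec String u String
announceLetter c = return $ "The first letter is " ++ [c]

-- Written with an explicit bind...
withBind :: Parsec String u String
withBind = letter >>= announceLetter

-- ...and the equivalent do block.
withDo :: Parsec String u String
withDo = do
  c <- letter
  announceLetter c
```

Both parsers produce the same result on the same input.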

It’s worth looking at what (>>) does as well, even though it can be derived from (>>=). p >> q is a parser which runs p on the input, discards the result, then runs q on the remaining input. So, for example:
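A small illustration, with names chosen for this sketch: skip a leading backslash with (>>), then return the letter that follows it.

```haskell
import Text.Parsec

-- Match a backslash, throw it away, then return the letter after it.
afterBackslash :: Parsec String u Char
afterBackslash = char '\\' >> letter

example :: Either ParseError Char
example = runParser afterBackslash () "example" "\\x"  -- Right 'x'
```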

Parsing terms

Let’s begin building the parsers for the different types of terms. The abstraction parser is the most involved, and lays the groundwork for the stateful part of the parsing, so we will start with that.
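What follows is a sketch of the abstraction parser, reconstructed from the step-by-step description in this section rather than copied from the original listing. The Info and Term definitions and infoFrom are repeated here as assumptions so the block stands alone, and drop 1 is used to pop the context so that an empty list cannot cause a crash.

```haskell
import Text.Parsec

data Info = Info Int Int deriving (Show, Eq)  -- row, column (assumed layout)

data Term
  = TmVar Info Int Int
  | TmAbs Info String Term
  | TmApp Info Term Term
  deriving (Show, Eq)

type BoundContext = [String]
type LCParser = Parsec String BoundContext Term

infoFrom :: SourcePos -> Info
infoFrom pos = Info (sourceLine pos) (sourceColumn pos)

parseVarName :: Parsec String u String
parseVarName = many1 $ letter <|> char '\''

-- | Parse "\x.body"; the parser for the body is passed in as an argument.
parseAbs :: LCParser -> LCParser
parseAbs termParser = do
  _ <- char '\\'        -- the backslash that starts the abstraction
  v <- parseVarName     -- the bound variable's name
  modifyState (v :)     -- push it onto the context
  _ <- char '.'         -- the dot separating binder and body
  body <- termParser    -- the body, parsed with v in scope
  modifyState (drop 1)  -- pop v again: we are leaving its scope
  pos <- getPosition
  return $ TmAbs (infoFrom pos) v body
```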


First, we match a backslash, which begins the \(\lambda\)-abstraction (the backslash syntax is inspired by Haskell). Next, we parse the subsequent variable name and store it. As we mentioned before, the state we carry around is a list of bound variables, so after we see the variable we push it onto the front of the list using modifyState, which applies the given function to the state. Next we pass by the dot after the variable, and parse the term in the body of the \(\lambda\)-abstraction. Note that we haven’t defined a parser for general terms yet; we can define it once we’ve laid out how to parse each type of term[3].

After parsing the body term, we pop the abstraction’s variable off of the context list, since we are leaving the scope of the abstraction. Having completed the parsing, we grab the SourcePos using getPosition and return a TmAbs filled in with all the necessary data we’ve parsed.

Now let’s move on to parsing variables. When we parse a variable, we need to return a TmVar with the correct de Bruijn index. This index is the position of the variable in the context list, which is the state we store while parsing. If the variable name isn’t found in the list, then it hasn’t been bound anywhere and is free. This poses a small challenge: what number should we use for the index of a free variable? In TAPL, the author defines a function for printing elements of Term as normal lambda expressions, but this function has no support for free variables (it prints an error in their presence). We will likewise sidestep the challenge of indexing and naming free variables by only parsing terms with no free variables (i.e., combinators). Hence the alternate title for the post: “Parsing combinators with parser combinators”.

Below, we see an implementation for the variable parser:

parseVar :: LCParser
parseVar = do
  v <- parseVarName
  list <- getState
  findVar v list

findVar :: String -> BoundContext -> LCParser
findVar v list = case elemIndex v list of
  Nothing -> fail $ "The variable " ++ v ++ " has not been bound"
  Just n  -> do
    pos <- getPosition
    return $ TmVar (infoFrom pos) n (length list)

It works as we’ve discussed: first, we parse a variable name, then grab the BoundContext list from the parser state. The findVar function takes the variable name and list of bound variables, and returns a TmVar with the appropriate index when it can, failing otherwise.

Finally, we need a parser which can handle applications. Now, ideally, once we had our application parser parseApp, we would be able to say something like:

parseTerm = parseApp <|> parseAbs <|> parseVar

However, this would lead to an infinite loop: the parseApp function would make a call to parseTerm for each space-separated term in the application. Moreover, parseApp must show up before parseAbs in the definition of parseTerm, because otherwise in a case like “\(\lambda x.x \; \lambda y.y\)” the abstraction parser would consume the first abstraction, which is awfully short-sighted because then the parser doesn’t see the entire term as an application. But this means that when parseApp makes its call to parseTerm, it will just repeatedly call parseApp over and over again, as that is the first parser it tries when running parseTerm.

We can fix this by parsing application terms and non-application terms separately. When we want to parse an application, we run the non-application parser on a space-separated series of terms. Since application in the \(\lambda\)-calculus is left-associative, we can parse a string like “M N O”, where M, N, and O are terms, as “(M N) O”. Parsec includes a function which can help us in this situation:

chainl1 :: Parsec s u a -> Parsec s u (a -> a -> a) -> Parsec s u a

Essentially, chainl1 p q is a parser which matches 1 or more occurrences of whatever p parses, separated by whatever q parses, then combines the results in a left fold using the functions returned by the q parser. You can see it used in practice in the final part of our parser:
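The block below is a sketch of how the pieces might fit together, not the original listing. It follows the strategy described above: parseNonApp handles everything except application, and parseTerm chains non-application terms separated by a single space. The parens helper, the inlined variable lookup, and the repeated definitions are assumptions made so the block stands alone.

```haskell
import Data.List (elemIndex)
import Text.Parsec

data Info = Info Int Int deriving (Show, Eq)  -- row, column (assumed layout)

data Term
  = TmVar Info Int Int
  | TmAbs Info String Term
  | TmApp Info Term Term
  deriving (Show, Eq)

type BoundContext = [String]
type LCParser = Parsec String BoundContext Term

infoFrom :: SourcePos -> Info
infoFrom pos = Info (sourceLine pos) (sourceColumn pos)

parseVarName :: Parsec String u String
parseVarName = many1 $ letter <|> char '\''

parseVar :: LCParser
parseVar = do
  v <- parseVarName
  list <- getState
  case elemIndex v list of
    Nothing -> fail $ "The variable " ++ v ++ " has not been bound"
    Just n -> do
      pos <- getPosition
      return $ TmVar (infoFrom pos) n (length list)

parseAbs :: LCParser -> LCParser
parseAbs termParser = do
  _ <- char '\\'
  v <- parseVarName
  modifyState (v :)     -- push the bound variable
  _ <- char '.'
  body <- termParser
  modifyState (drop 1)  -- pop it when leaving the abstraction's scope
  pos <- getPosition
  return $ TmAbs (infoFrom pos) v body

-- | Assumed helper: a parser wrapped in parentheses.
parens :: Parsec String u a -> Parsec String u a
parens = between (char '(') (char ')')

-- | Anything that is not an application: (M), \x.M, or x.
parseNonApp :: LCParser
parseNonApp = parens parseTerm <|> parseAbs parseTerm <|> parseVar

-- | One or more non-application terms, folded to the left with TmApp.
parseTerm :: LCParser
parseTerm = chainl1 parseNonApp $ do
  _ <- space
  pos <- getPosition
  return $ TmApp (infoFrom pos)

parseWith :: Parsec String BoundContext a -> String -> Either ParseError a
parseWith p = runParser p [] "untyped lambda-calculus"
```

With these definitions, parseWith parseTerm "\\x.\\y.x y" should produce nested TmAbs nodes whose body applies TmVar with index 1 to TmVar with index 0, matching \(\lambda.\lambda.1\;0\), while a term with a free variable such as "\\x.y" is rejected. One limitation of this sketch is that a trailing space before a closing parenthesis or the end of input is a parse error.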

Coda


Once you get the hang of it, Parsec makes writing parsers pretty fun. The parser combinator approach seems near-fetishized in the Haskell community; one oft-cited reason for their greatness is the fact that parser combinators allow us to write parsers in the host language (Haskell in this case) without needing to write a specification in some other language (the Yacc/Bison approach[4]). Having little experience with parsers myself, I can’t attest to this particular strength, but the fact that I could knock out a small parser in one sitting having never worked with Parsec before is a testament to its ease of use.

If you would like to see the parser implementation together in one place, instead of spread throughout this post, you can find it here. The parser and evaluator can be found together in this folder.

[1] Caveat: The Wikipedia article starts numbering at 1, but TAPL (and this post, as a result) start numbering at 0. So \(\lambda x.x\) is \(\lambda.1\) in the Wikipedia article, but \(\lambda.0\) for our purposes. Thanks platz for pointing this out.

[2] The result of a call to parseWith is Either ParseError a. A successful parsing attempt will return Right x, where x is whatever was parsed. If there is a parsing error, we get a Left err instead, where err is a ParseError. An explanation of what Left and Right are can be found here.

[3] We take the term parser in as an argument to parseAbs so that we can develop the parser step-by-step without IHaskell complaining that the term parser is undefined. The abstraction parser depends on the term parser and vice versa. If this was just in one file, then we could refer to the term parser directly.

[4] Yacc is a parser generator, which means you write the grammar for the language you want to parse, and Yacc will spit out a parser for such a language in C or Java. Bison is the GNU version of Yacc, with a punning name in the GNU tradition.