Parsing, CFGs, and Type Hacking

This is what I have been playing with for the last day or so.

Haskell has a very nice monadic parser library for predictive parsing (parsec), and a decent lex/yacc-style parser and lexer generator suite (happy and alex). That said, since it’s more fun and educational to write code than to worry about what’s already been written, I set out to do something similar. In particular, my goals are:

All the language extensions I’ll be using. This is the bare-bones list; the original list was eight or nine lines. So, if you were wondering whether this is a good post for new Haskellers just learning the language, there’s your answer!

I need some way to represent variables (in the CFG sense). In order to ensure that everything is well-typed, I need some way to keep track of the type of the semantic value associated with each variable. Here’s what I did.

data Var a = Var String

And a right-hand side of each rule will have a sequence of variables and terminals. Again, to keep the type information around, I’ll need a sort of cons operator at the type level. Here is that. I defined a type, and also an operator that makes the type easier to use.

data RHS a b = RHS a b
(&) = RHS -- a convenient operator
infixr 5 &
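To see how these two types keep track of semantic types, here is a minimal sketch. The non-terminals `expr` and `term`, the value `sumRhs`, and the helper `showSumRhs` are made up for illustration and are not from the original code; the explicit type signature on `(&)` is also my addition.

```haskell
-- A variable (non-terminal). The type parameter is a phantom: it records
-- the type of the semantic value, while only the name exists at runtime.
data Var a = Var String

-- Type-level "cons" for right-hand sides.
data RHS a b = RHS a b

(&) :: a -> b -> RHS a b
(&) = RHS
infixr 5 &

-- Hypothetical non-terminals, both carrying Int semantic values.
expr, term :: Var Int
expr = Var "expr"
term = Var "term"

-- The right-hand side  term '+' expr.  Because (&) is right-associative,
-- this is RHS term (RHS '+' expr), and its type lists every symbol.
sumRhs :: RHS (Var Int) (RHS Char (Var Int))
sumRhs = term & '+' & expr

-- Render a right-hand side of exactly this three-symbol shape.
showSumRhs :: RHS (Var a) (RHS Char (Var b)) -> String
showSumRhs (RHS (Var l) (RHS c (Var r))) = l ++ " " ++ [c] ++ " " ++ r
```

The written-out signature on `sumRhs` is only there to show what the compiler would infer anyway.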

And next, things get hard. I’m using a multiparameter type class, in the fine tradition of Haskell type hacking, to represent a relation between types. My relation is defined in the following comment:

{-
    (Action a b c) means the following:
        A production with a right hand side of type:     a
        may be associated with a semantic rule of type:  b
        to produce a rule with semantic result type:     c
-}
class Action a b c | a b -> c, a c -> b

In other words, the Action class will be used to ensure that the result type of a grammar production, the right-hand side of the production, and the type of the associated semantic rule are all consistent with each other. The functional dependencies simply assert that if you know the types of the right-hand side and the semantic rule, this is enough to determine what the result will be after applying the semantic rule; and that if you know the right-hand side and the result type, this is enough to determine what the type of the semantic rule needs to be.

There are three base cases for this relation:

instance Action (Var x) (x->y) y

This says that if a production has the form A -> B, where A has semantic values of type y, and B has semantic values of type x, then the semantic rule must have type (x->y). If you think about it, this should make sense.

instance Action Char y y

This rule says that if the right-hand side of a production is a single character (a terminal, not a variable), the semantic rule should be a constant that matches the semantic type for the left-hand variable.

instance Action () y y

This describes the situation for empty productions (sometimes called epsilon or lambda productions). Since leaving out any terms on the right-hand side of a rule isn’t an option, I use (), called “unit”, to represent an empty right-hand side.

Those are the base cases. (As a side comment, only the last one is strictly necessary; the first two are basically just syntactic convenience. See below.) Here’s how rules with more than one symbol on the right-hand side are handled.

instance (Action a b c) => Action (RHS (Var x) a) (x -> b) c

The RHS operator defined earlier is used to build a list of sorts. This rule says that adding a variable to the beginning of the right-hand side of a rule requires adding a parameter to the beginning of the semantic action, and that the result type stays the same. This case handles right-hand sides that begin with a variable.

instance (Action a b c) => Action (RHS Char a) b c

Finally, this case handles right-hand sides that begin with a terminal (a character). The types of the semantic rule and result don’t change, since a terminal is known in advance, so there’s no need for it to carry semantic information.
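Putting the class and its five instances together, the following sketch shows which productions the relation accepts. The extension list is my assumption about the minimum these instances need on a current GHC, not the author's own list. The helper `wellTyped` and the non-terminals `num` and `total` are hypothetical names introduced only for this example.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}
{-# LANGUAGE FlexibleInstances, UndecidableInstances #-}
-- Assumed extension set: the recursive instances only satisfy the liberal
-- coverage condition, which is why UndecidableInstances is listed.

data Var a = Var String
data RHS a b = RHS a b

(&) :: a -> b -> RHS a b
(&) = RHS
infixr 5 &

class Action a b c | a b -> c, a c -> b

instance Action (Var x) (x -> y) y
instance Action Char y y
instance Action () y y
instance Action a b c => Action (RHS (Var x) a) (x -> b) c
instance Action a b c => Action (RHS Char a) b c

-- Hypothetical helper: it ignores its arguments, so all the work happens
-- in the type checker when the Action constraint is solved.
wellTyped :: Action a b c => Var c -> a -> b -> Bool
wellTyped _ _ _ = True

num, total :: Var Int
num   = Var "num"
total = Var "total"

okSum, okSingle, okViaUnit, okChar, okEmpty :: Bool
okSum     = wellTyped total (num & '+' & total) (+)  -- two variables, two arguments
okSingle  = wellTyped total num negate               -- single-variable base case
okViaUnit = wellTyped total (num & ()) negate        -- same rule, ending in ()
okChar    = wellTyped num '7' 7                      -- terminal: a constant
okEmpty   = wellTyped total () 0                     -- empty production

-- Rejected at compile time if uncommented: the right-hand side has two
-- variables, but negate takes only one argument.
-- bad = wellTyped total (num & '+' & total) negate
```

`okSingle` and `okViaUnit` describe the same production, which illustrates why only the `()` base case is strictly necessary.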

Some more syntactic convenience makes it easier to write grammars. Here I abuse monads to take advantage of the special syntax.

So a rule consists of a left-hand side, a right-hand side, and a semantic rule. They are constrained to match each other by the Action class defined above. A RuleSet is basically just a writer monad for lists of rules, but I defined it by hand just for the fun of it.
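The original definitions are not shown here, so what follows is only a sketch of one way a `Rule` and a hand-written `RuleSet` writer could look. The existential encoding of `Rule`, the helpers `addRule`, `ruleLhs`, and `ruleNames`, the example non-terminals, and the extension list are all assumptions of mine. Current GHC also requires `Functor` and `Applicative` instances alongside `Monad`.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}
{-# LANGUAGE FlexibleInstances, UndecidableInstances #-}
{-# LANGUAGE ExistentialQuantification #-}

data Var a = Var String
data RHS a b = RHS a b

(&) :: a -> b -> RHS a b
(&) = RHS
infixr 5 &

class Action a b c | a b -> c, a c -> b
instance Action (Var x) (x -> y) y
instance Action Char y y
instance Action () y y
instance Action a b c => Action (RHS (Var x) a) (x -> b) c
instance Action a b c => Action (RHS Char a) b c

-- Assumed encoding: a rule hides its three types but keeps the Action
-- constraint that forces them to agree.
data Rule = forall a b c. Action a b c => Rule (Var c) a b

-- A writer monad specialised to lists of rules, written by hand.
data RuleSet x = RuleSet [Rule] x

instance Functor RuleSet where
  fmap f (RuleSet rs x) = RuleSet rs (f x)

instance Applicative RuleSet where
  pure = RuleSet []
  RuleSet rs f <*> RuleSet rs' x = RuleSet (rs ++ rs') (f x)

instance Monad RuleSet where
  RuleSet rs x >>= k = case k x of
    RuleSet rs' y -> RuleSet (rs ++ rs') y

-- Hypothetical helper: emit a single rule.
addRule :: Action a b c => Var c -> a -> b -> RuleSet ()
addRule lhs rhs sem = RuleSet [Rule lhs rhs sem] ()

ruleLhs :: Rule -> String
ruleLhs (Rule (Var n) _ _) = n

ruleNames :: RuleSet x -> [String]
ruleNames (RuleSet rs _) = map ruleLhs rs

num, total :: Var Int
num   = Var "num"
total = Var "total"

exampleRules :: RuleSet ()
exampleRules = do
  addRule total (num & '+' & total) (+)
  addRule total num id
  addRule num '1' 1
```

Each statement in the `do` block appends one rule, so the rules come out in the order they were written.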

It took a while to pick this. All the good arrow-like operators seem to be taken! Nevertheless, it does the job we want fairly well. Notice that even though I’m using an infix operator, there are three operands. The normal usage looks like this:

lefthand ==> righthand $ semanticrule

You’ll see examples coming up.
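As a rough sketch of how a three-operand use of a binary operator can parse, here is a simplified `==>` that returns a bare `Rule` rather than living in the monad. The fixity `infix 1` is my guess (anything below the 5 of `&` and above the 0 of `$` would behave the same way here), and the existential `Rule` type, `ruleLhs`, and the example non-terminals are assumptions as well.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}
{-# LANGUAGE FlexibleInstances, UndecidableInstances #-}
{-# LANGUAGE ExistentialQuantification #-}

data Var a = Var String
data RHS a b = RHS a b

(&) :: a -> b -> RHS a b
(&) = RHS
infixr 5 &

class Action a b c | a b -> c, a c -> b
instance Action (Var x) (x -> y) y
instance Action Char y y
instance Action () y y
instance Action a b c => Action (RHS (Var x) a) (x -> b) c
instance Action a b c => Action (RHS Char a) b c

data Rule = forall a b c. Action a b c => Rule (Var c) a b

-- The operator is binary in syntax but takes three arguments:
--   lhs ==> rhs $ sem   parses as   ((==>) lhs rhs) $ sem
-- because (&) binds tighter than (==>), which binds tighter than ($).
(==>) :: Action a b c => Var c -> a -> b -> Rule
(==>) lhs rhs sem = Rule lhs rhs sem
infix 1 ==>

ruleLhs :: Rule -> String
ruleLhs (Rule (Var n) _ _) = n

num, total :: Var Int
num   = Var "num"
total = Var "total"

sumRule :: Rule
sumRule = total ==> num & '+' & total $ \x y -> x + y
```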

The formal definition of a context-free grammar includes four things: a set of variables, a set of terminals, a set of productions, and a special start variable. We’ve got three: variables are values of type Var a. Terminals are values of type Char. Productions are values of type Rule. Next, I need a start symbol. This is defined once, outside of the monadic environment in which rules are defined. At the same time, I throw away the result value of the monad, which is useless since I was just exploiting the syntax rather than building a real monad.
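Here is one way the pieces might fit together end to end. It is a sketch under several assumptions: the `Grammar` constructor, the `grammar` function, the accessors `startName` and `ruleCount`, the fixity of `==>`, and the tiny grammar itself are all invented for this example. Unlike the grammar described in this post, the non-terminals here carry explicit type signatures, which sidesteps type inference entirely.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}
{-# LANGUAGE FlexibleInstances, UndecidableInstances #-}
{-# LANGUAGE ExistentialQuantification #-}

data Var a = Var String
data RHS a b = RHS a b

(&) :: a -> b -> RHS a b
(&) = RHS
infixr 5 &

class Action a b c | a b -> c, a c -> b
instance Action (Var x) (x -> y) y
instance Action Char y y
instance Action () y y
instance Action a b c => Action (RHS (Var x) a) (x -> b) c
instance Action a b c => Action (RHS Char a) b c

data Rule = forall a b c. Action a b c => Rule (Var c) a b
data RuleSet x = RuleSet [Rule] x

instance Functor RuleSet where
  fmap f (RuleSet rs x) = RuleSet rs (f x)
instance Applicative RuleSet where
  pure = RuleSet []
  RuleSet rs f <*> RuleSet rs' x = RuleSet (rs ++ rs') (f x)
instance Monad RuleSet where
  RuleSet rs x >>= k = case k x of
    RuleSet rs' y -> RuleSet (rs ++ rs') y

(==>) :: Action a b c => Var c -> a -> b -> RuleSet ()
(==>) lhs rhs sem = RuleSet [Rule lhs rhs sem] ()
infix 1 ==>

-- Assumed shape: a start variable plus the collected productions.
data Grammar a = Grammar (Var a) [Rule]

-- Fix the start symbol and discard the monad's (useless) result value.
grammar :: Var a -> RuleSet x -> Grammar a
grammar start (RuleSet rs _) = Grammar start rs

bit, total :: Var Int
bit   = Var "bit"
total = Var "total"

-- total -> bit '+' total | bit ;  bit -> '0' | '1'
sumGrammar :: Grammar Int
sumGrammar = grammar total $ do
  total ==> bit & '+' & total $ (+)
  total ==> bit               $ id
  bit   ==> '0'               $ 0
  bit   ==> '1'               $ 1

startName :: Grammar a -> String
startName (Grammar (Var n) _) = n

ruleCount :: Grammar a -> Int
ruleCount (Grammar _ rs) = length rs
```

The result type of the grammar is the semantic type of its start variable, so `grammar total` can only produce a `Grammar Int` here.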

(This is a modified grammar I had lying around from a set of compiler course notes. It happens to have left recursion removed, but that’s immaterial really.) There are no type declarations in the entire grammar. Ask GHCi for the type of g, though, and it answers.

g :: (Fractional a) => Grammar a

It correctly inferred that the result type of expr must be Fractional. How? Because the third production for factmore uses the / operator. This means that factmore must be Fractional, and the type ripples upward all the way to the start variable of expr.

The only thing I don’t like at the moment is the need to use an explicit monomorphic binding (the parentheses) to declare non-terminals. If that’s not done, then the compiler thinks non-terminals can have different result types when used in different places, and the types it infers tend to be several pages long! A nice solution to that would be good, but I’m happy with everything else!
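The underlying issue can be reproduced in a few lines. This sketch is my own illustration rather than the binding trick described above: it contrasts a `let`-bound variable, which gets generalized, with a lambda-bound one, which does not. The names `polyUse`, `monoUse`, and `varName` are hypothetical.

```haskell
data Var a = Var String

varName :: Var a -> String
varName (Var n) = n

-- A let-bound variable with no signature is generalized to
-- forall a. Var a, so the same "non-terminal" is accepted at two
-- unrelated semantic types.
polyUse :: (Var Int, Var Bool)
polyUse = let v = Var "v" in (v, v)

-- A lambda-bound variable is never generalized, so both uses must share
-- a single type. Giving this expression the type (Var Int, Var Bool)
-- would be rejected by the type checker.
monoUse :: (Var Int, Var Int)
monoUse = (\v -> (v, v)) (Var "v")
```

Nothing ties the two uses in `polyUse` together, which is the kind of freedom that makes the inferred grammar types blow up.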