Things that amuse me

Wednesday, June 10, 2009

More LLVM

Recently someone asked me on #haskell if you could use the Haskell LLVM bindings to compile some abstract syntax to a Haskell function. Naturally I said yes, but then I realized I had only done it for a boring language with just one type. I had no doubt that it could be done for more complicated languages with multiple types, but it might not be totally obvious how. So I decided to write a simple compiler, and this blog post is the result. First, a simple example:

What has happened is that the string has been parsed to an abstract syntax tree, translated into LLVM code, then to machine code, and finally turned back into a Haskell-callable function. Many things can go wrong along the way, like syntax and type errors, so compile returns a Maybe type to indicate whether things went right or wrong. (A more serious version of the compile function would return an error message when something has gone wrong.) The definition of the compilation function is simple and illustrates the flow of the compiler.

The context Translate is there to limit the types that can actually be translated; it's a necessary evil, and exactly what types are allowed depends on how advanced we make the compiler. Had we ignored the Maybe type the definition would have been

compile = translate . toTFun . mParseUFun

which says, first parse to the type UFun (untyped expressions), then type check and turn it into the type TFun a, and finally translate TFun a into an a by LLVM compilation. Let's see how this all works.

The UExp module

The first step is to just define an abstract syntax for the expressions that we want to handle. I'm only allowing leading lambdas (this is a very first-order language), so there's a distinction between the top level UFun type and the expression type UExp. The U prefix indicates that this version of the syntax is not yet type checked. The definition is pretty boring, but here it is:
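A minimal sketch of such a definition (the exact constructors and the shape of UTyp here are illustrative, filled in from how the expressions are used later in the text):

```haskell
type Id = String

-- Source-level types for the parameters (sketch; base types only).
data UTyp = UTDbl | UTBol
    deriving (Show, Eq)

-- A top-level function: leading lambdas only, so just a list of
-- typed parameters and a body expression.
data UFun = UFun [(Id, UTyp)] UExp
    deriving (Show, Eq)

-- Untyped expressions.
data UExp
    = UDbl Double        -- double literal
    | UBol Bool          -- boolean literal
    | UVar Id            -- variable
    | UApp Id [UExp]     -- primitive application, e.g. UApp "+" [a, b]
    | ULet Id UExp UExp  -- local binding
    | UIf UExp UExp UExp -- conditional
    deriving (Show, Eq)
```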

Naturally, we want to be able to show the expressions, if nothing else then for debugging. So I make a Show instance that shows them in a nice way, respecting operator precedences etc. There's nothing exciting going on; the large number of lines is just to cover operator printing.

We also want to be able to parse, so I'm using Parsec to parse the string and produce an AST. Again, there's nothing interesting going on. I use the Haskell lexical analysis provided by Parsec. This is available as a TokenParser record, which can be conveniently opened with the RecordWildCards notation TokenParser{..}.
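For illustration, here is a tiny cut-down parser in that style (pAtom, pExpr, and parseUExp are made-up names, and only + and a fragment of the syntax are handled; the point is the TokenParser{..} idiom):

```haskell
{-# LANGUAGE RecordWildCards #-}
import Text.Parsec
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as T
import Text.Parsec.Language (haskellDef)

-- Haskell-style lexing from Parsec; the RecordWildCards pattern binding
-- brings identifier, reservedOp, float, parens, whiteSpace, ... into
-- scope as ordinary names.
T.TokenParser{..} = T.makeTokenParser haskellDef

data UExp = UDbl Double | UVar String | UApp String [UExp]
    deriving (Show, Eq)

pAtom :: Parser UExp
pAtom = fmap UDbl (try float)
    <|> fmap UVar identifier
    <|> parens pExpr

pExpr :: Parser UExp
pExpr = chainl1 pAtom (reservedOp "+" >> return (\a b -> UApp "+" [a, b]))

parseUExp :: String -> Either ParseError UExp
parseUExp = parse (whiteSpace *> pExpr <* eof) ""
```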

The parser is packaged up in mParseUFun which returns an AST if it all worked.

The TExp module

Since the LLVM API is typed, it's much easier to translate a typed abstract syntax tree than an untyped abstract syntax tree. The TExp module contains the definition of the typed AST and the type checker that converts to it. There are many ways to formulate type safe abstract syntax trees. I've chosen to use GADTs. I've also chosen to represent variables (still) by identifiers, which means that the syntax tree is not necessarily type safe. Furthermore, I've chosen a very limited way to represent function application since this is all I need for this example. The variations on this are endless.

So for instance, UApp "+" [UVar "x", UDbl 2.2] will be represented by TDblOp DAdd (TVar "x") (TDbl 2.2) which has type TExp Double. So the type of the expression is now accurately reflected in the type of the syntax tree. Even the UTyp type now has a typed equivalent where the real type is reflected. For completeness, here's some code for pretty printing etc.

It might look funny that the existential type is written with a forall, but it makes sense when looking at the type of the constructor function (but not when doing pattern matching). Now we can attempt a couple of cases of the type checker:
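A self-contained sketch of those first cases, with minimal versions of TExp, TTyp, and the existential ATExp (the real module has more constructors; the literal cases are the easy ones, since a literal carries its own type):

```haskell
{-# LANGUAGE GADTs, ExistentialQuantification #-}

-- Typed types: the index is the real Haskell type.
data TTyp a where
    TTDbl :: TTyp Double
    TTBol :: TTyp Bool

-- Typed expressions (a fragment).
data TExp a where
    TDbl :: Double -> TExp Double
    TBol :: Bool   -> TExp Bool

-- Existential wrapper: an expression of *some* type,
-- together with evidence of what that type is.
data ATExp = forall a . ATExp (TExp a) (TTyp a)

-- Untyped source fragment.
data UExp = UDbl Double | UBol Bool

-- The easy cases of the type checker: literals.
typeCheckExp :: UExp -> Maybe ATExp
typeCheckExp (UDbl d) = return (ATExp (TDbl d) TTDbl)
typeCheckExp (UBol b) = return (ATExp (TBol b) TTBol)
```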

They look quite nice, and they actually work. So what about something more complicated, like arithmetic?

typeCheckExp (UApp op [a, b]) | Just dop <- lookup op dblOps = do
    ATExp a' TTDbl <- typeCheckExp a
    ATExp b' TTDbl <- typeCheckExp b
    return (ATExp (TDblOp dop a' b') TTDbl)

First we conveniently look up the operator among the arithmetic operators, then we recursively call the type checker for the operands. We do this in the Maybe monad. If the type checking of a subterm fails, that's automatically propagated; furthermore, if the type checking of a subterm does not yield a TTDbl type then the pattern matching fails, and this generates a Nothing in the Maybe monad. So we use failing pattern matches to our advantage here.
The interesting case is checking UIf, because here both arms have to have the same type, but we don't know which one. Here's an attempt:
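Sketched out, the attempt might read like this (this is the version GHC rejects, for the reasons explained below):

```haskell
typeCheckExp (UIf c t e) = do
    ATExp c' TTBol <- typeCheckExp c
    ATExp t' tt <- typeCheckExp t
    ATExp e' te <- typeCheckExp e
    guard (tt == te)  -- we know the two arms now agree...
    -- ...but GHC doesn't: t' and e' still have different type variables
    return (ATExp (TIf c' t' e') tt)
```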

But this doesn't type check. The guard ensures that the two arms have the same type, but that's something we know and the Haskell type checker doesn't. So it rejects the TIf, because it can't see that both arms have the same type. We need to be trickier in doing the equality test so that it reflects the equality on the type level. There's a standard trick for this, namely this type:

data Equal a b where
    Eq :: Equal a a

If you ever have a value (which must be Eq) of type Equal foo bar then the type checker will know that foo and bar are actually the same type. So let's code equality for TTyp.
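A sketch of that equality, restating the types it needs so it stands alone (the TTArr clause for function types is illustrative):

```haskell
{-# LANGUAGE GADTs #-}

data TTyp a where
    TTDbl :: TTyp Double
    TTBol :: TTyp Bool
    TTArr :: TTyp a -> TTyp b -> TTyp (a -> b)

data Equal a b where
    Eq :: Equal a a

-- If this returns Just Eq, the type checker learns that a and b
-- are in fact the same type.
test :: TTyp a -> TTyp b -> Maybe (Equal a b)
test TTBol TTBol = return Eq
test TTDbl TTDbl = return Eq
test (TTArr a b) (TTArr a' b') = do
    Eq <- test a a'
    Eq <- test b b'
    return Eq
test _ _ = Nothing
```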

This code is worth pondering for a while; it's actually rather clever (I take no credit for it; I stole it from Tim Sheard). Why does even the first clause type check? Because TTBol has type TTyp Bool, both the type variables (a and b) must be Bool in the first clause, which means that Eq :: Equal Bool Bool is what we're returning.
Equipped with this equality we can try type checking again.
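A self-contained sketch of the repaired UIf case (minimal versions of the types are included so it stands alone; the do-block pattern Eq <- test tt te is what hands GHC the evidence that the arms share a type):

```haskell
{-# LANGUAGE GADTs, ExistentialQuantification #-}

data UExp = UDbl Double | UBol Bool | UIf UExp UExp UExp

data TTyp a where
    TTDbl :: TTyp Double
    TTBol :: TTyp Bool

data TExp a where
    TDbl :: Double -> TExp Double
    TBol :: Bool   -> TExp Bool
    TIf  :: TExp Bool -> TExp a -> TExp a -> TExp a

data ATExp = forall a . ATExp (TExp a) (TTyp a)

data Equal a b where
    Eq :: Equal a a

test :: TTyp a -> TTyp b -> Maybe (Equal a b)
test TTBol TTBol = Just Eq
test TTDbl TTDbl = Just Eq
test _ _ = Nothing

typeCheckExp :: UExp -> Maybe ATExp
typeCheckExp (UDbl d) = return (ATExp (TDbl d) TTDbl)
typeCheckExp (UBol b) = return (ATExp (TBol b) TTBol)
typeCheckExp (UIf c t e) = do
    ATExp c' TTBol <- typeCheckExp c
    ATExp t' tt <- typeCheckExp t
    ATExp e' te <- typeCheckExp e
    Eq <- test tt te          -- now GHC knows the arms share a type
    return (ATExp (TIf c' t' e') tt)
```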

And amazingly this actually works! (A tribute to the hard working ghc implementors.)
One (rather large) fly is left in the ointment: what about variables? What do we do when we type check UVar? We must check that there's a bound variable with the right type around. So the type checker needs to be extended with an environment where variables can be looked up. It's mostly straightforward. The environment simply maps a variable to an ATExp. So here's the complete type checker as it's actually defined.

Note the ULet case which extends the environment. First we type check the expression that's being bound, and then add a variable to the environment and type check the body.
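A sketch of that environment plumbing for the UVar and ULet cases, assuming a simple association-list environment (Env and the TLet/TVar details here are illustrative, restated minimally so the sketch stands alone):

```haskell
{-# LANGUAGE GADTs, ExistentialQuantification #-}

type Id = String

data UExp = UDbl Double | UVar Id | ULet Id UExp UExp

data TTyp a where
    TTDbl :: TTyp Double

data TExp a where
    TDbl :: Double -> TExp Double
    TVar :: Id -> TExp a                     -- variables are still named
    TLet :: Id -> TExp b -> TExp a -> TExp a

data ATExp = forall a . ATExp (TExp a) (TTyp a)

-- The environment maps a variable to an ATExp.
type Env = [(Id, ATExp)]

typeCheckExp :: Env -> UExp -> Maybe ATExp
typeCheckExp _   (UDbl d) = return (ATExp (TDbl d) TTDbl)
typeCheckExp env (UVar v) = lookup v env     -- fails if unbound
typeCheckExp env (ULet i e b) = do
    ATExp e' t  <- typeCheckExp env e        -- type of the bound expression
    ATExp b' tb <- typeCheckExp ((i, ATExp (TVar i) t) : env) b
    return (ATExp (TLet i e' b') tb)
```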
Finally we need to type check the top level:

The class Type allows us to construct the TTyp corresponding to a Haskell type via overloading. Using this and the test function we can then extract a TFun at any type we like. If we try to extract at the wrong type we'll just get Nothing and at the right type we get Just.
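A sketch of what such a class might look like (theType and describe are assumed names; the idea is that instance resolution builds the TTyp value that mirrors the Haskell type):

```haskell
{-# LANGUAGE GADTs #-}

data TTyp a where
    TTDbl :: TTyp Double
    TTBol :: TTyp Bool
    TTArr :: TTyp a -> TTyp b -> TTyp (a -> b)

-- Recover a TTyp from a Haskell type by overloading.
class Type a where
    theType :: TTyp a

instance Type Double where
    theType = TTDbl

instance Type Bool where
    theType = TTBol

instance (Type a, Type b) => Type (a -> b) where
    theType = TTArr theType theType

-- Render a TTyp, to see which type was reconstructed.
describe :: TTyp a -> String
describe TTDbl = "Double"
describe TTBol = "Bool"
describe (TTArr a b) = "(" ++ describe a ++ " -> " ++ describe b ++ ")"
```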

The Compiler module

Now all we need to do is to write a function translate that translates a TFun a into the corresponding a. Naturally, using LLVM.
Let's start with some simple cases in translating literals to LLVM code.
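Roughly, against the Haskell llvm bindings, the literal cases could read like this (CodeGenFunction and Value are types from those bindings; treat the exact signatures as assumptions, and note this sketch is not runnable without the bindings installed):

```haskell
-- Sketch against the Haskell llvm bindings (API details are assumptions).
-- valueOf lifts a constant Haskell value into an LLVM Value.
compileExp :: TExp a -> CodeGenFunction r (Value a)
compileExp (TDbl d) = return (valueOf d)
compileExp (TBol b) = return (valueOf b)
```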

The valueOf function is simply the one that lifts a Haskell value into an LLVM value. Note how nicely the GADT works out here: we handle both Double and Bool without any need to compromise type safety.
What about arithmetic? Equally easy.

(The && and || are not short circuiting in this implementation. It would be easy to change.)
It's rather amazing that despite these different branches producing and consuming different types it all works out. It's perfectly type safe and free from coercions. This is the beauty of GADTs.
Oh, yeah, mkIf. It's just a bit of a mess to create some basic blocks, test, and jump.
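A sketch of how mkIf could be structured with the bindings' basic-block primitives (newBasicBlock, defineBasicBlock, condBr, br, phi; treat the exact API and the constraint as assumptions):

```haskell
-- Sketch (llvm bindings API; details are assumptions): branch on the
-- condition into one of two fresh blocks, run each arm's code there,
-- and join the two results with a phi node.
mkIf :: IsFirstClass a
     => Value Bool
     -> CodeGenFunction r (Value a)   -- then-arm code
     -> CodeGenFunction r (Value a)   -- else-arm code
     -> CodeGenFunction r (Value a)
mkIf b tCode eCode = do
    tBlk <- newBasicBlock
    eBlk <- newBasicBlock
    jBlk <- newBasicBlock
    condBr b tBlk eBlk
    defineBasicBlock tBlk
    tVal <- tCode
    br jBlk
    defineBasicBlock eBlk
    eVal <- eCode
    br jBlk
    defineBasicBlock jBlk
    phi [(tVal, tBlk), (eVal, eBlk)]
```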

Exactly as for the type checking environment, we stick the code generation in an environment, and use castAValue to project it out of the existential container. The fromJust call in the TVar case cannot fail on type checked code, but with my string based variable representation I have no evidence of this in the TExp, so there's actually a cast in the variable case that can fail if scope and type checking have not been performed. The compileExp' function is there to placate the type checker and help it with some evidence that we are only binding base values.
The rest of the code generation module is just housekeeping. It's a little ugly, but not terrible.

The verbosity and large number of type functions in this section has convinced me that I need to simplify some of the types and classes involved in the LLVM code generation.
To convert an LLVM module we call the JIT. This produces a function that returns a value in the IO monad (to be on the safe side), so we need to get rid of the IO; finally we can get rid of the top level IO, because externally what we are doing is really pure (in some sense).

Conclusion

And that concludes the three parts of the compiler. In about 400 lines of code we can compile a small subset of Haskell expressions to (efficient) machine code. After type checking, the rest of the processing is done in a type safe manner (except for a cast in TVar), which is the intention of the high level LLVM interface.
Oh, and if you instrument the code generator a little you can peek at the machine code being produced. For instance, for this input to compile