Sunday, April 24, 2016

The title of this post is a play on the Lisp aphorism: "Code is Data". In the Lisp world everything is data; code is just another data structure that you can manipulate and transform.

However, you can also go to the exact opposite extreme: "Data is Code"! You can make everything into code and implement data structures in terms of code.

You might wonder what that even means: how can you write any code if you don't have any primitive data structures to operate on? Fascinatingly, Alonzo Church discovered a long time ago that if you have the ability to define functions you have a complete programming language. "Church encoding" is the technique named after his insight that you could transform data structures into functions.

This post is partly a Church encoding tutorial and partly an announcement for my newly released annah compiler, which implements the Church encoding of data types. Many of the examples in this post are valid annah code that you can play with. Also, to be totally pedantic, annah implements Boehm-Berarducci encoding, which you can think of as the typed version of Church encoding.

This post assumes that you have basic familiarity with lambda expressions. If you do not, you can read the first chapter (freely available) of Haskell Programming from First Principles, which does an excellent job of teaching lambda calculus.

If you would like to follow along with these examples, you can download and install annah by following these steps:

Lambda calculus

In the untyped lambda calculus, you only have lambda expressions at your disposal and nothing else. For example, here is how you encode the identity function:

λx → x

That's a function that takes one argument and returns the same argument as its result.

We call this "abstraction" when we introduce a variable using the Greek lambda symbol and we call the variable that we introduce a "bound variable". We can then use that "bound variable" anywhere within the "body" of the lambda expression.

Any expression that begins with a lambda is an anonymous function which we can apply to another expression. For example, we can apply the identity function to itself like this:

(λx → x) (λy → y)

-- β-reduction
= λy → y

We call this "application" when we supply an argument to an anonymous function.

We can define a function of multiple arguments by nested "abstractions":

λx → λy → x

The above code is an anonymous function that returns an anonymous function. For example, if you apply the outermost anonymous function to a value, you get a new function:

(λx → λy → x) 1

-- β-reduce
λy → 1

... and if you apply the lambda expression to two values, you return the first value:

(λx → λy → x) 1 2

-- β-reduce
(λy → 1) 2

-- β-reduce
1

So our lambda expression behaves like a function of two arguments, even though it's really a function of one argument that returns a new function of one argument. We call this "currying" when we simulate functions of multiple arguments using functions of one argument. We will use this trick because we will be programming in a lambda calculus that only supports functions of one argument.
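The untyped terms above translate almost directly into Python lambdas. This is only a sketch for experimenting outside of annah; the names `identity`, `const`, and `partially_applied` are illustrative and do not come from the original examples.

```python
# Untyped lambda calculus terms written as Python lambdas.

# λx → x
identity = lambda x: x

# λx → λy → x  (a curried function of "two arguments")
const = lambda x: lambda y: x

# Applying the identity function to itself gives back the identity function.
assert identity(identity) is identity

# Supplying one argument returns a new one-argument function: λy → 1
partially_applied = const(1)

# Supplying both arguments, one at a time, returns the first value.
result = const(1)(2)
print(result)  # prints 1
```

Note that `const(1)(2)` is two separate applications, which mirrors the two β-reduction steps shown above.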

Typed lambda calculus

In the typed lambda calculus you have to specify the types of all function arguments, so you have to write something like this:

λ(x : a) → x

... where a is the type of the bound variable named x.

However, the above function is still not valid because we haven't specified what the type a is. In theory, we could specify a type like Int:

λ(x : Int) → x

... but the premise of this post was that we could program without relying on any built-in data types so Int is out of the question for this experiment.

Fortunately, some typed variations of lambda calculus (most notably: "System F") let you introduce the type named a as yet another function argument:

λ(a : *) → λ(x : a) → x

This is called "type abstraction". Here the * is the "type of types" and is a universal constant that is always in scope, so we can always introduce new types as function arguments this way.

The above function is the "polymorphic identity function", meaning that this is the typed version of the identity function that still preserves the ability to operate on any type.

If we had built-in types like Int we could apply our polymorphic function to the type just like any other argument, giving back an identity function for a specific type:

(λ(a : *) → λ(x : a) → x) Int

-- β-reduction
λ(x : Int) → x

This is called "type application" or (more commonly) "specialization". A "polymorphic" function is a function that takes a type as a function argument and we "specialize" a polymorphic function by applying the function to a specific type argument.

However, we are forgoing built-in types like Int, so what other types do we have at our disposal?

Well, every lambda expression has a corresponding type. For example, the type of our polymorphic identity function is:

∀(a : *) → ∀(x : a) → a

You can read the type as saying:

this is a function of two arguments, one argument per "forall" (∀) symbol

the first argument is named a and a is a type

the second argument is named x and the type of x is a

the result of the function must be a value of type a

This type uniquely determines the function's implementation. To be totally pedantic, there is exactly one implementation up to extensional equality of functions. Since this function has to work for any possible type a there is only one way to implement the function. We must return x as the result, since x is the only value available of type a.

Passing around types as values and function arguments might seem a bit strange to most programmers since most languages either:

they use a different syntax for type abstraction/application versus ordinary abstraction and application

Example: Scala

// The polymorphic identity function in Scala
def id[A](x: A): A = x

// Example use of the function
// Note: Scala lets you omit the `[Boolean]` here thanks
// to type inference but I'm making the type
// application explicit just to illustrate that
// the syntax is different from normal function
// application
id[Boolean](true)

For the purpose of this post we will program with explicit type abstraction and type application so that there is no magic or hidden machinery.

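So, for example, suppose that we wanted to apply the typed, polymorphic identity function to itself. The untyped version was this:

(λx → x) (λy → y)

... and the typed version, with every type abstraction and type application written out, is this:

(λ(a : *) → λ(x : a) → x)
    (∀(b : *) → ∀(y : b) → b)
    (λ(b : *) → λ(y : b) → y)

-- β-reduction
= (λ(x : ∀(b : *) → ∀(y : b) → b) → x)
    (λ(b : *) → λ(y : b) → y)

-- β-reduction
= λ(b : *) → λ(y : b) → y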

So we can still apply the identity function to itself, but it's much more verbose. Languages with type inference automate this sort of tedious work for you while still giving you the safety guarantees of types. For example, in Haskell you would just write:

(\x -> x) (\y -> y)

... and the compiler would figure out all the type abstractions and type applications for you.

Exercise: Haskell provides a const function defined like this:

const :: a -> b -> a
const x y = x

Translate the const function to a typed and polymorphic lambda expression in System F (i.e. using explicit type abstractions).

Boolean values

Lambda expressions are the "code", so now we need to create "data" from "code".

One of the simplest pieces of data is a boolean value, which we can encode using typed lambda expressions. For example, here is how you implement the value True:

λ(Bool : *) → λ(True : Bool) → λ(False : Bool) → True

Note that the names have no significance at all. I could have equally well written the expression as:

λ(a : *) → λ(x : a) → λ(y : a) → x

... which is "α-equivalent" to the previous version (i.e. equivalent up to renaming of variables).

We will save the above expression to a file named ./True in our current directory. We'll see why shortly.

We are saving these terms and types to files because we can use the annah compiler to work with any lambda expression or type saved as a file. For example, I can use the annah compiler to verify that the file ./True has type ./Bool:

If the expression type-checks then annah will just compile the expression to lambda calculus (by removing the unnecessary type annotation in this case) and return a zero exit code. However, if the expression does not type-check:

annah does not evaluate the expression. annah only translates the expression into Morte code (and the expression is already valid Morte code) and type-checks the expression. If you want to evaluate the expression you need to run the expression through the morte compiler, too:

The above sequence of steps is a white lie: the true order of steps is actually different, but equivalent.

The ./if function was not even necessary because every value of type ./Bool is already a "pre-formed if expression". That's why ./if is just the identity function on ./Bools. You can delete the ./if from the above example and the code will still work.

We started with nothing but lambda expressions, but still managed to implement:

a ./Bool type

a ./True value of type ./Bool

a ./False value of type ./Bool

./if, ./not, ./and, and ./or functions

... and we can do real computation with them! In other words, we've modeled boolean data types entirely as code.
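To get a feel for these booleans outside of annah, here is a rough Python sketch of the same idea. Python is untyped, so the type argument is simply dropped, and the definitions of `not_`, `and_`, and `or_` are one plausible way to write these functions rather than a copy of the annah sources. The `to_bool` helper is an addition for display purposes only.

```python
# Church-encoded booleans with the type argument erased (Python is untyped).
# A boolean takes a "True branch" and a "False branch" and returns one of them.
true = lambda t: lambda f: t
false = lambda t: lambda f: f

# `if` is the identity function: a boolean already selects a branch.
if_ = lambda b: b

# Each function picks its result by handing two branches to a boolean.
not_ = lambda b: b(false)(true)
and_ = lambda x: lambda y: x(y)(false)
or_ = lambda x: lambda y: x(true)(y)

# Helper (not part of the encoding) that converts to a native bool.
to_bool = lambda b: b(True)(False)

print(to_bool(and_(true)(not_(false))))        # prints True
print(if_(or_(false)(false))("then")("else"))  # prints else
```

Deleting `if_` from the last line leaves the result unchanged, which is the "pre-formed if expression" observation in action. One difference from lambda calculus is that Python evaluates both branches eagerly before the boolean selects one.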

Exercise: Implement an xor function

Natural numbers

You might wonder what other data types you can implement in terms of lambda calculus. Fortunately, you don't have to wonder because the annah compiler will actually compile data type definitions to lambda expressions for you.

For example, suppose we want to define a natural number type encoded using Peano numerals. We can write:

All Boehm-Berarducci-encoded datatypes are encoded as substitution functions, including ./Nat. Any value of ./Nat is a function that takes three arguments that we will substitute into our natural number expression:

The first argument replaces every occurrence of the Nat type

The second argument replaces every occurrence of the Succ constructor

The third argument replaces every occurrence of the Zero constructor

This will make more sense if we walk through a specific example. First, we will build the number 3 using the ./Nat/Succ and ./Nat/Zero constructors:

Now suppose that we want to compute whether or not our natural number is even. The only catch is that we must limit ourselves to substitution when computing even. We have to figure out something that we can substitute in place of the Succ constructors and something that we can substitute in place of the Zero constructors that will then evaluate to ./True if the natural number is even and ./False otherwise.

One substitution that works is the following:

Replace every Zero with ./True (because Zero is even)

Replace every Succ with ./not (because Succ alternates between even and odd)

So in other words, if we began with this:

./Nat/Succ (./Nat/Succ (./Nat/Succ ./Nat/Zero ))

... and we substitute ./Nat/Succ with ./not and substitute ./Nat/Zero with ./True:

Now the next two arguments have exactly the right type for us to substitute in ./not and ./True. The argument named Succ is now a function of type ∀(pred : ./Bool ) → ./Bool, which is the same type as ./not. The argument named Zero is now a value of type ./Bool, which is the same type as ./True. This means that we can proceed with the next two arguments:

So we can encode natural numbers in lambda calculus, albeit very inefficiently! There are some tricks that we can use to greatly speed up both the time complexity and constant factors, but it will never be competitive with machine arithmetic. This is more of a proof of concept that you can model arithmetic purely in code.
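The substitution view of natural numbers can also be sketched in Python. As with any untyped rendering, the first (type) argument is dropped, so each number here takes only two arguments: the replacement for Succ and the replacement for Zero. All names below are illustrative, and `to_int` and `to_bool` are helpers added only to inspect results.

```python
# Church booleans (type argument erased).
true = lambda t: lambda f: t
false = lambda t: lambda f: f
not_ = lambda b: b(false)(true)

# Natural numbers as substitution functions:
# s replaces every Succ, z replaces every Zero.
zero = lambda s: lambda z: z
succ = lambda n: lambda s: lambda z: s(n(s)(z))

# Succ (Succ (Succ Zero))
three = succ(succ(succ(zero)))

# Replace every Succ with not_ and every Zero with true.
even = lambda n: n(not_)(true)

# Helpers (not part of the encoding) for inspecting results.
to_int = lambda n: n(lambda k: k + 1)(0)
to_bool = lambda b: b(True)(False)

print(to_int(three))         # prints 3
print(to_bool(even(three)))  # prints False
```

Here `even(three)` unfolds to `not_(not_(not_(true)))`, which is the same substitution carried out by hand above.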

Exercise: Implement a function which multiplies two natural numbers

Data types

annah also lets you define "temporary" data types that scope over a given expression. In fact, that's how Nat was implemented. You can look at the corresponding *.annah files to see how each type and term is defined in annah before conversion to morte code.

The first four lines are identical to what we wrote when we invoked the annah types command from the command line. We can use the exact same data type specification to create a scoped expression that can reference the type and data constructors we specified.

You can use these scoped datatype declarations to quickly check how various datatypes are encoded without polluting your current working directory. For example, I can ask annah how the type Maybe is encoded in lambda calculus:

A Maybe value is just another substitution function. You provide one branch that you substitute for Just and another branch that you substitute for Nothing. For example, the Just constructor always substitutes in the first branch and ignores the Nothing branch that you supply:
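The same behaviour can be sketched in Python with the type arguments erased. The parameter names `on_just` and `on_nothing` and the `with_default` helper are hypothetical and chosen only for this illustration; they are not annah output.

```python
# Maybe as a substitution function: the first branch replaces Just,
# the second branch replaces Nothing.
just = lambda x: lambda on_just: lambda on_nothing: on_just(x)
nothing = lambda on_just: lambda on_nothing: on_nothing

# Extract the wrapped value, or fall back to a default if there is none.
with_default = lambda default: lambda m: m(lambda x: x)(default)

print(with_default(0)(just(42)))  # prints 42
print(with_default(0)(nothing))   # prints 0
```

`just(42)` ignores whatever is supplied for the Nothing branch, while `nothing` ignores the Just branch and returns the other one unchanged.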

... but that doesn't really tell us anything about how annah desugars let because we only see the final evaluated result. We can ask annah to desugar without performing any other transformations using the annah desugar command:

Also, every expression has a corresponding *.annah file that documents the expression's type using a let expression. For example, we can see the type of the ./List/(++) function by studying the ./List/(++).annah file:

Conclusions

A lot of people underestimate how much you can do in a total lambda calculus. I don't recommend pure lambda calculus as a general programming language, but I could see a lambda calculus enriched with high-efficiency primitives as a realistic starting point for simple functional languages that are easy to port and distribute.

One of the projects I'm working towards in the long run is a "JSON for code" and annah is one step along the way towards that goal. annah will likely not be that language, but I still factored out annah as a separate and reusable project so that others could fork it and experiment with their own language designs.

Also, as far as I can tell annah is the only project in the wild that actually implements the Boehm-Berarducci encoding outlined in this paper:

The Annah tutorial covers the Annah language and compiler in much more detail than this post, so if you would like to learn more I highly recommend reading it. It walks through the compiler, desugaring, and the Prelude: