Monday, March 22, 2010

Learning Clojure by writing a (very) minimal Lisp interpreter

I wanted to learn a bit of Clojure, but I only had a couple of Emacs Lisp notions.. So I thought it would be fun to try writing a mini Lisp interpreter, in Clojure (using Emacs, to add a level of self-reference!) and document the process. But first please consider (1) that I tried to not consult any book or website about parsing or interpreting Lisp for this exercise (other than my memories of some vague CS notions), and (2) that I was not, and have not become a Lisp expert in any way, so it is pretty obvious that this will contain blatant errors, lack of stylistic taste, grossly non-optimized algorithms and wrong (or abuse of) Lisp terminology.. Please read it only for the sake of the learning process of someone new to Lisp/CLojure, but very enthusiastic about it.
I'll be using clojure-contrib version 1.2 (from the current master branch), since its "string" API provides some very useful functions (that seem to not be available in the 1.1 version):

(require '[clojure.contrib.string :as str])

If we assume that we will be processing a mini Lisp source file as one big string, one first thing we can do is stripping the comments from it, using a simple regular expression:

(defnstrip-comments [s]
(str/join "" (str/split #";.*\n?" s)))

We can then tokenize the input string, by giving the partition function (from the "string" API) a regular expression that chunks the input file content using parentheses and white space separators (while not breaking string literals, which may contain such characters). Note that one important feature of the partition function (that we will make use of) is that it retains the separators in the result list, which is why we need to trim and filter it, in order to remove the unwanted empty strings:

Being done with the preprocessing, we are now ready to parse our list of tokens, that is, build a syntax tree representation of the input program, in which every atom is a leaf, and every list is a subtree. Thanks to the particular Lisp syntax, this is relatively easy.. although I must admit that I didn't find the solution as easily as this sounds:

We now need to extract symbols from this syntax tree: functions, defined with the defun special form (remember this is a Lisp interpreter, not a Clojure one), and variables, defined with setq (setf would be much harder to implement I think, and is thus outside the scope of this simple interpreter). Each symbol will be stored in a Clojure map, and will point to a Clojure vector of two elements in the case of variables (a "var" identifier, and the variable node, as found in the syntax tree) and three elements in the case of functions (a "function" identifier, the list of function parameters, and a list of the function nodes, as found in the syntax tree). All this is accomplished with this function:

When it reaches a function node, that is, a node that should do something, this is where we inject semantics (admittedly, quite a simple one) into our interpreter, that was concerned until now only with the "what" to compute, and not with the "how" to do it. On a side note, I would say that this is the place where I reached the "relevance limit" of this learning project: in particular, lists are being implemented.. with lists! I guess that if I would have gone the hardcore way, I would have implemented my own lists from lower-level components, but to keep things simple, I figured this would make an acceptable place to stop (obviously a much sillier place to stop would have been to simply "eval" the whole thing in the first place!).. Anyway, here is the deeper function:

For user-defined functions, i.e. symbols that can be found in our previously extracted symbol table, we require an extra step. We need to "instantiate" the function, that is, replace all the formal parameter symbols in its definition by their actual value when the function is called. This is first accomplished by creating a map that associates every symbol to their value (using the "zipmap" function), and then pass it to this function, that will do the rest of the job:

One thing I really wanted to be able to do, and was a bit anxious about because I didn't dare to "execute it in my head" before trying it.. was recursion, which I was happily surprised to see handled gracefully: