super secret hq

Parsing McCarthy's S-Expressions

September 30, 2012

This article builds a top-down, recursive-descent parser for simple s-expressions, the notation LISP uses to represent both data and code. We’re not going to build a full-blown LISP interpreter here, just the parser.

I’m basing this grammar on the LISP 1.5 Programmer’s Manual by McCarthy, Abrahams, Edwards, Hart, and Levin. The grammar accepts both dotted-pair s-expressions and list notation. An s-expression is the basic form in LISP: it is either an atom or a dotted pair of s-expressions. Here are some examples:

A
(A . B)
(A . (B . C))

List notation simplifies that by removing the dots and extra parentheses, so (A B C) is shorthand for (A . (B . (C . NIL))). The manual also describes a comma-separated version of list notation (e.g., (A, B, C)), but we won’t support that here.

A
(A B)
(A B C)
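To make the relationship between the two notations concrete, here is a minimal Python sketch that expands list notation into dotted pairs. The function name `to_dotted` and the choice to model atoms as Python strings and lists as Python lists are assumptions made for illustration only.

```python
def to_dotted(expr):
    """Render a nested Python list in dotted-pair notation.

    Atoms are modeled as plain strings; a Python list stands in for
    LISP list notation. Both choices are illustrative assumptions.
    """
    if isinstance(expr, str):
        return expr
    result = "NIL"
    # Build from the right: (A B) is (A . (B . NIL)).
    for item in reversed(expr):
        result = f"({to_dotted(item)} . {result})"
    return result


print(to_dotted(["A", "B", "C"]))  # (A . (B . (C . NIL)))
```

The empty list comes out as NIL, which is how the list terminator shows up once the dots are written back in.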

To simplify some of the parser code, we’ll assume the input always begins and ends with a sentinel token. We’ll call these tokens “BOF” and “EOF.” The grammar with those tokens becomes:
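One grammar that fits this description is sketched below as EBNF comments in a short Python snippet. The production names, the string constants `BOF` and `EOF`, and the helper `with_sentinels` are assumptions for illustration; they are not taken from the manual.

```python
# Assumed grammar, written as EBNF in comments (a reconstruction for
# illustration, not the manual's wording):
#
#   input ::= BOF sexpr EOF
#   sexpr ::= atom
#           | "(" sexpr "." sexpr ")"
#           | "(" sexpr* ")"
#   atom  ::= NUMBER | TEXT

BOF = "BOF"  # hypothetical sentinel marking the start of the input
EOF = "EOF"  # hypothetical sentinel marking the end of the input


def with_sentinels(tokens):
    """Return a new token list wrapped in BOF ... EOF."""
    return [BOF, *tokens, EOF]


print(with_sentinels(["(", "A", ".", "B", ")"]))
```

With the sentinels in place, the parser can always look at the current token without first checking whether it has run off either end of the input.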

The Lexer

Our lexer assumes that the entire input is passed as a single buffer. This simplifies the lexer because we don’t have to worry about tokens spanning a buffer boundary. The cost is memory: the whole input has to be held at once.
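A single-buffer lexer along these lines might look like the following Python sketch. The token kind names (`LPAREN`, `RPAREN`, `DOT`, `NUMBER`, `TEXT`), the integer-only number rule, and the `(kind, text)` pair representation are all assumptions for illustration.

```python
import re

# Hypothetical token kinds; the names and the integer-only NUMBER rule
# are illustrative assumptions.
TOKEN_RE = re.compile(r"""
    (?P<SPACE>\s+)
  | (?P<LPAREN>\()
  | (?P<RPAREN>\))
  | (?P<DOT>\.)
  | (?P<NUMBER>[+-]?\d+(?![^\s().]))
  | (?P<TEXT>[^\s().]+)
""", re.VERBOSE)


def lex(buffer):
    """Tokenize the whole buffer at once, returning (kind, text) pairs.

    The result always starts with a BOF token and ends with an EOF token.
    """
    tokens = [("BOF", "")]
    pos = 0
    while pos < len(buffer):
        match = TOKEN_RE.match(buffer, pos)
        if match is None:
            # Defensive: the alternatives above cover every character.
            raise SyntaxError(f"unexpected character at offset {pos}")
        if match.lastgroup != "SPACE":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    tokens.append(("EOF", ""))
    return tokens


print(lex("(A . 42)"))
```

Because the buffer is complete, the lexer only ever moves an offset forward through one string; there is no refill logic and no partial token to carry over.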

For the moment, we’re treating symbols as text. Later, when we add a symbol table, we’ll fix this logic. The function checks whether the current token is one of the two valid atom tokens, a number or text. If it is, the parser state advances to the next token and the function returns 1 to indicate that an atom was found. Otherwise, it returns 0 to indicate that no atom was found.
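The behavior described above can be sketched in Python as follows. The `ParserState` class, its field names, and the token kind strings `NUMBER` and `TEXT` are hypothetical; only the 1/0 return convention and the advance-on-success rule come from the description.

```python
class ParserState:
    """Hypothetical parser state: a token list plus a cursor."""

    def __init__(self, tokens):
        self.tokens = tokens  # list of (kind, text) pairs
        self.pos = 0          # index of the current token


def atom(state):
    """Return 1 and advance if the current token is an atom, else return 0."""
    kind, _text = state.tokens[state.pos]
    if kind in ("NUMBER", "TEXT"):
        state.pos += 1  # consume the atom
        return 1
    return 0  # not an atom; the cursor is left where it was


state = ParserState([("TEXT", "A"), ("RPAREN", ")"), ("EOF", "")])
print(atom(state), state.pos)  # 1 1
print(atom(state), state.pos)  # 0 1
```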

Right now the parser only recognizes valid s-expressions; it doesn’t build anything from them. Later, we will add code to the placeholders to build a structure for the s-expressions. We will use that structure when building the interpreter.
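As a rough picture of what a recognizer with placeholders looks like, here is a minimal recursive-descent sketch in Python. It assumes the input is a list of `(kind, text)` pairs that starts with a `BOF` token and ends with an `EOF` token; the class name, method names, and token kind strings are hypothetical.

```python
class Recognizer:
    """Recursive-descent recognizer for: input ::= BOF sexpr EOF."""

    def __init__(self, tokens):
        self.tokens = tokens  # assumed to start with BOF and end with EOF
        self.pos = 0

    def accept(self, kind):
        """Consume the current token if it has the given kind."""
        if self.tokens[self.pos][0] == kind:
            self.pos += 1
            return True
        return False

    def sexpr(self):
        if self.accept("NUMBER") or self.accept("TEXT"):
            return True  # placeholder: build an atom here
        if not self.accept("LPAREN"):
            return False
        if self.accept("RPAREN"):
            return True  # placeholder: build the empty list here
        if not self.sexpr():
            return False
        if self.accept("DOT"):
            # Dotted pair: exactly one more s-expression, then ")".
            return self.sexpr() and self.accept("RPAREN")
        # List notation: zero or more further s-expressions, then ")".
        while not self.accept("RPAREN"):
            if not self.sexpr():
                return False
        return True  # placeholder: build the list here

    def parse(self):
        return self.accept("BOF") and self.sexpr() and self.accept("EOF")


tokens = [("BOF", ""), ("LPAREN", "("), ("TEXT", "A"), ("DOT", "."),
          ("TEXT", "B"), ("RPAREN", ")"), ("EOF", "")]
print(Recognizer(tokens).parse())  # True
```

The dotted-pair and list forms share the prefix of an opening parenthesis followed by one s-expression, so the sketch parses that prefix once and then branches on whether a dot follows. The EOF sentinel is what stops the loop on unterminated input without a bounds check.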