
I’m going to write a compiler for a simple language. The compiler will be written in C#, and will have multiple back ends. The first back end will compile the source code to C, and use cl.exe (the Visual C++ compiler) to produce an executable binary.

Therefore, I’m not going to call this thing a “series”. I might be able to write another post on the same subject, or I might not; if I don’t, I’ll post the whole lump of source code here and have you decide if it’s worth continuing on your own.

With that said, let’s start by introducing the language for which we’ll write a compiler. It’s called Jack, and I haven’t made it up—it’s a teaching language used in the book The Elements of Computing Systems by Noam Nisan and Shimon Schocken, with some minor modifications I introduced. The language is designed to make lexical analysis, parsing, and code generation as easy as possible. (Indeed, the HUJI course From NAND to Tetris covers compiler construction in two lessons, and students complete a working Jack compiler—to an intermediate VM representation—in slightly less than three weeks.)

C:\JackCompiler>out
13 is prime.
14 is composite.
41 is prime.
97 is prime.
101 is prime.

Assuming that we’re not interested in extraneous formalism, we can go ahead and think about the first part of the compiler—the lexical analyzer, or the tokenizer. The structure of a compiler is well-illustrated by the following diagram [source]:

Before we attach semantic meaning to the language constructs, we have to take care of such details as skipping unnecessary whitespace, recognizing legal identifiers, separating symbols from keywords, and so on. This is the purpose of the lexical analyzer, which takes an input stream of characters and generates from it a stream of tokens—elements that can be processed by the parser. Sometimes the parser constructs a parse tree (or another intermediate representation of the source code, such as an abstract syntax tree); at other times, the parser directly instructs the compiler back-end (or code generator) to synthesize the executable program.

Normally, you wouldn’t write the lexical analyzer by hand. Instead, you provide a tool such as flex with a list of regular expressions and rules, and obtain from it a working program capable of generating tokens. For example, the following regular expression recognizes all legal Jack identifiers:

[_A-Za-z][_A-Za-z0-9]*
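As a quick sanity check, the pattern can be exercised with .NET’s Regex class (a throwaway snippet for illustration only, not part of the compiler):

```csharp
using System;
using System.Text.RegularExpressions;

class IdentifierCheck
{
    // The identifier pattern from above, anchored so it must match the whole input.
    static readonly Regex Ident = new Regex(@"^[_A-Za-z][_A-Za-z0-9]*$");

    static void Main()
    {
        Console.WriteLine(Ident.IsMatch("counter"));  // True
        Console.WriteLine(Ident.IsMatch("_tmp1"));    // True
        Console.WriteLine(Ident.IsMatch("2fast"));    // False: must not start with a digit
        Console.WriteLine(Ident.IsMatch("a-b"));      // False: '-' is not allowed
    }
}
```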

However, for didactic reasons, we will be rolling our own lexical analyzer by hand. It’s not a very challenging task, either—dealing with comments and extraneous whitespace is probably the hardest part.

The following is the primary method of our lexical analyzer. (The rest of its implementation was omitted for brevity.)
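To make the discussion below concrete, here is a rough sketch of what such a dispatch loop can look like. Every name here (Tokenizer, NextToken, ReadSymbol, and so on), the keyword list, and the handling of two-character symbols are illustrative guesses of mine, not the actual implementation:

```csharp
using System;

// Illustrative sketch only: the real compiler's tokenizer may differ in every detail.
enum TokenType { Symbol, IntConst, StrConst, Keyword, Ident, EOF }

record Token(TokenType Type, string Value);

class Tokenizer
{
    // A subset of Jack's keywords, for brevity.
    static readonly string[] Keywords =
        { "class", "function", "method", "var", "let", "do", "if", "else", "while", "return" };

    readonly string _src;
    int _pos;

    public Tokenizer(string src) { _src = src; }

    public Token NextToken()
    {
        SkipWhitespaceAndComments();
        if (_pos >= _src.Length) return new Token(TokenType.EOF, "");
        char c = _src[_pos];
        if (char.IsDigit(c)) return ReadIntConst();
        if (c == '"') return ReadStrConst();
        if (c == '_' || char.IsLetter(c)) return ReadKeywordOrIdent();
        return ReadSymbol();
    }

    void SkipWhitespaceAndComments()
    {
        while (_pos < _src.Length)
        {
            if (char.IsWhiteSpace(_src[_pos])) _pos++;
            else if (_src[_pos] == '/' && _pos + 1 < _src.Length && _src[_pos + 1] == '/')
                while (_pos < _src.Length && _src[_pos] != '\n') _pos++;  // line comment
            else break;
        }
    }

    Token ReadIntConst()
    {
        int start = _pos;
        while (_pos < _src.Length && char.IsDigit(_src[_pos])) _pos++;
        return new Token(TokenType.IntConst, _src.Substring(start, _pos - start));
    }

    Token ReadStrConst()
    {
        int start = ++_pos;                        // skip the opening quote
        while (_src[_pos] != '"') _pos++;          // '"' cannot appear inside the string
        return new Token(TokenType.StrConst, _src.Substring(start, _pos++ - start));
    }

    Token ReadKeywordOrIdent()
    {
        int start = _pos;
        while (_pos < _src.Length && (_src[_pos] == '_' || char.IsLetterOrDigit(_src[_pos]))) _pos++;
        string text = _src.Substring(start, _pos - start);
        var type = Array.IndexOf(Keywords, text) >= 0 ? TokenType.Keyword : TokenType.Ident;
        return new Token(type, text);
    }

    Token ReadSymbol()
    {
        char c = _src[_pos++];
        // One extra character of look-ahead turns '<', '>' and '!' into "<=", ">=", "!=".
        if ((c == '<' || c == '>' || c == '!') && _pos < _src.Length && _src[_pos] == '=')
            return new Token(TokenType.Symbol, c.ToString() + _src[_pos++]);
        return new Token(TokenType.Symbol, c.ToString());
    }
}
```

Fed the statement `if (x != 100) { return; }`, this sketch yields the stream: Keyword `if`, Symbol `(`, Ident `x`, Symbol `!=`, IntConst `100`, Symbol `)`, Symbol `{`, Keyword `return`, Symbol `;`, Symbol `}`.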

There are five interesting cases here from which five different token types can be generated:

A symbol [TokenType.Symbol], which may contain two characters—explaining the need for additional look-ahead with ‘<’, ‘>’, and ‘!’. Note that the additional look-ahead may fail if the symbol is placed at the end of the file, but this is not a legal language construct, anyway.

A numeric constant [TokenType.IntConst]—we currently allow only integer constants, as the language doesn’t have floating-point support.

A character ordinal constant such as ‘H’ or ‘\032’—these are translated to numeric constants, as in the previous case.

A literal string constant [TokenType.StrConst] such as “Hello World”—note that ‘"’ is not a legal character within a literal string constant. We leave it for now as a language limitation.

A keyword or an identifier [TokenType.Keyword or TokenType.Ident], matching the previously shown regular expression.
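The character-ordinal case amounts to replacing the literal with its numeric code, so the tokenizer can emit it as an ordinary integer constant. In C# terms (a toy illustration only):

```csharp
using System;

class CharOrdinalDemo
{
    static void Main()
    {
        // The character literal 'H' is just the number 72 in disguise,
        // so the tokenizer can emit an integer-constant token with value 72
        // whenever it encounters 'H' in the source text.
        Console.WriteLine((int)'H');   // 72
    }
}
```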

This lexical analyzer is rather “dumb”—it does not record identifier information anywhere, and it doesn’t provide access to anything but the current token. It turns out that we don’t need anything else for the current Jack syntax—formally speaking, it is almost an LL(1) language, i.e. most of its language constructs can be parsed with only one look-ahead token. The single LL(2) exception is subroutine calls within expressions, and we’ll craft a special case in the parser to work around this limitation.
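To see why this case needs two tokens of look-ahead, consider an expression term that starts with an identifier: the identifier alone does not tell the parser anything, and only the token after it reveals whether the term is a plain variable, an array access, or a subroutine call. A toy classifier (names and strings are illustrative, not the parser’s real code) makes the branching explicit:

```csharp
using System;

class TermKindDemo
{
    // The identifier itself never disambiguates; only the token after it does.
    public static string ClassifyTerm(string identifier, string nextToken) => nextToken switch
    {
        "[" => "array access",
        "(" => "subroutine call",
        "." => "subroutine call on a class or object",
        _   => "variable reference",
    };

    static void Main()
    {
        Console.WriteLine(ClassifyTerm("x", "+"));   // variable reference
        Console.WriteLine(ClassifyTerm("a", "["));   // array access
        Console.WriteLine(ClassifyTerm("f", "("));   // subroutine call
    }
}
```

In the real parser the same idea takes the form of a special case: consume the identifier first, then decide what kind of term to build based on the token that follows.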

For the “Hello World” program above, this lexical analyzer will produce the following sequence of tokens: