cucu: a compiler you can understand (part 1)

Let’s talk about compilers. Have you ever thought of writing your own?

I will try to show you how simple it is. The first part will be pretty much
theoretical, so be patient.

what are we going to achieve?

CUCU is a toy compiler for a toy language. I want it to be as close to ANSI C
as possible, so that every valid CUCU program can be compiled with a C
compiler without any errors. Of course, supporting the whole ANSI C standard
is very difficult, so I picked a very small subset of C.

Let’s try to write it down in EBNF form (it’s absolutely fine if you don’t
know what EBNF is; the notation is really intuitive):

<program> ::= { <var-decl> | <func-decl> | <func-def> } ;

This notation says: “a program is a repeating sequence of variable declarations,
function declarations and function definitions”. But what are all those
declarations and definitions? Ok, let’s go deeper:
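To give an idea of their shape, here is a rough sketch in the same EBNF style (a simplified guess with illustrative rule names, not the exact grammar):

```
<var-decl>  ::= <type> <ident> ";" ;
<func-decl> ::= <type> <ident> "(" <func-args> ")" ";" ;
<func-def>  ::= <type> <ident> "(" <func-args> ")" <func-body> ;
```

A declaration only announces a name and its type, while a definition also supplies a <func-body>.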

An expression is a smaller part of a statement. As opposed to statements,
expressions always return a value. Usually it’s just arithmetic. For example,
in the statement func(x[2], i + j) the expressions are x[2] and i + j.
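An expression grammar in the same EBNF style might look like this (a simplified sketch with illustrative rule names):

```
<expr>         ::= <bitwise-expr> | <lvalue> "=" <expr> ;
<bitwise-expr> ::= <sum-expr> { ( "<<" | ">>" | "&" | "|" ) <sum-expr> } ;
<sum-expr>     ::= <term> { ( "+" | "-" ) <term> } ;
<term>         ::= <ident> | <number> | <ident> "[" <expr> "]"
                 | <ident> "(" { <expr> "," } ")" | "(" <expr> ")" ;
```

Note how <term> refers back to <expr>: that recursion is what lets expressions nest arbitrarily deep.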

So, looking back at <func-body>: it’s just a valid statement (which is
usually a block statement).

That’s it. Did you notice the recursion in the expression notation? Basically,
the notation above shows us the precedence of the operators from bottom to top:
parentheses and square brackets are evaluated first, and assignment goes last.

For example, according to this grammar the expression 8>>1+1
evaluates to 2 (as in 8>>(1+1)), not to 5 (as in (8>>1)+1),
because >> has lower precedence than +.

lexer

Now we are done with the grammar and are ready to start. The first thing to
write is a lexer. Our compiler takes a stream of bytes as input, and the lexer
splits that stream into smaller tokens that can be processed later. This gives
us a level of abstraction and simplifies our parser.

For example, a sequence of bytes “int i = 2+31;” will be split into tokens:

int
i
=
2
+
31
;

In normal grown-up lexers every lexeme (token) has a type and a value, so
instead of the list above we would get a list of pairs: <TYPE:int>, <ID:i>,
<ASSIGN:=>, <NUM:2>, <PLUS:+>, <NUM:31>, <SEMI:;>. We are going to detect the
lexeme type by its value instead, which is not academic at all!
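Detecting the type from the value is a small job; a sketch of how such a check could look (the enum and function name are my own illustration, not cucu’s actual code):

```c
#include <ctype.h>

enum tok_type { TOK_NUM, TOK_ID, TOK_OTHER };

/* guess the lexeme type from its first character */
static enum tok_type tok_type_of(const char *tok) {
    if (isdigit((unsigned char)tok[0]))
        return TOK_NUM;                        /* "2", "31" */
    if (isalpha((unsigned char)tok[0]) || tok[0] == '_')
        return TOK_ID;                         /* "int", "i" */
    return TOK_OTHER;                          /* "=", "+", ";" */
}
```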

The major problem with a lexer is that once a byte is read from the stream,
it can not be “un-read”. What if we’ve read a byte that can not be added to
the current token? Where should we store it until the current token is
processed?

Almost every lexer needs to read ahead. Our grammar is simple enough that all
we need is a single-byte buffer, nextc. It stores a byte that has been read
from the stream but has not yet been pushed to the token string.

Also, I need to warn you: I use global variables a lot in the CUCU code. I
know it’s bad style, but if I passed all values as function arguments the
compiler would lose its simplicity.

The whole lexer is just a single function, readtok(). The algorithm is simple:

skip leading spaces

try to read an identifier (a sequence of letters, digits and _)

if it’s not an identifier, try to read a run of special operator characters,
like &, |, <, >, =, !

if it’s not an operator, try to read a string literal “….” or ‘….’

if that failed, maybe it’s a comment, like /* ... */?

if all of that failed, just read a single byte. It might be another
single-byte token, like “(” or “[”.