An Introduction to Compiler Design - Part I - Lexical Analysis

Posted 20 December 2011 - 05:28 PM

A. Requirements and Resources

Knowledge of the following topics will be helpful: procedural programming, regular expressions, recursion, finite-state machines, simple data structures, and assembly language. You don't have to be completely acquainted with these topics to get a feel for what's going on. Most of what's discussed here would be covered in an undergraduate course in compiler theory. At the very least, you, as a programmer, can get an understanding of how regular expressions are implemented, how to generate assembly code from its high-level equivalent, and how applied graph theory can be used to optimize the runtime of your programs. You'll certainly get some direction as to how to build a compiler from scratch with minimal effort. As an undergraduate, graduate, or prospective computer science student, you'll see why it's necessary to take the typical core classes in your program and how they'll support you in the later years of your education if you decide to take a course in compiler theory. Compiler theory has been by far one of the most rewarding and intriguing subjects I've studied. It's one step in bridging the gap of knowing how computation is done on all levels; the other is computer architecture.

A few decades ago, programmers imagined the day they wouldn't have to write cryptic code that interfaced so closely with hardware. That day came, and the notion of a “high-level language” (HLL) was born. A high-level language is one that is human readable. It abstracts (hides) the details of the internal workings of the computer away from the programmer. Examples of HLLs are C/C++/C#, Java, Python, PHP, Ruby, VB.NET/Basic, Scheme, and the list goes on.

An HLL can be implemented in two ways: via interpretation and via compilation. Interpretation is beyond the scope of this tutorial, but many principles of compiler theory apply to interpretation as well. I can't count how many times I've heard the question: “How can I design my own language?” You build a compiler. A compiler is a software tool responsible for transforming code written in an HLL into code in a low-level language, usually referred to as “assembly.”

Designing a compiler can be a daunting task, but it doesn't have to be. Rome wasn't built in a day, and neither was C++. Many simple compilers are built for proprietary low-level languages that are used in embedded software, for example. In this tutorial, I will walk you through the six phases of compiler design: lexical analysis, parsing (or “syntactical analysis”), syntax-directed translation, semantic analysis, optimization, and code generation. The first four phases make up what is known as the "front end" of the compiler and the last two make up what is known as the "back end." All example code snippets will be done in Java and all concepts will be explained using an object-oriented approach.

C. Lexical Analysis

The word "lexical" is related to the term “lexeme”: an abstract unit of morphological analysis in linguistics. Lexical analysis, or "tokenization," is the process of breaking a character stream into individual units called “tokens.” Tokens are strings that represent a category of symbols. If you're familiar with Java, you may know the legacy class StringTokenizer, which breaks a string into units around a set of delimiter characters, and String.split, which uses a regular expression and returns the units in a String array. A “lexer” or "scanner" is a software tool used by the compiler to tokenize source code. Tokenization is realized through pattern matching. Let's look at an example. Suppose I have the following class declaration:

public class Integer { int value; }

The “parser,” another software tool used by the compiler, contains an instance of a lexer that exposes a method nextToken. Whenever the parser invokes this method, the lexer returns the next token in the source code to the parser. What's done with the token is left to the parser. Lexing is usually a sub-phase of parsing. Consecutive calls to nextToken produce the following sequence of return values:

PUBLIC
CLASS
ID(“Integer”)
LCURLY
INT
ID(“value”)
SEMICOLON
RCURLY

That's the full responsibility of the lexer. The lexer's pattern-matching engine uses a set of regular expressions to match a set of known tokens, so it's a little more complicated than just splitting the text on white space. For example, splitting on white space would break string literals apart, since they can contain white space.
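To make the nextToken idea concrete, here is a minimal hand-written sketch that tokenizes the class declaration above with java.util.regex. The class name SketchLexer, the rule order, and the plain-String token format are assumptions made for illustration; a real lexer would return Symbol objects rather than strings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical lexer for illustration only; not the lexer JLex generates.
class SketchLexer {
    // Rules are tried in order, so keywords come before the identifier rule.
    private static final String[] NAMES = {
        "PUBLIC", "CLASS", "INT", "ID", "LCURLY", "RCURLY", "SEMICOLON"
    };
    private static final Pattern[] PATTERNS = {
        Pattern.compile("public\\b"),
        Pattern.compile("class\\b"),
        Pattern.compile("int\\b"),
        Pattern.compile("[_a-zA-Z][a-zA-Z0-9_]*"),
        Pattern.compile("\\{"),
        Pattern.compile("\\}"),
        Pattern.compile(";"),
    };

    private final String input;
    private int pos = 0;  // offset of the next unread character

    SketchLexer(String input) {
        this.input = input;
    }

    // Returns the next token as text, or "EOF" when the input is exhausted.
    String nextToken() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;  // white space separates tokens but is not a token itself
        }
        if (pos >= input.length()) {
            return "EOF";
        }
        for (int i = 0; i < PATTERNS.length; i++) {
            Matcher m = PATTERNS[i].matcher(input).region(pos, input.length());
            if (m.lookingAt()) {  // the match must start exactly at pos
                pos = m.end();
                return NAMES[i].equals("ID") ? "ID(\"" + m.group() + "\")" : NAMES[i];
            }
        }
        throw new IllegalStateException("Unexpected character at offset " + pos);
    }

    public static void main(String[] args) {
        SketchLexer lexer = new SketchLexer("public class Integer { int value; }");
        for (String t = lexer.nextToken(); !t.equals("EOF"); t = lexer.nextToken()) {
            System.out.println(t);
        }
    }
}
```

Run on the class declaration, this sketch yields the same sequence listed above and then EOF. The \b after each keyword keeps a name such as publicly from being split into a keyword and an identifier.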

- Lexer Generators: Do it the Easy Way!

A “lexer generator” or "lexer compiler" generates a lexer according to some specification. Several lexer generators exist; two are Flex (C/C++) and JLex (Java). Both are very easy to use; both require a specification file that contains regular expression rules. A rule consists of a regular expression that determines which token to attempt to match and an action that creates a Symbol object to be returned from calls to nextToken. Symbol represents an abstract category of all tokens. Each Symbol returned wraps an instance that represents the category of token matched. That instance may be a TokenVal (all keyword matches return this), IdTokenVal, StrTokenVal, or IntTokenVal. Let's take a look at the contents of a JLex specification file.
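A token-value class of the kind described here might be sketched as follows. The class names TokenVal and IdTokenVal and the parameter names l, c, and idVal come from the text; the field names lineNum and charNum and the inheritance relationship are assumptions.

```java
// Hypothetical sketch: base class for the value carried by every token.
class TokenVal {
    final int lineNum;  // l: line on which the token was found
    final int charNum;  // c: character position of the token on that line

    TokenVal(int l, int c) {
        lineNum = l;
        charNum = c;
    }
}

// Hypothetical sketch: an identifier token also remembers its text.
class IdTokenVal extends TokenVal {
    final String idVal;  // the identifier exactly as written in the source

    IdTokenVal(int l, int c, String idVal) {
        super(l, c);
        this.idVal = idVal;
    }
}
```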

This is how the identifier token is represented as a class object. Just to reiterate, an identifier is the name of a variable, class, or method. In the code above, l is the line number and c is the character number at which the identifier is located; idVal is the actual identifier value. You can see how compilers are intelligent enough to tell you where syntax errors occur.

ID = [_a-zA-Z][a-zA-Z0-9_]*

This is the Perl-style regular expression that's fed to the pattern-matching engine to match an identifier. Just like in Java, identifiers can start with an underscore and can contain digits and a mix of upper- and lower-case letters. Matching tokens like { and public doesn't require complex regular expressions, but it does require actions. Identifiers, string literals, integer literals, and white space require more complex regular expressions because they can vary greatly.
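The identifier expression can be tried out directly with Java's own java.util.regex package. This is only a quick check of the pattern, not how JLex uses it; the class name IdPatternDemo and the method isIdentifier are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper for experimenting with the identifier pattern.
class IdPatternDemo {
    // Same expression as the ID rule above.
    static final Pattern ID = Pattern.compile("[_a-zA-Z][a-zA-Z0-9_]*");

    // True only if the whole string is a single identifier.
    static boolean isIdentifier(String s) {
        return ID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"value", "_tmp1", "Integer", "1abc", "two words", ""};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isIdentifier(s));
        }
    }
}
```

The first three samples are accepted. The last three are rejected because they start with a digit, contain a space, or are empty.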

This is the Java code that runs when a match is made. yytext returns the actual text of the matched token. S is the Symbol instance returned to the parser after a call to nextToken. Notice that Symbol encompasses IdTokenVal. While calling nextToken, the parser is only concerned with the type of token that nextToken returns. It uses that type to conduct syntactical analysis, which will be explained in part II. In this action the type is sym.ID. As you may have noticed above, ID(“Integer”) is a part of the tokenized collection. The instance of IdTokenVal becomes important later when the parse tree is processed and the symbol table is built. The parser will stop collecting tokens when it sees sym.EOF.

JLex will take this specification file and generate a Java class that contains the exact classes and actions declared in the specification and the rest of the lexing logic necessary for it to work. It will also plug your regular expressions into the appropriate places. Much of the specification is the same logic repeated over and over. Learning how to write a specification takes a fraction of the time it would take to learn and code a complex lexer implementation.

At this point in the tutorial, you're equipped with enough knowledge to create a fully functioning lexer for any language. Here's a snippet of source code from a Java-like (procedural) language called SIMPLE that is tokenized from the JLex specification snippets I showed you.

Let me start off by noting that JLex, like the classic lex tool it is modeled on, builds its own table-driven matcher from your regular expressions, so you never have to write one yourself. But since I told you that this tutorial will give you pointers on designing a compiler from scratch, we need to cover all bases. So how would one go about implementing a pattern matcher? Finite-state automata, or “finite-state machines” (FSMs), can be used to model many computer programs and logic circuits. An FSM consists of states and transitions. It operates over an input, which in this case is a character stream. Each character is examined in turn, and together with the current state it determines the next state that the FSM enters. At the end of the input, the FSM is in either an accept or a non-accept state. By the way, FSMs are a restricted type of Turing machine.

For our purposes, if the FSM is in an accept state, we've matched a known token; otherwise the match failed. There are several accept states, each of which corresponds to a known token. Below is an FSM (designed in JFLAP) that matches the keywords public, class, and static, plus an identifier made of one or more instances of the characters a, b, and c. The yellow circles represent states. The white arrow points to the start state, and the concentric circles represent final states. The lines connecting states are transitions. Notice the input characters above the transition lines.

The circular portion of the FSM was generated via the regular expression (a+b+c)(a+b+c)*, written in JFLAP's notation, where + means “or”; in Perl-style notation it would be (a|b|c)(a|b|c)*. You can test the FSM with input to make sure the pattern matcher is correct.
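The identifier portion of such an FSM can be sketched directly in code as states plus a transition function. This is a deterministic, hand-written version for one or more of the characters a, b, and c; the class name AbcIdentifierFsm, the state names, and the explicit DEAD state are assumptions for illustration and do not come from the JFLAP diagram.

```java
// Hypothetical FSM accepting one or more of the characters a, b, c.
class AbcIdentifierFsm {
    enum State { START, IN_ID, DEAD }

    // One transition: the current state and the input character pick the next state.
    static State step(State current, char ch) {
        boolean abc = ch == 'a' || ch == 'b' || ch == 'c';
        switch (current) {
            case START:
            case IN_ID:
                return abc ? State.IN_ID : State.DEAD;
            default:
                return State.DEAD;  // once dead, no input can revive the match
        }
    }

    // Runs the FSM over the whole input; IN_ID is the only accept state.
    static boolean accepts(String input) {
        State state = State.START;
        for (char ch : input.toCharArray()) {
            state = step(state, ch);
        }
        return state == State.IN_ID;
    }

    public static void main(String[] args) {
        System.out.println(accepts("abcab"));
        System.out.println(accepts("abd"));
        System.out.println(accepts(""));
    }
}
```

Only the first call ends in the accept state. The second hits a character with no transition, and the third never leaves the start state.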

- FSM Implementation with Transition Tables

Lastly, we have to implement the FSM. It can be represented by a transition table, which lends itself nicely to a 2D array. Each pair of a state and an input character is mapped to the state that the FSM transitions to. The following table is a partial transition table generated from the FSM above.

Unfortunately, non-deterministic finite-state machines (the above FSM is an NFA) aren't easily translated into transition tables; deterministic finite-state machines are. An NFA can be transformed into a DFA by applying the Subset Construction algorithm. An NFA is an FSM whose states can transition on lambda, the “empty string,” and whose transitions aren't uniquely determined by the input: a state can transition to two different states on the same input symbol. Very confusing, I know! You can see all the lambda transitions in the circular portion of the FSM above. A DFA has one and only one transition from each state for a particular input and no transitions on lambda. The portions of the FSM above that match public, static, and class are deterministic. Earlier, I mentioned that JFLAP generated the NFA; alternatively, I can use Thompson's algorithm to generate an NFA from a regular expression. Once you have the DFA, generating the transition table is a piece of cake. Note that Subset Construction and Thompson's algorithm can both be implemented programmatically, operating on node-based directed graphs.
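For the curious, here is a compact sketch of Subset Construction. It assumes NFA states are numbered with integers and uses the character '\0' to stand in for lambda; the class name, the method names, and the nested-map representation are all choices made for this example rather than anything JLex or JFLAP prescribes.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of Subset Construction over an integer-numbered NFA.
class SubsetConstruction {
    static final char EPS = '\0';  // assumption: '\0' stands in for lambda

    // nfa.get(state).get(symbol) is the set of states reachable in one step.
    private final Map<Integer, Map<Character, Set<Integer>>> nfa = new HashMap<>();

    void addEdge(int from, char symbol, int to) {
        nfa.computeIfAbsent(from, k -> new HashMap<>())
           .computeIfAbsent(symbol, k -> new TreeSet<>())
           .add(to);
    }

    private Set<Integer> targets(int state, char symbol) {
        Map<Character, Set<Integer>> row = nfa.get(state);
        if (row == null || !row.containsKey(symbol)) {
            return Collections.emptySet();
        }
        return row.get(symbol);
    }

    // All states reachable from the given states using lambda transitions only.
    Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            for (int t : targets(work.pop(), EPS)) {
                if (result.add(t)) {
                    work.push(t);
                }
            }
        }
        return result;
    }

    // All states reachable from the given states on one occurrence of symbol.
    Set<Integer> move(Set<Integer> states, char symbol) {
        Set<Integer> result = new TreeSet<>();
        for (int s : states) {
            result.addAll(targets(s, symbol));
        }
        return result;
    }

    // Each DFA state is a set of NFA states; the value maps a symbol to the next DFA state.
    Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(int start, char[] alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> first = closure(Collections.singleton(start));
        dfa.put(first, new HashMap<>());
        work.push(first);
        while (!work.isEmpty()) {
            Set<Integer> current = work.pop();
            for (char symbol : alphabet) {
                Set<Integer> next = closure(move(current, symbol));
                if (next.isEmpty()) {
                    continue;  // no transition on this symbol
                }
                dfa.get(current).put(symbol, next);
                if (!dfa.containsKey(next)) {
                    dfa.put(next, new HashMap<>());
                    work.push(next);
                }
            }
        }
        return dfa;
    }

    public static void main(String[] args) {
        SubsetConstruction sc = new SubsetConstruction();
        sc.addEdge(0, 'a', 0);  // loop on a
        sc.addEdge(0, EPS, 1);  // lambda transition
        sc.addEdge(1, 'b', 2);  // finish on b
        System.out.println(sc.toDfa(0, new char[] {'a', 'b'}));
    }
}
```

A DFA state built this way is accepting whenever the set it stands for contains at least one accepting NFA state. Because every DFA state is a subset of the NFA's states, the number of DFA states can grow exponentially in the worst case.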

Here's an NFA for something more common: the regular expression a( a|b )*a. This was constructed using Thompson's algorithm. It's not intuitive in the least.

And, here's the corresponding DFA, constructed with Subset Construction. Now this is understandable.
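A DFA for a( a|b )*a can be written down as a 2D array. The table below is a hand-built three-state DFA for that expression, which may have fewer states than the one Subset Construction produces directly; the state numbering, the DEAD marker, and the class name TableDfa are assumptions for this sketch.

```java
// Hypothetical table-driven DFA for the regular expression a(a|b)*a.
class TableDfa {
    static final int DEAD = -1;  // marks "no transition"

    // Rows are states 0..2; columns are the inputs a (0) and b (1).
    static final int[][] NEXT = {
        {1, DEAD},  // state 0: start, only a is allowed
        {2, 1},     // state 1: leading a seen, last character is not a closing a
        {2, 1},     // state 2: last character was a (accepting)
    };
    static final boolean[] ACCEPTING = {false, false, true};

    // Maps an input character to its column, or -1 if it is not in the alphabet.
    static int column(char ch) {
        return ch == 'a' ? 0 : ch == 'b' ? 1 : -1;
    }

    static boolean matches(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            int col = column(input.charAt(i));
            if (col < 0) {
                return false;
            }
            state = NEXT[state][col];
            if (state == DEAD) {
                return false;
            }
        }
        return ACCEPTING[state];
    }

    public static void main(String[] args) {
        for (String s : new String[] {"aa", "abba", "a", "ab", "ba"}) {
            System.out.println(s + " -> " + matches(s));
        }
    }
}
```

Strings such as aa and abba end in state 2 and match; a and ab end in state 1, and ba dies on its first character.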

The transition table is represented as an adjacency matrix in the code, where the matrix is indexed by states and input characters. I can then walk through the states, examining the input character by character until I reach a point where I can no longer transition. That point is some character that the current state has no transition for. If I end in a final state, I return the token associated with that state to the caller of nextToken; otherwise, I try to match another token (that is, process another transition table).
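The walk just described might look like the following sketch, which tries one transition table after another. It covers only two token kinds (identifiers and integer literals) and skips white space by hand; the class name TableLexer, the token names, and the string token format are assumptions. In both tables every state other than the start state is accepting, so the sketch does not need to remember the last accept state the way a general lexer would.

```java
// Hypothetical lexer that walks several transition tables in turn.
class TableLexer {
    static final int DEAD = -1;

    // Column 0: letter or underscore; column 1: digit; -1: anything else.
    static int column(char ch) {
        if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || ch == '_') return 0;
        if (ch >= '0' && ch <= '9') return 1;
        return -1;
    }

    // One table per token kind; state 0 is the start, state 1 the only final state.
    static final String[] NAMES = {"ID", "INTLIT"};
    static final int[][][] TABLES = {
        {{1, DEAD}, {1, 1}},     // ID: a letter first, then letters or digits
        {{DEAD, 1}, {DEAD, 1}},  // INTLIT: one or more digits
    };

    private final String input;
    private int pos = 0;

    TableLexer(String input) {
        this.input = input;
    }

    String nextToken() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
        if (pos >= input.length()) {
            return "EOF";
        }
        for (int t = 0; t < TABLES.length; t++) {
            int state = 0;
            int i = pos;
            while (i < input.length()) {
                int col = column(input.charAt(i));
                int next = col < 0 ? DEAD : TABLES[t][state][col];
                if (next == DEAD) {
                    break;  // the current state has no transition for this character
                }
                state = next;
                i++;
            }
            if (state == 1) {  // ended in a final state: this table matched
                String lexeme = input.substring(pos, i);
                pos = i;
                return NAMES[t] + "(\"" + lexeme + "\")";
            }
            // Otherwise fall through and try the next table from the same position.
        }
        throw new IllegalStateException("No token matches at offset " + pos);
    }

    public static void main(String[] args) {
        TableLexer lexer = new TableLexer("count1 42 x");
        for (String t = lexer.nextToken(); !t.equals("EOF"); t = lexer.nextToken()) {
            System.out.println(t);
        }
    }
}
```

For the input count1 42 x, the identifier table matches count1, fails on 42 so the integer table gets its turn, and then matches x.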

Note that there are multiple transition tables. You have one for identifiers, one for integer literals, one for string literals, and one for white space. You also have a large FSM and corresponding transition table for matching all of the language's keywords. You could add states and transitions to the very first FSM displayed above for further tokens: void, String, int, private, ...

D. Part I Conclusion

Just to recap the second half of part I: we begin with a regular expression that describes a token to be matched. The expression (a Java string) is turned into an NFA (a graph data structure) using Thompson's algorithm (a coded algorithm). The NFA is then turned into a DFA using Subset Construction (also a coded algorithm). Finally, the DFA is turned into a 2D array representing the transition table. The lexer then steps through the transition table while processing input characters to see whether it ends in a final state, which indicates that a match has occurred for that token.

In practice, issues arise that are not handled so easily: minimizing the number of states in the DFA, implementing look-ahead, and developing efficient data structures. In Subset Construction, an NFA with n states can produce a DFA with up to 2^n states. That worst case is exponential, but it is normally acceptable for relatively small NFAs. Once the tables are built, a lexer runs in time linear in the length of the input string.

Pattern matching is everywhere. It's used in search-and-replace functions in text editors. It's used in the ever-so-useful grep utility. It's even been used in microchip production to find imperfections in printed circuits (that is really cool!). And of course, we as programmers use it every day to make our software work. Its applications are limitless.