Lexical Analysis

Lexical analysis is the process of converting a stream of individual characters (normally arranged in lines) into a sequence of lexical tokens: for instance, the "words" and punctuation symbols that make up source code. This sequence is then fed into the parser. It is roughly the equivalent of splitting ordinary text written in a natural language (e.g. English) into a sequence of words and punctuation symbols. Lexical analysis is often done with tools such as lex, flex and jflex.

Strictly speaking, tokenization may be handled by the parser. The reason we tend to bother with a separate tokenizing phase in practice is that it makes the parser simpler, and decouples it from the character encoding used for the source code.

What is a token

In computing, a token is a categorized block of text; the block of text itself is known as a lexeme. A lexical analyzer reads in lexemes and categorizes them according to function, giving them meaning. This assignment of meaning is known as tokenization. A token can look like anything: an English word, an operator symbol, even gibberish; it just needs to be a useful part of the structured text.

Consider the following table:

lexeme      token
------      -----
sum         IDENT
=           ASSIGN_OP
3           NUMBER
+           ADD_OP
2           NUMBER
;           SEMICOLON

Tokens are frequently defined by regular expressions, which are understood by lexical analyzer generators such as lex. The lexical analyzer reads a stream of characters, groups them into lexemes, and categorizes the lexemes into tokens. This is called "tokenizing." If the lexer finds an invalid token, it reports an error.
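As a minimal sketch of this idea (not tied to any particular generator; the token names follow the table above, and the `tokenize` helper is illustrative), a regular-expression-driven tokenizer could look like:

```python
import re

# Token specification: (name, regex) pairs, tried in order.
TOKEN_SPEC = [
    ("NUMBER",    r"\d+"),
    ("IDENT",     r"[A-Za-z_]\w*"),
    ("ASSIGN_OP", r"="),
    ("ADD_OP",    r"\+"),
    ("SEMICOLON", r";"),
    ("SKIP",      r"\s+"),   # whitespace is recognized but not returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    """Return the (token, lexeme) pairs found in text, skipping whitespace."""
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("sum = 3 + 2;"))   # the pairs from the table above
```

Running it on "sum = 3 + 2;" reproduces exactly the lexeme/token pairs listed in the table.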

Tokenizing is followed by parsing. From there, the interpreted data may be loaded into data structures for general use, interpretation, or compiling.

Consider a text describing a calculation: "46 - number_of(cows);". The lexemes here might be: "46", "-", "number_of", "(", "cows", ")" and ";". The lexical analyzer will categorize the lexeme "46" as a number, "-" as an operator character, and "number_of" as a separate token. Even the lexeme ";" in some languages (such as C) has a special meaning.

Whitespace lexemes are sometimes ignored later by the syntax analyzer. A token does not need to be valid in order to be recognized as a token: "cows" may be nonsense to the language, and "number_of" may be nonsense too, but they are tokens nonetheless in this example.

Finite State Automaton

We first study what is called a finite state automaton, or FSA for short. An FSA is the usual device for performing lexical analysis.

An FSA consists of a set of states, a starting state, one or more accept states, and a transition table. The automaton reads an input symbol and moves to a new state according to the table. If the FSA is in an accept state once the input string has been read to its end, the string is said to be accepted or recognized. The set of strings recognized in this way is the language recognized by the FSA.

Consider a language in which each string starts with 'ab' and ends with one or more 'c'. With a state class, this can be written like this:
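As an illustrative sketch of such an automaton (using a dictionary-based transition table rather than the state classes the text mentions; the state numbering is an assumption):

```python
# Transition table for the language "ab followed by one or more c":
# (state, input symbol) -> next state; missing entries mean rejection.
TRANSITIONS = {
    (0, "a"): 1,   # start state: expect 'a'
    (1, "b"): 2,   # then 'b'
    (2, "c"): 3,   # then at least one 'c'
    (3, "c"): 3,   # further 'c's keep us in the accept state
}
START = 0
ACCEPT_STATES = {3}

def accepts(s):
    """Run the FSA over s and report whether it ends in an accept state."""
    state = START
    for ch in s:
        if (state, ch) not in TRANSITIONS:
            return False            # no transition: the string is rejected
        state = TRANSITIONS[(state, ch)]
    return state in ACCEPT_STATES

print(accepts("abc"), accepts("ab"))   # True False
```

The transition table is the heart of the automaton; everything else is just the loop that drives it.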

From the above EBNF, the tokens we're about to recognize are: IDENTIFIER, INTEGER, the keyword "print", the keyword "var", :=, +, -, <, <>, EOF and the unknown token. The chosen tokens are intended both for brevity and for covering every type of token lexeme: exact single character (+, -, <, EOF and unknown), exact multiple characters (print, var, := and <>), infinitely many possible lexemes (IDENTIFIER, INTEGER and unknown), overlapping prefixes (< and <>) and overlapping as a whole (IDENTIFIER and the keywords). Identifiers and keywords here are case-insensitive. Note that some tokens fall into more than one of these types.

The Token Class

The base class of a token is simply an object containing the line and column number at which its lexeme appears. From this base class, tokens with exact lexemes (either single or multiple characters) can be implemented as direct descendants.
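A minimal sketch of this hierarchy (in Python rather than the chapter's Pascal; the class names are illustrative adaptations of the Pascal ones used later):

```python
class Token:
    """Base token: records the line and column where its lexeme starts."""
    def __init__(self, line, col):
        self.line = line
        self.col = col

class PlusToken(Token):
    """A token with an exact single-character lexeme, shared by all instances."""
    lexeme = "+"

class Identifier(Token):
    """A token whose lexeme varies, so it is stored per instance."""
    def __init__(self, line, col, lexeme):
        super().__init__(line, col)
        self.lexeme = lexeme
```

Exact-lexeme tokens need no per-instance lexeme storage; variadic tokens such as identifiers and numbers carry theirs along.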

A lexer consists of its position in the source code (line and column), the stream representing the source code, the current (or last recognized) token, and the last character read. To encapsulate movement through the source code, reading a character from the stream is implemented in the GetChar method. Despite its seemingly simple job, the implementation can be complicated, as we'll see soon. GetChar is used by the public method NextToken, whose job is to advance the lexer by one token. On to the GetChar implementation:
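A sketch of the described behavior (in Python rather than Pascal; it assumes a seekable character stream and normalizes every line ending to LF):

```python
import io

class Lexer:
    """Only the character-reading part of the lexer; names mirror the text."""
    def __init__(self, stream):
        self.stream = stream          # any seekable character stream
        self.line, self.col = 1, 0

    def get_char(self):
        ch = self.stream.read(1)
        self.col += 1
        if ch == "\r":                # CR- or CRLF-style line ending
            nxt = self.stream.read(1)
            if nxt not in ("\n", ""): # lone CR: un-read the extra character
                self.stream.seek(self.stream.tell() - 1)
            ch = "\n"                 # normalize to LF
        if ch == "\n":                # LF-style (or normalized) line ending
            self.line += 1
            self.col = 0
        return ch

lx = Lexer(io.StringIO("a\r\nb"))
print([lx.get_char() for _ in range(3)])   # ['a', '\n', 'b']
```

Note how a CRLF pair is collapsed into a single newline, and how the line counter advances while the column resets to zero.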

As stated earlier, GetChar's job is not as simple as its name suggests. First, it has to read one character from the input stream and increment the lexer's column position. Then it has to check whether this character is one of the possible line endings (our lexer is capable of handling CR-, LF- and CRLF-style line endings). Next is the core of our lexer, NextToken:

procedure TLexer.NextToken;
var
  StartLine, StartCol: LongWord;

  function GetNumber: TIntegerToken;
  var
    NumLex: String;
  begin
    NumLex := '';
    repeat
      NumLex := NumLex + FLastChar;
      FLastChar := GetChar;
    until not (FLastChar in ['0'..'9']);
    Result := TIntegerToken.Create(StartLine, StartCol, NumLex);
  end;

  function GetIdentifierOrKeyword: TVariadicToken;
  var
    IdLex: String;
  begin
    IdLex := '';
    repeat
      IdLex := IdLex + FLastChar;
      // This is how we handle case-insensitiveness
      FLastChar := LowerCase(GetChar);
    until not (FLastChar in ['a'..'z', '0'..'9', '_']);
    // Need to check for keywords
    case IdLex of
      'print': Result := TPrintToken.Create(StartLine, StartCol);
      'var':   Result := TVarToken.Create(StartLine, StartCol);
      otherwise
        Result := TIdentifier.Create(StartLine, StartCol, IdLex);
    end;
  end;

begin
  // Eat whitespaces
  while FLastChar in [#32, #9, #13, #10] do
    FLastChar := GetChar;
  // Save first token position, since GetChar would change FLine and FCol
  StartLine := FLine;
  StartCol := FCol;
  if FLastChar = #26 then
    FCurrentToken := TEOFToken.Create(StartLine, StartCol)
  else
    case LowerCase(FLastChar) of
      // Exact single character
      '+': begin
        FCurrentToken := TPlusToken.Create(StartLine, StartCol);
        FLastChar := GetChar;
      end;
      '-': begin
        FCurrentToken := TMinusToken.Create(StartLine, StartCol);
        FLastChar := GetChar;
      end;
      // Exact multiple character
      ':': begin
        FLastChar := GetChar;
        if FLastChar = '=' then begin
          FCurrentToken := TAssignToken.Create(StartLine, StartCol);
          // Consume the '=' that completed the token
          FLastChar := GetChar;
        end else
          // ':' alone has no meaning in this language, therefore it's invalid
          FCurrentToken := TUnknownToken.Create(StartLine, StartCol);
      end;
      // Overlapping prefix
      '<': begin
        FLastChar := GetChar;
        // '<>'
        if FLastChar = '>' then begin
          FCurrentToken := TNotEqualToken.Create(StartLine, StartCol);
          // Consume the '>' that completed the token
          FLastChar := GetChar;
        end else
          // '<'
          FCurrentToken := TLessToken.Create(StartLine, StartCol);
      end;
      // Infinitely many is handled in its own function to cut down line length here
      '0'..'9': FCurrentToken := GetNumber;
      'a'..'z', '_': FCurrentToken := GetIdentifierOrKeyword;
      else begin
        FCurrentToken := TUnknownToken.Create(StartLine, StartCol, FLastChar);
        FLastChar := GetChar;
      end;
    end;
end;

As you can see, the core is a (possibly big) case statement. The other parts are quite self-documenting and well commented. Last but not least, the constructor:

It sets up the initial line and column position (guess why it's 1 for the line but 0 for the column :)), and also forms the first token, so that CurrentToken is already available after the constructor is called; there is no need to call NextToken explicitly after construction.
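The constructor's behavior can be sketched with a simplified, self-contained Python model (only NUMBER, EOF and unknown tokens are handled here, and every name is illustrative):

```python
class Lexer:
    def __init__(self, source):
        self.source = source
        self.pos = 0
        self.line, self.col = 1, 0        # col is 0: the first get_char makes it 1
        self.last_char = self.get_char()  # prime the one-character lookahead
        self.current_token = None
        self.next_token()                 # first token ready right after construction

    def get_char(self):
        if self.pos >= len(self.source):
            return "\x1a"                 # EOF sentinel, like #26 in the Pascal version
        ch = self.source[self.pos]
        self.pos += 1
        self.col += 1
        return ch

    def next_token(self):
        while self.last_char in " \t\r\n":            # eat whitespace
            self.last_char = self.get_char()
        if self.last_char == "\x1a":
            self.current_token = ("EOF", "")
        elif self.last_char.isdigit():                # infinitely-many-lexemes case
            lexeme = ""
            while self.last_char.isdigit():
                lexeme += self.last_char
                self.last_char = self.get_char()
            self.current_token = ("NUMBER", lexeme)
        else:
            self.current_token = ("UNKNOWN", self.last_char)
            self.last_char = self.get_char()

lx = Lexer("42")
print(lx.current_token)   # ('NUMBER', '42'), with no explicit next_token call
```

The constructor primes the lookahead character and immediately forms the first token, which is exactly the convenience the text describes.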

Table-Driven Hand-Written Scanning

Compiling Channel Constant

Scanning via a Tool - lex/flex

Scanning via a Tool - JavaCC

JavaCC is the standard Java compiler-compiler. Unlike the other tools presented in this chapter, JavaCC is a parser generator and a scanner (lexer) generator in one. JavaCC takes just one input file (called the grammar file), which is then used to create both the classes for lexical analysis and those for the parser.

In JavaCC's terminology the scanner/lexical analyser is called the token manager, and the generated class that contains it is called ParserNameTokenManager. Following the usual Java file name requirements, the class is stored in a file called ParserNameTokenManager.java. The ParserName part is taken from the input file. In addition, JavaCC creates a second class, called ParserNameConstants, which, as the name implies, contains definitions of constants, especially token constants. JavaCC also generates a boilerplate class called Token; it is always the same, and represents tokens. One also gets a class called ParseException: an exception which is thrown if something went wrong during parsing.

It is possible to instruct JavaCC not to generate the ParserNameTokenManager, and instead provide your own, hand-written token manager. Usually (this holds for all the tools presented in this chapter) a hand-written scanner/lexical analyser/token manager is much more efficient. So, if you figure out that your generated compiler gets too large, give the generated scanner/lexical analyzer/token manager a good look. Rolling your own token manager is also handy if you need to parse binary data and feed it to the parsing layer.

Since, however, this section is about using JavaCC to generate a token manager, and not about writing one by hand, this is not discussed any further here.

Defining Tokens in the JavaCC Grammar File

A JavaCC grammar file usually starts with code which is relevant to the parser, not the scanner. For simple grammar files it looks similar to:
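A minimal sketch of such a grammar-file header (the parser name MyParser is made up):

```java
options {
    STATIC = false;     // generate a non-static parser
}

PARSER_BEGIN(MyParser)

public class MyParser {
    // hand-written parser helper code may go here
}

PARSER_END(MyParser)
```

Everything between PARSER_BEGIN and PARSER_END is ordinary Java, copied into the generated parser class.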

This is usually followed by the definitions of the tokens. These definitions are the information we are interested in in this chapter. JavaCC understands four different kinds of token definitions, indicated by four different keywords:

TOKEN

Regular expressions which specify the tokens the token manager should be able to recognize.

SPECIAL_TOKEN

SPECIAL_TOKENs are similar to TOKENs, except that the parser ignores them. This is useful e.g. for specifying comments, which are supposed to be recognized but have no significance to the parser.

SKIP

Tokens (input data) which are supposed to be completely ignored by the token manager. This is commonly used to ignore whitespace. A SKIP token still breaks up other tokens: e.g. if whitespace is skipped, a token "else" is defined, and the input is "el se", then the "else" token is not matched.

MORE

This is used for an advanced technique in which a token is built gradually. MORE tokens are put in a buffer until the next TOKEN or SPECIAL_TOKEN matches. Then all the data, i.e. the accumulated text in the buffer as well as the final TOKEN or SPECIAL_TOKEN, is returned.

One example where MORE tokens are useful is constructs in which one wants to match a start string, arbitrary data, and an end string. Comments and string literals in many programming languages have this form. E.g. to match string literals delimited by ", one would not return the first " found as a token. Instead, one would accumulate MORE tokens until the closing " of the literal is found, and then return the complete literal. See the Comment Example for a case where this is used to scan comments.

Each of the above mentioned keywords can be used as often as desired. This makes it possible to group the tokens, e.g. in a list for operators and another list for keywords. All sections of the same kind are merged together as if just one section had been specified.

Every specification of a token consists of the token's symbolic name and a regular expression. If the regular expression matches, the symbol is returned by the token manager.
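An illustrative fragment in standard JavaCC syntax (the token names are made up):

```java
SKIP :
{
    " " | "\t" | "\r" | "\n"     // whitespace is ignored entirely
}

TOKEN :
{
    < PLUS   : "+" >
  | < MINUS  : "-" >
  | < ASSIGN : ":=" >
  | < NUMBER : (["0"-"9"])+ >
}
```

Each angle-bracketed entry pairs a symbolic name with the regular expression that matches its lexemes.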

All the above token definitions use simple regular expressions, where just constants are matched. It is recommended to study the JavaCC documentation for the full set of possible regular expressions.

When the above file is run through JavaCC, a token manager is generated which understands the above declared tokens.

Comment Example

Eliminating (ignoring) comments in a programming language is a common task for a lexical analyzer. Of course, when JavaCC is used, this task is usually given to the token manager, by specifying special tokens.

Basically, a standard idiom has evolved for ignoring comments in JavaCC. It combines tokens of kind SPECIAL_TOKEN and MORE with a lexical state. Let's assume we have a programming language where comments start with --, and end either with another -- or at the end of the line (this is the comment scheme of ASN.1 and several other languages). Then a way to construct a scanner for this would be:
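A sketch of that idiom (the lexical state WithinComment and the token name COMMENT are conventional but arbitrary choices):

```java
MORE :
{
    "--" : WithinComment              // comment start: begin accumulating
}

<WithinComment> SPECIAL_TOKEN :
{
    < COMMENT : "--" | "\n" > : DEFAULT   // comment end: emit it as a special token
}

<WithinComment> MORE :
{
    < ~[] >                           // anything else inside the comment accumulates
}
```

The first rule switches the token manager into the WithinComment state; there, everything is buffered via MORE until the terminating -- or newline arrives, at which point the whole comment is returned as a SPECIAL_TOKEN (which the parser then ignores) and the state returns to DEFAULT.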