Description: Translation and implementation of programming languages. Parsing, code generation, and optimization. Compiler design projects for simple block-structured programming languages are used to illustrate the concepts and methods.

Shift-reduce parsing attempts to construct a parse tree for an input string beginning at the leaves and working up towards the root. In other words, it is a process of “reducing” (opposite of deriving a symbol using a production rule) a string w to the start symbol of a grammar. At every (reduction) step, a particular substring matching the RHS of a production rule is replaced by the symbol on the LHS of the production.

A general form of shift-reduce parsing is LR (scanning from Left to right and using Right-most derivation in reverse) parsing, which is used in a number of automatic parser generators like Yacc, Bison, etc.

Here are the definitions of some commonly used terminologies in this context.

Handles

A “handle” of a string is a substring that matches the RHS of a production and whose reduction to the non-terminal (on the LHS of the production) represents one step along the reverse of a rightmost derivation toward reducing to the start symbol.

If S →* αAw → αβw by a rightmost derivation, then A → β in the position following α is a handle of αβw.

In such a case, it suffices to say that the substring β is a handle of αβw, if the position of β and the corresponding production are clear.

Consider the following grammar:

E → E + E | E * E | (E) | id

and a right-most derivation is as follows:

E → E + E → E + E * E → E + E * id3 → E + id2 * id3 → id1 + id2 * id3

The id’s are subscripted for notational convenience.

Note that the reduction proceeds in the opposite direction, from id1 + id2 * id3 back to E. The handles at the successive steps are id1 (reduced by E → id), then id2, then id3, then E * E, and finally E + E.

Implementation of Shift-Reduce Parsing

A convenient way to implement a shift-reduce parser is to use a stack to hold grammar symbols and an input buffer to hold the string w to be parsed. The symbol $ is used to mark the bottom of the stack and also the right-end of the input.

Notationally, the top of the stack is identified through a separator symbol |, and the input string to be parsed appears on the right side of |. The stack content appears on the left of |.

For example, an intermediate stage of parsing can be shown as follows:

$id1 | + id2 * id3$ …. (1)

Here “$id1” is in the stack, while the input yet to be seen is “+ id2 * id3$”.

In a shift-reduce parser, there are two fundamental operations: shift and reduce.

Shift operation: The next input symbol is shifted onto the top of the stack.

After shifting + onto the stack, the state captured in (1) above would change into:

$id1 + | id2 * id3$

Reduce operation: Replaces a string of grammar symbols (the handle) on the top of the stack with the non-terminal on the LHS of the corresponding production rule.

After reducing id1 using E → id, the state (1) would change into:

$E | + id2 * id3$
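As a minimal illustration of these two operations, the full sequence of moves for id1 + id2 * id3 can be replayed on a Python list acting as the stack. Since the grammar E → E + E | E * E | (E) | id is ambiguous, the order of actions below is fixed by hand to mirror the rightmost derivation in reverse; a real LR parser would instead choose each action from its table.

```python
# Replay of the shift/reduce moves for id1 + id2 * id3.
# Each comment shows the stack | remaining-input picture from the text.

def shift(stack, inp):
    stack.append(inp.pop(0))        # move the next input symbol onto the stack

def reduce(stack, rhs_len, lhs):
    del stack[-rhs_len:]            # pop the handle (RHS of the production)
    stack.append(lhs)               # push the LHS non-terminal

stack, inp = ["$"], ["id1", "+", "id2", "*", "id3", "$"]

shift(stack, inp)                   # $ id1         | + id2 * id3 $
reduce(stack, 1, "E")               # $ E           | + id2 * id3 $   (E -> id)
shift(stack, inp)                   # $ E +         | id2 * id3 $
shift(stack, inp)                   # $ E + id2     | * id3 $
reduce(stack, 1, "E")               # $ E + E       | * id3 $         (E -> id)
shift(stack, inp)                   # $ E + E *     | id3 $
shift(stack, inp)                   # $ E + E * id3 | $
reduce(stack, 1, "E")               # $ E + E * E   | $               (E -> id)
reduce(stack, 3, "E")               # $ E + E       | $               (E -> E * E)
reduce(stack, 3, "E")               # $ E           | $               (E -> E + E)

print(stack, inp)                   # ['$', 'E'] ['$']
```

Parsing succeeds when the stack holds only the start symbol (above the $ marker) and the input is exhausted.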

Viable Prefixes

The prefixes of right-sentential forms that can appear on the stack of a shift-reduce parser are called viable prefixes. It is always possible to add terminal symbols to the end of a viable prefix to obtain a right-sentential form.

Item

An item X → β·γ is valid for a viable prefix αβ if

S’ →* αXw → αβγw

is obtainable by a rightmost derivation. After parsing αβ, the valid items are those that can appear on top of the parser's stack of items.

Notations for Possible Actions in LR Parsing Table

sn: Shift into state n

gn: Goto state n

rk: Reduce by rule k

a: Accept

blank entry: Error

Understanding Parsing using Examples

In every example, we introduce a new start symbol (S’), and define a new production from this new start symbol to the original start symbol of the grammar.

Consider the following grammar (putting an explicit end-marker $ at the end of the first production):

(1) S’ → S$

(2) S → Sa

(3) S → b

For this example, the NFA for the stack can be shown as follows:

After doing ε-closure, the resulting DFA is as follows:

The states of DFA are also called “Canonical Collection of Items”. Using the above notation, the ACTION-GOTO table can be shown as follows:

Notice that there are two entries for state 2 on input ‘a’, so this cannot be an LR(0) grammar. The next enhancement is to consult the FOLLOW set of the non-terminal being reduced: FOLLOW(S’) is {$}, which does not contain the shift symbol ‘a’, so the shift-reduce conflict can be resolved easily by reducing only when the input symbol belongs to that FOLLOW set. This enhancement to parsing is called SLR(1) parsing. The corresponding table can be written as follows:

Since there is no duplicate entry in the ACTION-GOTO table, this grammar is an SLR(1) grammar.
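The canonical collection of items for this grammar can also be computed mechanically. The sketch below builds the LR(0) item sets using the standard closure and goto constructions; the order in which states are discovered is whatever the worklist produces, so the numbering need not match the figure.

```python
# Computing the canonical collection of LR(0) items for the grammar
#   (1) S' -> S $   (2) S -> S a   (3) S -> b
# An item is a (lhs, rhs, dot-position) triple.

GRAMMAR = [("S'", ("S", "$")), ("S", ("S", "a")), ("S", ("b",))]
NONTERMS = {lhs for lhs, _ in GRAMMAR}

def closure(items):
    # Add X -> .rhs for every non-terminal X appearing right after a dot.
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in NONTERMS:
                for l2, r2 in GRAMMAR:
                    if l2 == rhs[dot] and (l2, r2, 0) not in items:
                        items.add((l2, r2, 0))
                        changed = True
    return frozenset(items)

def goto(state, symbol):
    # Advance the dot over `symbol` in every item that allows it.
    moved = [(l, r, d + 1) for l, r, d in state if d < len(r) and r[d] == symbol]
    return closure(moved) if moved else None

start = closure([("S'", ("S", "$"), 0)])
states, worklist = [start], [start]
while worklist:
    st = worklist.pop()
    for sym in {r[d] for _, r, d in st if d < len(r)}:
        nxt = goto(st, sym)
        if nxt is not None and nxt not in states:
            states.append(nxt)
            worklist.append(nxt)

print(len(states))   # 5 states in the canonical collection
```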

Now, let’s look at the progressive parsing of the input string baa$.

First, let’s consider the case when the stack only contains the grammar symbols. The DFA at every step (shown in each entry of the following table) would start from the bottom of the stack (identified with a ‘$’).

Restarting the DFA from the bottom of the stack at every step is quite wasteful. So, it makes sense to save the current state in the stack along with the grammar symbol. With this change, each stack entry (i.e., on the LHS of |) is a pair of a grammar symbol and a state.
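The resulting driver loop with (symbol, state) pairs can be sketched as follows. The ACTION and GOTO entries below are hand-derived from the item sets under one possible state numbering (0 as the start state, 3 as the accepting state reached after shifting $), so they may not match the table in the text.

```python
# SLR(1) driver for the grammar
#   (1) S' -> S $   (2) S -> S a   (3) S -> b
# The stack holds (grammar symbol, state) pairs, as described above.

ACTION = {
    (0, "b"): ("s", 2),
    (1, "a"): ("s", 4),
    (1, "$"): ("s", 3),                        # shifting $ reaches the accepting state
    (2, "a"): ("r", 3), (2, "$"): ("r", 3),    # reduce by S -> b
    (4, "a"): ("r", 2), (4, "$"): ("r", 2),    # reduce by S -> S a
}
GOTO = {(0, "S"): 1}
RULES = {2: ("S", 2), 3: ("S", 1)}             # rule number -> (LHS, RHS length)

def parse(tokens):
    stack = [("$", 0)]                         # bottom marker paired with start state
    pos = 0
    while True:
        state = stack[-1][1]
        if state == 3:                         # S' -> S $ completed: accept
            return True
        act = ACTION.get((state, tokens[pos]))
        if act is None:
            return False                       # blank entry: error
        if act[0] == "s":                      # shift: push (symbol, next state)
            stack.append((tokens[pos], act[1]))
            pos += 1
        else:                                  # reduce: pop RHS, push LHS via GOTO
            lhs, n = RULES[act[1]]
            del stack[-n:]
            stack.append((lhs, GOTO[(stack[-1][1], lhs)]))

print(parse(["b", "a", "a", "$"]))   # True
print(parse(["a", "$"]))             # False
```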

One step beyond SLR(1) parsing is LR(1) parsing, where the look-ahead is built into each item in the DFA. LALR(1) is slightly less powerful than LR(1) but more space-efficient: it merges the LR(1) states that differ only in their look-aheads into one state.

Exercises

Draw the DFA and ACTION-GOTO tables for the following grammars, which require progressively more powerful parsing.

FIRST(X) for all grammar symbols X

Apply the following rules:

1. If X is a terminal, then FIRST(X) = {X}.

2. If X → ε is a production, then add ε to FIRST(X).

3. If X is a non-terminal, and X → Y1 Y2 … Yk is a production, and ε is in all of FIRST(Y1), …, FIRST(Yk), then add ε to FIRST(X).

4. If X is a non-terminal, and X → Y1 Y2 … Yk is a production, then add a to FIRST(X) if, for some i, a is in FIRST(Yi), and ε is in all of FIRST(Y1), …, FIRST(Yi-1).

Applying rules 1 and 2 is obvious. Applying rules 3 and 4 for FIRST(Y1 Y2 … Yk) can be done as follows:

Add all the non-ε symbols of FIRST(Y1) to FIRST(Y1 Y2 … Yk). If ε ∈ FIRST(Y1), add all the non-ε symbols of FIRST(Y2). If ε ∈ FIRST(Y1) and ε ∈ FIRST(Y2), add all the non-ε symbols of FIRST(Y3), and so on. Finally, add ε to FIRST(Y1 Y2 … Yk) if ε ∈ FIRST(Yi), for all 1 ≤ i ≤ k.

Example:

Consider the following grammar.

E → E + T | T

T → T * F | F

F → (E) | id

Grammar after removing left recursion:

E → TX

X → +TX | ε

T → FY

Y → *FY | ε

F → (E) | id

For the above grammar, following the above rules, the FIRST sets could be computed as follows:

FIRST(E) = FIRST(T) = FIRST(F) = {(, id}

FIRST(X) = {+, ε}

FIRST(Y) = {*, ε}
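These FIRST sets can be checked mechanically. The sketch below iterates the rules above to a fixed point over the left-recursion-free grammar.

```python
# Fixed-point computation of FIRST for the grammar
#   E -> T X,  X -> + T X | ε,  T -> F Y,  Y -> * F Y | ε,  F -> ( E ) | id
# EPS stands for the empty string ε.

EPS = "ε"
GRAMMAR = {
    "E": [["T", "X"]],
    "X": [["+", "T", "X"], [EPS]],
    "T": [["F", "Y"]],
    "Y": [["*", "F", "Y"], [EPS]],
    "F": [["(", "E", ")"], ["id"]],
}

FIRST = {a: set() for a in GRAMMAR}

def first_of(sym):
    # For a terminal a, FIRST(a) = {a}.
    return FIRST[sym] if sym in GRAMMAR else {sym}

changed = True
while changed:
    changed = False
    for lhs, alts in GRAMMAR.items():
        for rhs in alts:
            before = len(FIRST[lhs])
            for sym in rhs:
                if sym == EPS:                       # X -> ε contributes ε
                    FIRST[lhs].add(EPS)
                    continue
                FIRST[lhs] |= first_of(sym) - {EPS}  # add non-ε symbols of FIRST(sym)
                if EPS not in first_of(sym):
                    break                            # sym not nullable: stop here
            else:
                FIRST[lhs].add(EPS)                  # every symbol nullable: add ε
            changed |= len(FIRST[lhs]) != before

print(sorted(FIRST["E"]), sorted(FIRST["X"]))   # ['(', 'id'] ['+', 'ε']
```

The fixed point agrees with the sets listed above.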

FOLLOW(A) for all non-terminals A

Apply the following rules:

1. If $ is the input end-marker, and S is the start symbol, then $ ∈ FOLLOW(S).

2. If there is a production A → αBβ, then (FIRST(β) – {ε}) ⊆ FOLLOW(B).

3. If there is a production A → αB, or a production A → αBβ where ε ∈ FIRST(β), then FOLLOW(A) ⊆ FOLLOW(B).

Note that unlike the computation of FIRST sets for non-terminals, where the focus is on what a non-terminal generates, the computation of FOLLOW sets depends upon where the non-terminal appears on the RHS of a production.

Example:

For the above grammar, the FOLLOW sets can be computed by applying the above rules as follows.

FOLLOW(E) = {$, )}

FOLLOW(E) ⊆ FOLLOW(X) [in other words, FOLLOW(X) contains FOLLOW(E)]

Since there is no other rule applicable to FOLLOW(X),

FOLLOW(X) = {$, )}

FOLLOW(T) ⊆ FOLLOW(Y) …. (1)

(FIRST(X) – {ε}) ⊆ FOLLOW(T) i.e., {+} ⊆ FOLLOW(T) …. (2)

Also, since ε ∈ FIRST(X), FOLLOW(E) ⊆ FOLLOW(T)

i.e., {$, )} ⊆ FOLLOW(T) …. (3)

Putting (2) and (3) together, we get:

FOLLOW(T) = {$, ), +}

Since there is no other rule applying to FOLLOW(Y), from (1), we get:

FOLLOW(Y) = {$, ), +}
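Likewise, the FOLLOW rules can be iterated to a fixed point. The sketch below hard-codes the FIRST sets computed earlier; note it also yields FOLLOW(F), which the text does not work out.

```python
# Fixed-point computation of FOLLOW for the same grammar.
# EPS stands for the empty string ε; [] represents an ε-production.

EPS = "ε"
GRAMMAR = {
    "E": [["T", "X"]],
    "X": [["+", "T", "X"], []],
    "T": [["F", "Y"]],
    "Y": [["*", "F", "Y"], []],
    "F": [["(", "E", ")"], ["id"]],
}
FIRST = {"E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
         "X": {"+", EPS}, "Y": {"*", EPS}}

def first_of_string(syms):
    """FIRST of a string of grammar symbols; contains ε iff the string is nullable."""
    out = set()
    for s in syms:
        f = FIRST.get(s, {s})            # for a terminal a, FIRST(a) = {a}
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)                         # empty or fully nullable string
    return out

FOLLOW = {a: set() for a in GRAMMAR}
FOLLOW["E"].add("$")                     # $ goes into FOLLOW of the start symbol

changed = True
while changed:
    changed = False
    for lhs, alts in GRAMMAR.items():
        for rhs in alts:
            for i, b in enumerate(rhs):
                if b not in GRAMMAR:     # FOLLOW is defined only for non-terminals
                    continue
                beta = first_of_string(rhs[i + 1:])
                before = len(FOLLOW[b])
                FOLLOW[b] |= beta - {EPS}        # A -> αBβ adds FIRST(β) - {ε}
                if EPS in beta:                  # β empty or nullable:
                    FOLLOW[b] |= FOLLOW[lhs]     # FOLLOW(A) flows into FOLLOW(B)
                changed |= len(FOLLOW[b]) != before

print(sorted(FOLLOW["T"]))   # ['$', ')', '+']
```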

[An excerpt from the flex manual]

GENERATING C++ SCANNERS
flex provides two different ways to generate scanners for use with C++. The first way is to simply compile a scanner generated by flex using a C++ compiler instead of a C compiler. You should not encounter any compilation errors (please report any you find to the email address given in the Author section below). You can then use C++ code in your rule actions instead of C code. Note that the default input source for your scanner remains yyin, and default echoing is still done to yyout. Both of these remain FILE * variables and not C++ streams.

You can also use flex to generate a C++ scanner class, using the -+ option (or, equivalently, %option c++), which is automatically specified if the name of the flex executable ends in a ‘+’, such as flex++. When using this option, flex defaults to generating the scanner to the file lex.yy.cc instead of lex.yy.c. The generated scanner includes the header file FlexLexer.h, which defines the interface to two C++ classes.

The first class, FlexLexer, provides an abstract base class defining the general scanner class interface. It provides the following member functions:

const char* YYText()
returns the text of the most recently matched token, the equivalent of yytext.

int YYLeng()
returns the length of the most recently matched token, the equivalent of yyleng.

int lineno() const
returns the current input line number (see %option yylineno), or 1 if %option yylineno was not used.

void set_debug( int flag )
sets the debugging flag for the scanner, equivalent to assigning to yy_flex_debug. Note that you must build the scanner using %option debug to include debugging information in it.

int debug() const
returns the current setting of the debugging flag.

The second class defined in FlexLexer.h is yyFlexLexer, which is derived from FlexLexer. It defines the following additional member functions:

yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
constructs a yyFlexLexer object using the given streams for input and output. If not specified, the streams default to cin and cout, respectively.

virtual int yylex()
performs the same role as yylex() does for ordinary flex scanners: it scans the input stream, consuming tokens, until a rule’s action returns a value. If you derive a subclass S from yyFlexLexer and want to access the member functions and variables of S inside yylex(), then you need to use %option yyclass=”S” to inform flex that you will be using that subclass instead of yyFlexLexer. In this case, rather than generating yyFlexLexer::yylex(), flex generates S::yylex() (and also generates a dummy yyFlexLexer::yylex() that calls yyFlexLexer::LexerError() if called).

In addition, yyFlexLexer defines the following protected virtual functions which you can redefine in derived classes to tailor the scanner:

virtual int LexerInput( char* buf, int max_size )
reads up to max_size characters into buf and returns the number of characters read. To indicate end-of-input, return 0 characters. Note that “interactive” scanners (see the -B and -I flags) define the macro YY_INTERACTIVE. If you redefine LexerInput() and need to take different actions depending on whether or not the scanner might be scanning an interactive input source, you can test for the presence of this name via #ifdef.

virtual void LexerOutput( const char* buf, int size )
writes out size characters from the buffer buf, which, while NUL-terminated, may also contain “internal” NUL’s if the scanner’s rules can match text with NUL’s in them.

virtual void LexerError( const char* msg )
reports a fatal error message. The default version of this function writes the message to the stream cerr and exits.

Note that a yyFlexLexer object contains its entire scanning state. Thus you can use such objects to create reentrant scanners. You can instantiate multiple instances of the same yyFlexLexer class, and you can also combine multiple C++ scanner classes together in the same program using the -P option discussed above.

Finally, note that the %array feature is not available to C++ scanner classes; you must use %pointer (the default).

SPIM is a self-contained simulator that runs MIPS32 programs. It reads and executes assembly language programs written for this processor. SPIM also provides a simple debugger and a minimal set of operating system services. SPIM does not execute binary (compiled) programs.

Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. It will be very useful for visualizing the compiler’s intermediate representations.