Introduction

to be a great programmer you must learn and write a compiler because it will make any algorithm easier for you no matter how difficult it seems to you.

writing a compiler is interesting because it teaches you what a compiler is.

I will show the front end of a compiler and how to generate a syntax tree represented as an array and generate the three-address-code .

Background :

Compiler is a program that reads a program written in a source language and translate it into an equivalent program In a target language .

Compiler consists of several phases each of which passes its output to the next phase.

Design of compiler are often partitioned into Front End that deals only with language specific issues and Back End that deals only with machine-specific issues .

In this project I discuss only the Front End.

I implemented a Predictive Parser to generate a syntax tree and an intermediate code ( three-address-code ) for assignment statement .

assignment statement is an identifier = expression , Ex: x = a + 5 or a = -x / y and so on

1 – Lexical Analysis phase ( scanner )

Lexical analyzer groups characters into lexical units or tokens ( identifier, number,..etc.) The input to the lexical phase is a character stream and the output is a stream of tokens .the white spaces ( blank , tap , new line ) are scanned out .

The scanner can be implemented as a finite state machine

This Lexical Analyzer is used as a method called by the parser to return the next token When an identifier in the input is detected by the lexical analyzer , the identifier is entered into a data structure called symbol table. The symbol table contains a record for each identifier with fields for the attribute of the identifier . the Lexical Analyzer look up for the identifier in the symbol table and enter it only if this identifier is not found inside the symbol table .

Implementation of Lexical Analyzer ( scanner )

TokenUnit nextToken()

This is the lexical analyzer method it return the token to the caller ( parser ) .

2- Syntax Analysis phase ( Parser )

The parser recognizes a sentence in terms of grammar to check if it is grammatically well formed or not. It groups tokens into a syntactical units. The output of the parser is a parse tree representation of the program, nodes of parse tree are constructed through using grammar, grammar is a set of rules which govern the interdependencies and structure among the tokens.

Context-free grammars are used to define the program structure recognized by a parser, context-free-grammar are a formalization of recursive rules that can be used to guide the syntax analysis .

This grammar is specified using ( production ) the grammar for assignment statement must be eliminate the grammar defects such as ambiguity and left recursion and left factoring .

Grammar for assignment statement :

S --> id = E

E --> T E'

E' --> + T E' | - T E' | ε

T --> F T'

T' --> * F T' | / F T' | ε

F --> id | num | - F | ( E )

Components of context-free-grammar :

Production : "left side --> right side "

Non-Terminals are { S , E , E', T, T', F }

Terminals are { id , = , + , - , * , / , num , ( , ) ,ε }

Start symbol : a designation of one of the non-terminal , S is the start symbol .

Derivations : sequence of replacement for the given input according to the grammar

for example : a = b * - c

derivate into :

S --> id =E

E -->T E'

T --> F T'

F -->id

T' --> * F T'

F --> - F

F --> id

T' --> ε

E' --> ε

Parse tree :

A parse tree shows the start symbol of a grammar derives a sentence in the input .

Syntax tree:

It is a compressed representation of the parse tree in which the operators appear as the nodes, and the operands of an operator are the children of the node , each node is represented as a record with field for its operator and additional fields for its children, nodes are allocated from an array of records and the array index ( position of the node ) serves as the pointer to the node , all the node in the syntax tree can be visited by following pointers , starting from the root at the last position .

The node structure is consist of label and left pointer and right pointer such as a binary operation , some nodes such as unary minus operation has only one pointer , when constructing a leaf , the node label is set to represent the token of that leaf such as identifier or number .

To construct the syntax tree we must add the semantic action into the grammars to represent the syntax tree as array of records, The parser traverses the parse tree depth-first and constructing the syntax tree during parsing for a successive statements according to the grammar , the grammars contains semantic action that execute at the end of a grammar , this grammars called syntax-directed-definition , semantic actions used to write the required output .

The parser traverses the parse tree depth-first and constructing the syntax tree during parsing for a successive statements according to the syntax-directed-definition .

Note at the beginning , this grammar must be add and subtract to value so we must create node that has a value zero and in the multiply and division we must create another node that has a value one .

The syntax tree represented as array of records

syntax tree build bottom-up

3- The semantic analysis phase

The semantic analysis phase analyzes the parse tree for context-sensitive information often called the static semantics, type checking is very important in the semantic analysis, the output of the semantic analysis phase is an annotated parse tree (augmented with semantic actions).

4- Intermediate code Phase .

Three address code is similar to assembly code. It's a linear representation of syntax tree .

The three-address code for the last example is :

temp0=0

temp1=1

temp2=b

temp3=c

temp4=-temp3

temp5=temp1*temp4

temp6=temp2*temp5

temp7=temp6+temp0

Now for each non-terminal we create a method for it and for each terminal we use the match() method the function of this method is checks tokens , it reads the next input token if the look-ahead symbol is matched it get the next token and calls the error() function otherwise .

Note that tokens represented by integer numbers called the token-code.

Remember this parser generate both syntax tree and three-address-code .