Lexical Analysis

Lexical Analyzer in Perspective

[Diagram: source program -> lexical analyzer -> parser; the lexical analyzer returns a token, the parser requests the next one via "get next token", and both components consult the symbol table]

Important issue: what are the responsibilities of each box? Focus on the lexical analyzer and the parser.

Role of the Lexical Analyzer

Identify the words: lexical analysis converts a stream of characters (the input program) into a stream of tokens. This is also called scanning or tokenizing.

Identify the sentences: parsing.

Derive the structure of sentences: construct parse trees from the stream of tokens.

[Diagram: the Input supplies characters to the Scanner via Next_char(); the Scanner supplies tokens to the Parser via Next_token(); both the Scanner and the Parser consult the Symbol Table]

Interaction of Lexical Analyzer with Parser

[Diagram: the Input supplies characters to the Scanner via Next_char(); the Scanner supplies tokens to the Parser via Next_token(); both the Scanner and the Parser consult the Symbol Table]

The lexical analyzer is often a subroutine of the parser.

Secondary Tasks of the Lexical Analyzer

- Strip out comments and white space from the source
- Correlate error messages with the source program
- Preprocessing may be performed as lexical analysis takes place
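As a concrete sketch of the first two tasks, the helper below (a hypothetical skip_blanks_and_comments, assuming C-style // comments) discards white space and comments while counting newlines, so that later error messages can be correlated with source lines:

```c
#include <ctype.h>

/* Hypothetical helper: advance p past white space and "//" comments,
 * incrementing *line on each newline so later error messages can be
 * correlated with the source.  Returns the first significant character. */
static const char *skip_blanks_and_comments(const char *p, int *line) {
    for (;;) {
        if (*p == '\n') {                        /* count lines for error reports */
            (*line)++;
            p++;
        } else if (isspace((unsigned char)*p)) {
            p++;
        } else if (p[0] == '/' && p[1] == '/') { /* comment runs to end of line */
            while (*p && *p != '\n')
                p++;
        } else {
            return p;                            /* first significant character */
        }
    }
}
```

A real scanner would also handle block comments and end-of-file in the middle of a comment; this sketch only shows the shape of the task.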

Issues in lexical analysis

- Simplicity/Modularity: conventions about "words" are often different from conventions about "sentences".
- Efficiency: the word-identification problem has a much more efficient solution than the sentence-identification problem; specialized buffering techniques can be used for reading input characters and processing tokens.
- Portability: input-alphabet peculiarities and other device-specific anomalies can be restricted to the lexical analyzer.

Introducing Basic Terminology

What are the major terms for lexical analysis?

TOKEN

A classification for a common set of strings, e.g., tok_integer_constant.

PATTERN

The rule that characterizes the set of strings for a token, e.g., a digit followed by zero or more digits.

LEXEME

The actual sequence of characters that matches a pattern and is classified by a token, e.g., 32894.

Introducing Basic Terminology

Token      Sample Lexemes         Informal Description of Pattern
const      const                  const
if         if                     if
relation   <, <=, =, <>, >, >=    < or <= or = or <> or >= or >
id         pi, count, D2          letter followed by letters and digits
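The three terms can be seen together in a small sketch: the function below (hypothetical name match_integer_constant) applies the pattern "digit followed by zero or more digits" at the front of the input and, on success, extracts the lexeme that would be classified as an integer-constant token:

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical sketch: apply the integer-constant pattern ("digit followed
 * by zero or more digits") at the front of the input.  On a match, copy
 * the lexeme into buf and return its length; otherwise return 0. */
static int match_integer_constant(const char *input, char *buf, size_t bufsz) {
    size_t n = 0;
    while (isdigit((unsigned char)input[n]))  /* extend the lexeme greedily */
        n++;
    if (n == 0 || n + 1 > bufsz)              /* pattern did not match (or no room) */
        return 0;
    memcpy(buf, input, n);                    /* the lexeme: the actual characters */
    buf[n] = '\0';
    return (int)n;
}
```

For the input 32894; the pattern matches, the token is "integer constant", and the lexeme is the string 32894.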

Attributes for Tokens

Tokens influence parsing decisions; attributes influence the translation of tokens. A token usually has only a single attribute: a pointer to the symbol-table entry. Other attributes (e.g., line number, lexeme) can be stored in the symbol table. For example, an identifier E is returned as:

<id, pointer to symbol-table entry for E>
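A minimal sketch of this arrangement, with illustrative names and a fixed-size array standing in for a real symbol table:

```c
#include <string.h>

/* Hypothetical sketch: a token whose single attribute is a pointer into
 * the symbol table; other attributes (line number, lexeme) live in the
 * table entry itself.  All names here are illustrative. */
struct symtab_entry {
    char name[32];   /* the lexeme */
    int  line;       /* another attribute kept in the table */
};

enum token_kind { TOK_ID, TOK_INT, TOK_EOF };

struct token {
    enum token_kind     kind;
    struct symtab_entry *entry;  /* the attribute: symbol-table entry, or NULL */
};

static struct symtab_entry table[100];
static int table_len = 0;

/* Install a lexeme (e.g. "E") and return <id, pointer to its entry>. */
static struct token make_id_token(const char *lexeme, int line) {
    struct symtab_entry *e = &table[table_len++];
    strncpy(e->name, lexeme, sizeof e->name - 1);
    e->name[sizeof e->name - 1] = '\0';
    e->line = line;
    struct token t = { TOK_ID, e };
    return t;
}
```

A production symbol table would of course hash on the lexeme and reuse existing entries; the point here is only the token/attribute split.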

Lexical Errors

Few errors can be caught by the lexical analyzer; most errors tend to be typos that go unnoticed by the programmer. For example, the intended

    return 1.23;

mistyped as

    retunn 1,23;

still results in a sequence of legal tokens:

    <ID, retunn> <INT, 1> <COMMA> <INT, 23> <SEMICOLON>

There is no lexical error, but problems arise during parsing! Another example: fi (a == f(x))
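To see why such typos slip through, the sketch below (a hypothetical next_token_kind covering only identifiers, integers, and single-character punctuation) lexes retunn 1,23; into exactly the legal token sequence shown above:

```c
#include <ctype.h>

/* Hypothetical sketch: classify the next token in the input.  Returns a
 * one-character code: 'I' identifier, 'N' integer constant, or the
 * punctuation character itself.  *pp is advanced past the lexeme. */
static char next_token_kind(const char **pp) {
    const char *p = *pp;
    while (isspace((unsigned char)*p))
        p++;
    char kind;
    if (isalpha((unsigned char)*p)) {        /* letter (letter | digit)* */
        while (isalnum((unsigned char)*p))
            p++;
        kind = 'I';
    } else if (isdigit((unsigned char)*p)) { /* digit digit* */
        while (isdigit((unsigned char)*p))
            p++;
        kind = 'N';
    } else {
        kind = *p++;                         /* single-character token */
    }
    *pp = p;
    return kind;
}
```

Running it over "retunn 1,23;" yields identifier, integer, comma, integer, semicolon: each lexeme matches some pattern, so the scanner reports no error.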

In what Situations do Errors Occur?

The lexical analyzer is unable to proceed because none of the patterns for tokens matches a prefix of the remaining input.

Recovery from lexical errors

Panic mode recovery: delete successive characters from the input until the lexical analyzer can find a well-formed token. This may confuse the parser, but the parser will detect the resulting syntax errors and get straightened out (hopefully!).

Other possible error-recovery actions

- Deleting an extraneous character
- Inserting a missing character
- Replacing an incorrect character with a correct one
- Transposing two adjacent characters

Input buffering: reading the source one character at a time is expensive, and the scanner may need to look ahead past the end of a lexeme.

Solution: use a pair of buffers of N characters each.

[Diagram: buffer pair, two adjacent N-character halves holding the input stream (e.g., ... C * * 2 \0), scanned by the lexeme_beginning and forward pointers]

Deficiency: every advance of the forward pointer requires two tests: has the end of a buffer been reached, and if so, which buffer?

Code (forward scans ahead; lexeme_beginning marks the start of the current lexeme):

if forward at end of buffer #1 then
    reload buffer #2;
    forward = forward + 1;
else if forward at end of buffer #2 then
    reload buffer #1;
    move forward to the beginning of buffer #1;
else
    forward = forward + 1;
end if
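The pseudocode above can be fleshed out as the following sketch, with refill standing in for the real read-from-source routine and a deliberately tiny N for illustration:

```c
#include <string.h>

#define N 8  /* tiny half-buffer for illustration; real scanners use e.g. 4096 */

/* Hypothetical sketch of the two-test buffer-pair advance: buf holds two
 * adjacent N-character halves, and every advance of forward needs BOTH
 * tests ("at end of a half?" and "which half?"), which is the deficiency
 * the sentinel technique removes. */
static char buf[2 * N];
static const char *src;  /* stands in for the remaining source text */

static void refill(char *half) {       /* read next N bytes into one half */
    size_t n = strlen(src);
    if (n > N) n = N;
    memcpy(half, src, n);
    if (n < N) half[n] = '\0';         /* mark end of input */
    src += n;
}

static char *advance(char *forward) {
    if (forward == buf + N - 1) {            /* at end of buffer #1 */
        refill(buf + N);                     /* reload buffer #2 */
        return forward + 1;
    } else if (forward == buf + 2 * N - 1) { /* at end of buffer #2 */
        refill(buf);                         /* reload buffer #1 */
        return buf;                          /* wrap to start of buffer #1 */
    } else {
        return forward + 1;                  /* the common case */
    }
}
```

Note that even the common case pays for two pointer comparisons on every single character; that overhead motivates the sentinel technique below.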

Sentinels Technique

Use sentinels to reduce testing: choose some character that occurs rarely in most inputs, e.g., \0, and place one at the end of each buffer half.

[Diagram: buffer pair with sentinels; the two N-character halves each end in \0, e.g., ... M * \0 | C * * 2 \0, scanned by the lexeme_beginning and forward pointers]

forward++;
if *forward == '\0' then
    if forward at end of buffer #1 then
        Read next N bytes into buffer #2;
        forward = address of first char of buffer #2;
    else if forward at end of buffer #2 then
        Read next N bytes into buffer #1;
        forward = address of first char of buffer #1;
    else
        // do nothing; a real \0 occurs in the input
    end if
end if
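A sketch of the sentinel scheme in the same style (the names and tiny N are illustrative): each half is followed by a '\0' sentinel slot, so the common-case advance now performs only one test, *forward == '\0'.

```c
#include <string.h>

#define N 8  /* tiny half-buffer for illustration */

/* Hypothetical sketch: two N-character halves, each followed by a
 * dedicated sentinel slot holding '\0'.  Only when the sentinel is hit
 * do we ask which half ended, or whether a real '\0' (eof) occurred. */
static char sbuf[2 * N + 2];     /* half #1, sentinel, half #2, sentinel */
static const char *ssrc;         /* stands in for the remaining source text */

static void sfill(char *half) {  /* read next N bytes into one half */
    size_t n = strlen(ssrc);
    if (n > N) n = N;
    memcpy(half, ssrc, n);
    half[n] = '\0';              /* sentinel (also marks real end of input) */
    ssrc += n;
}

static char *sadvance(char *forward) {
    forward++;
    if (*forward == '\0') {                       /* the ONLY common-case test */
        if (forward == sbuf + N) {                /* sentinel of buffer #1 */
            sfill(sbuf + N + 1);                  /* read next N bytes into #2 */
            forward = sbuf + N + 1;
        } else if (forward == sbuf + 2 * N + 1) { /* sentinel of buffer #2 */
            sfill(sbuf);                          /* read next N bytes into #1 */
            forward = sbuf;
        }
        /* else: a real \0 occurs in the input; do nothing */
    }
    return forward;
}
```

Compared with the two-test version, the buffer-boundary comparisons now run only on the rare sentinel hit rather than on every character.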