Chapter 3 Syntax Analysis

2 Syntax Analysis
Syntax analysis recognizes the syntactic structure of the programming language and transforms a string of tokens into a tree of tokens and syntactic categories.
The parser is the program that performs syntax analysis.

5 Syntax Trees
A syntax tree represents the syntactic structure of the tokens in a program, as defined by the grammar of the programming language.
[Figure: a syntax tree whose nodes are :=, +, id1, id2, *, id3, and 60]

6 Context-Free Grammars (CFG)
A set of terminals: basic symbols (token types) from which strings are formed.
A set of nonterminals: syntactic categories, each of which denotes a set of strings.
A set of productions: rules specifying how the terminals and nonterminals can be combined to form strings.
The start symbol: a distinguished nonterminal that denotes the whole language.

9 Derivations
A derivation step is an application of a production as a rewriting rule, namely, replacing a nonterminal in the string by one of its right-hand sides:
    N → α
    … N … ⇒ … α …
Starting with the start symbol, a sequence of derivation steps is called a derivation:
    S ⇒ … ⇒ α   or   S ⇒* α

11 Left- & Right-Most Derivations
If there is more than one nonterminal in the string, many choices are possible.
A leftmost derivation always chooses the leftmost nonterminal to rewrite.
A rightmost derivation always chooses the rightmost nonterminal to rewrite.
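
The difference between the two strategies can be made concrete with a small sketch. The toy grammar E → E + E | id and the helper names `rewrite` and `derive` are assumptions chosen for illustration; they do not come from the slides. Both derivations apply the same productions but pick a different nonterminal at each step.

```python
# Hypothetical toy grammar (an assumption for illustration): E -> E + E | id
GRAMMAR = {"E": [["E", "+", "E"], ["id"]]}


def rewrite(form, rhs_index, leftmost=True):
    """Apply one derivation step to the leftmost (or rightmost) nonterminal."""
    positions = [i for i, sym in enumerate(form) if sym in GRAMMAR]
    if not positions:
        raise ValueError("no nonterminal left to rewrite")
    pos = positions[0] if leftmost else positions[-1]
    rhs = GRAMMAR[form[pos]][rhs_index]
    return form[:pos] + rhs + form[pos + 1:]


def derive(choices, leftmost=True):
    """Return every sentential form of the derivation, starting from E."""
    form = ["E"]
    steps = [form]
    for rhs_index in choices:
        form = rewrite(form, rhs_index, leftmost)
        steps.append(form)
    return [" ".join(f) for f in steps]


# Production 0 is E -> E + E, production 1 is E -> id.
left = derive([0, 1, 1], leftmost=True)
right = derive([0, 1, 1], leftmost=False)
print(left)   # ['E', 'E + E', 'id + E', 'id + id']
print(right)  # ['E', 'E + E', 'E + id', 'id + id']
```

Both runs end in the same sentence; only the order in which nonterminals are rewritten differs.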

13 Parse Trees
A parse tree is a graphical representation for a derivation that filters out the order of choosing nonterminals for rewriting.
Many derivations may correspond to the same parse tree, but every parse tree has associated with it a unique leftmost and a unique rightmost derivation.

15 Ambiguous Grammars
A grammar is ambiguous if it can derive a string with two different parse trees.
If we use the syntactic structure of a parse tree to interpret the meaning of the string, the two parse trees have different meanings.
Since compilers do use parse trees to derive meaning, we would prefer to have unambiguous grammars.
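
One way to see ambiguity is to count parse trees. The sketch below assumes the classic ambiguous grammar E → E + E | E * E | id, which is an illustrative choice and not taken from the slides, and counts how many distinct parse trees derive a token string. Any count above 1 shows the grammar is ambiguous for that string.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees for tokens under E -> E + E | E * E | id (assumed grammar)."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of parse trees that derive tokens[i:j] from E.
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0
        total = 0
        # Try every operator position k as the root of the tree.
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


print(count_parse_trees("id + id * id".split()))  # 2
```

The two trees correspond to grouping the string as (id + id) * id or as id + (id * id), which would give different results if the tree were used to evaluate the expression.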

19 End-Of-File and Bottom-of-Stack Markers
Parsers must read not only terminal symbols but also the end-of-file marker and the bottom-of-stack marker.
We will use $ to represent the end-of-file marker.
We will also use $ to represent the bottom-of-stack marker.

21 CFG versus RE
Every language defined by a RE can also be defined by a CFG.
Why use REs for lexical syntax?
Lexical rules do not need a notation as powerful as CFGs.
REs are more concise and easier to understand than CFGs.
More efficient lexical analyzers can be constructed from REs than from CFGs.
Using REs provides a way of modularizing the front end into two manageable-sized components.

22 Nonregular Languages
REs can denote only a fixed number of repetitions or an unspecified number of repetitions of one given construct: a^n (for a fixed n) or a*.
A nonregular language: L = {a^n b^n | n ≥ 0}
    S → a S b
    S → ε
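
A minimal sketch of the contrast, assuming single-character tokens: the function `derives_S` (a hypothetical name) follows the two productions for S directly, while the regular expression a*b* can only check the order of the letters, not that their counts match.

```python
import re


def derives_S(s):
    """Return True if s can be derived from S, where S -> a S b | epsilon."""
    if s == "":
        return True                      # S -> epsilon
    if s[0] == "a" and s[-1] == "b":
        return derives_S(s[1:-1])        # S -> a S b
    return False


def matches_regex(s):
    """a*b* accepts any run of a's followed by any run of b's."""
    return re.fullmatch(r"a*b*", s) is not None


print(derives_S("aabb"), matches_regex("aabb"))  # True True
print(derives_S("aab"), matches_regex("aab"))    # False True
```

The string aab is accepted by a*b* but is not in L, because the regular expression has no way to require equal numbers of a's and b's.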

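23 Top-Down Parsing
Construct a parse tree from the root to the leaves using a leftmost derivation.
    S → c A B
    A → a b
    A → a
    B → d
    input: cad
[Figure: four successive parse trees: S with children c A B; then A expanded to a b; then A expanded to a instead; finally B expanded to d]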
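
The search over alternatives can be sketched as a small backtracking recognizer for this grammar. The generator-based design and the names `parse` and `accepts` are assumptions made for illustration. On input cad, the alternative A → a b is tried first and fails at b, after which A → a is tried.

```python
# Grammar from the slide: S -> c A B, A -> a b | a, B -> d
GRAMMAR = {
    "S": [["c", "A", "B"]],
    "A": [["a", "b"], ["a"]],
    "B": [["d"]],
}


def parse(symbols, tokens, pos):
    """Yield every input position reachable after matching `symbols` from `pos`."""
    if not symbols:
        yield pos
        return
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:
        # Nonterminal: try each alternative in order (leftmost derivation).
        for rhs in GRAMMAR[head]:
            yield from parse(rhs + rest, tokens, pos)
    elif pos < len(tokens) and tokens[pos] == head:
        # Terminal: it must match the current input token.
        yield from parse(rest, tokens, pos + 1)
    # Otherwise this branch yields nothing, which is how backtracking happens.


def accepts(text):
    tokens = list(text)
    return any(end == len(tokens) for end in parse(["S"], tokens, 0))


print(accepts("cad"))   # True
print(accepts("cabd"))  # True
print(accepts("cab"))   # False
```

This sketch terminates only because the grammar has no left recursion; a left-recursive rule would make `parse` recurse forever.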

24 Predictive Parsing
Predictive parsing is top-down parsing without backtracking.
That is, given the next token, there is only one production to choose at each derivation step.
    stmt → if expr then stmt else stmt
         | while expr do stmt
         | begin stmt_list end

25 LL(k) Parsing
Predictive parsing is also called LL(k) parsing.
The first L stands for scanning the input from left to right.
The second L stands for producing a leftmost derivation.
The k stands for using k lookahead input symbols to choose among alternative productions at each derivation step.

27 Recursive Descent Parsing
A procedure is associated with each nonterminal of the grammar.
An alternative case in the procedure is associated with each production of that nonterminal.
A match of a token is associated with each terminal in the right-hand side of the production.
A procedure call is associated with each nonterminal in the right-hand side of the production.
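
These four correspondences can be sketched for the stmt grammar used for predictive parsing. The slides do not define expr or stmt_list, so this sketch assumes expr → id and stmt_list → stmt stmt_list | ε; the class name `Parser` and the function `parse_program` are also hypothetical.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # $ marks the end of the input
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, terminal):
        """Consume one terminal, or fail if the lookahead differs."""
        if self.lookahead != terminal:
            raise SyntaxError(f"expected {terminal!r}, found {self.lookahead!r}")
        self.pos += 1

    def stmt(self):
        # One case per production of stmt, selected by the lookahead token.
        if self.lookahead == "if":
            self.match("if")
            self.expr()
            self.match("then")
            self.stmt()
            self.match("else")
            self.stmt()
        elif self.lookahead == "while":
            self.match("while")
            self.expr()
            self.match("do")
            self.stmt()
        elif self.lookahead == "begin":
            self.match("begin")
            self.stmt_list()
            self.match("end")
        else:
            raise SyntaxError(f"unexpected token {self.lookahead!r}")

    def stmt_list(self):
        # Assumed rule: stmt_list -> stmt stmt_list | epsilon
        if self.lookahead in ("if", "while", "begin"):
            self.stmt()
            self.stmt_list()
        # Any other lookahead selects the epsilon production.

    def expr(self):
        # Assumed rule: expr -> id
        self.match("id")


def parse_program(text):
    parser = Parser(text.split())
    parser.stmt()
    parser.match("$")
    return True


print(parse_program("while id do begin if id then begin end else begin end end"))  # True
```

Each method body reads like the right-hand side it implements, which is why recursive descent is usually the easiest predictive parser to write by hand.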

34 First and Follow Sets
The first set of a string α, FIRST(α), is the set of terminals that can begin the strings derived from α. If α ⇒* ε, then ε is also in FIRST(α).
The follow set of a nonterminal X, FOLLOW(X), is the set of terminals that can immediately follow X.

35 Computing First Sets
If X is a terminal, then FIRST(X) is {X}.
If X is a nonterminal and X → ε is a production, then add ε to FIRST(X).
If X is a nonterminal and X → Y1 Y2 ... Yk is a production, then add a to FIRST(X) if for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), ..., FIRST(Yi-1). If ε is in FIRST(Yj) for all j, then add ε to FIRST(X).
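
The rules can be applied repeatedly until no FIRST set grows. The sketch below assumes the usual textbook expression grammar (E, E', T, T', F), which is not taken from the slides, and represents an ε-production as an empty right-hand side. The function names are hypothetical.

```python
EPS = "ε"

# Assumed example grammar:
#   E -> T E'    E' -> + T E' | ε    T -> F T'    T' -> * F T' | ε    F -> ( E ) | id
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}


def first_of_string(symbols, first):
    """FIRST of a string of grammar symbols, given FIRST of each nonterminal."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}   # terminal: {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result            # this symbol cannot vanish, so stop here
    result.add(EPS)                  # every symbol can vanish, or the string is empty
    return result


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                   # iterate until no set grows
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_string(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


FIRST = compute_first(GRAMMAR)
for nt in GRAMMAR:
    print(nt, sorted(FIRST[nt]))
```

For this grammar the sketch gives FIRST(E) = FIRST(T) = FIRST(F) = {(, id}, FIRST(E') = {+, ε}, and FIRST(T') = {*, ε}.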

37 Computing Follow Sets
Place $ in FOLLOW(S), where S is the start symbol and $ is the end-of-file marker.
If there is a production A → α B β, then everything in FIRST(β) except for ε is placed in FOLLOW(B).
If there is a production A → α B, or A → α B β where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
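
The three rules can be iterated to a fixed point in the same way as for FIRST. This sketch again assumes the textbook expression grammar (not from the slides) and includes a compact FIRST computation so that it runs on its own; all function names are hypothetical.

```python
EPS, END = "ε", "$"

# Assumed example grammar; an empty list is an ε-production.
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}


def first_of_string(symbols, first):
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_string(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar, start):
    first = compute_first(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)                          # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                for i, b in enumerate(rhs):
                    if b not in grammar:
                        continue                    # FOLLOW is defined for nonterminals only
                    beta_first = first_of_string(rhs[i + 1:], first)
                    new = beta_first - {EPS}        # rule 2: FIRST(beta) without ε
                    if EPS in beta_first:           # rule 3: beta is empty or can vanish
                        new |= follow[a]
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow


FOLLOW = compute_follow(GRAMMAR, "E")
for nt in GRAMMAR:
    print(nt, sorted(FOLLOW[nt]))
```

For this grammar the sketch gives FOLLOW(E) = FOLLOW(E') = {), $}, FOLLOW(T) = FOLLOW(T') = {+, ), $}, and FOLLOW(F) = {*, +, ), $}.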

39 Table-Driven Predictive Parsing
Input. Grammar G.
Output. Parsing table M.
Method.
1. For each production A → α of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A, a].
3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].
4. Make each undefined entry of M be error.
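
A minimal sketch of the method, followed by a stack-based driver that uses the resulting table. It assumes the small grammar S → a S b | ε with FIRST and FOLLOW written out by hand; the dictionary layout and the function names are illustrative choices, not part of the slides.

```python
EPS, END = "ε", "$"

# Assumed grammar: S -> a S b | ε, with FIRST and FOLLOW worked out by hand.
PRODUCTIONS = [("S", ("a", "S", "b")), ("S", ())]
FIRST = {"S": {"a", EPS}}
FOLLOW = {"S": {"b", END}}


def first_of_string(symbols):
    result = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})           # a terminal's FIRST is itself
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def build_table():
    table = {}
    for head, rhs in PRODUCTIONS:                   # step 1
        rhs_first = first_of_string(rhs)
        targets = rhs_first - {EPS}                 # step 2
        if EPS in rhs_first:                        # step 3 (FOLLOW may contain $)
            targets |= FOLLOW[head]
        for terminal in targets:
            if (head, terminal) in table:
                raise ValueError("grammar is not LL(1)")
            table[head, terminal] = rhs
    return table                                    # step 4: a missing key means error


def ll1_parse(tokens, table, start="S"):
    stack = [END, start]                            # $ is the bottom-of-stack marker
    tokens = list(tokens) + [END]                   # $ is also the end-of-file marker
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in FIRST:                            # nonterminal: consult the table
            rhs = table.get((top, look))
            if rhs is None:
                return False
            stack.extend(reversed(rhs))             # push the right-hand side, leftmost on top
        elif top == look:                           # terminal or $: must match the input
            pos += 1
        else:
            return False
    return pos == len(tokens)


TABLE = build_table()
print(ll1_parse("aabb", TABLE))  # True
print(ll1_parse("aab", TABLE))   # False
```

Raising an error when two productions land in the same cell is one simple way to detect that a grammar is not LL(1).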

44 Left Recursive Grammars
A grammar is left recursive if it has a nonterminal A such that A ⇒+ A α for some string α.
Left recursive grammars are not LL(1), because the productions
    A → A α
    A → β
cause FIRST(A α) ∩ FIRST(β) ≠ ∅.
We can often transform such grammars into LL(1) form by eliminating left recursion.
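
A minimal sketch of the standard transformation for immediate left recursion, which rewrites A → A α | β as A → β A' and A' → α A' | ε. The function name and the convention of naming the new nonterminal by appending a prime are assumptions, and the sketch does not handle indirect left recursion.

```python
def eliminate_immediate_left_recursion(nt, alternatives):
    """Rewrite A -> A alpha | beta as A -> beta A', A' -> alpha A' | epsilon."""
    # The alphas: what follows A in each left-recursive alternative.
    recursive = [rhs[1:] for rhs in alternatives if rhs and rhs[0] == nt]
    # The betas: alternatives that do not start with A.
    others = [rhs for rhs in alternatives if not rhs or rhs[0] != nt]
    if not recursive:
        return {nt: alternatives}            # nothing to do
    new_nt = nt + "'"                        # assumed to be an unused name
    return {
        nt: [beta + [new_nt] for beta in others],
        new_nt: [alpha + [new_nt] for alpha in recursive] + [[]],   # [] is epsilon
    }


# Example: E -> E + T | T
result = eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]])
print(result)  # {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], []]}
```

The rewritten grammar generates the same strings, but the repetition now happens on the right, so a top-down parser no longer loops on E.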

49 Left Factoring
A grammar is not LL(1) if two productions of a nonterminal A have a nontrivial common prefix. For example, if α ≠ ε and A → α β1 | α β2, then FIRST(α β1) ∩ FIRST(α β2) ≠ ∅.
We can often transform such grammars into LL(1) form by performing left factoring:
    A → α A'
    A' → β1 | β2
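
One round of left factoring can be sketched as follows. The grouping by first symbol, the naming of new nonterminals with primes, and the example grammar for stmt are all assumptions made for illustration. Alternatives that still share a prefix after one round would need another pass.

```python
def common_prefix(sequences):
    """Longest list of symbols shared at the start of every sequence."""
    prefix = []
    for column in zip(*sequences):
        if any(sym != column[0] for sym in column):
            break
        prefix.append(column[0])
    return prefix


def left_factor(nt, alternatives):
    """One round of left factoring on the alternatives of nonterminal nt."""
    groups = {}
    for rhs in alternatives:
        groups.setdefault(rhs[0] if rhs else None, []).append(rhs)
    result = {nt: []}
    for first_symbol, group in groups.items():
        if first_symbol is None or len(group) == 1:
            result[nt].extend(group)                 # nothing to factor
            continue
        prefix = common_prefix(group)                # alpha
        new_nt = nt + "'" * len(result)              # A', A'', ... (hypothetical naming)
        result[nt].append(prefix + [new_nt])         # A -> alpha A'
        result[new_nt] = [rhs[len(prefix):] for rhs in group]   # A' -> beta1 | beta2
    return result


# Hypothetical example: stmt -> id := expr | id ( args ) | other
factored = left_factor("stmt", [
    ["id", ":=", "expr"],
    ["id", "(", "args", ")"],
    ["other"],
])
print(factored)
```

For the example, the result is stmt → id stmt' | other and stmt' → := expr | ( args ), so one token of lookahead is now enough to choose between the alternatives of each nonterminal.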

52 LR(k) Parsing
The L stands for scanning the input from left to right.
The R stands for producing a rightmost derivation.
The k stands for using k lookahead input symbols to choose among alternative productions at each derivation step.

55 LL(k) versus LR(k)
LL(k) parsing must predict which production to use after seeing only the first k tokens of the right-hand side.
LR(k) parsing is able to postpone the decision until it has seen tokens corresponding to the entire right-hand side and k more tokens beyond.
LR(k) parsing thus can handle more grammars than LL(k) parsing.

69 Functions
yyparse(): the parser function.
yylex(): the lexical analyzer function. Bison recognizes any non-positive value returned by yylex() as indicating the end of the input.

70 Variables
yylval: the attribute value of a token. Its default type is int, and it can be declared to hold multiple types in the first section using
    %union {
        int ival;
        double dval;
    }
Tokens with attribute values can be declared as
    %token <ival> intcon
    %token <dval> doublecon

71 Conflict Resolutions
A reduce/reduce conflict is resolved by choosing the production listed first.
A shift/reduce conflict is resolved in favor of shift.
A mechanism for assigning precedences and associativities to terminals is also provided.

72 Precedence and Associativity
The precedence and associativity of operators are declared simultaneously:
    %nonassoc '<'    /* lowest */
    %left '+' '-'
    %right '^'       /* highest */
The precedence of a rule is determined by the precedence of its rightmost terminal.
The precedence of a rule can be modified by adding %prec <terminal> to its right end.

77 From CFG to NPDA
An LR(0) item of a grammar G is a production of G with a dot at some position of the right-hand side, A → α • β.
The production A → X Y Z yields the following four LR(0) items:
    A → • X Y Z
    A → X • Y Z
    A → X Y • Z
    A → X Y Z •
An LR(0) item represents a state in an NPDA, indicating how much of a production we have seen at a given point in the parsing process.

80 From NPDA to DPDA
There are two functions performed on sets of LR(0) items (states).
The function closure(I) adds more items to I when there is a dot to the left of a nonterminal.
The function goto(I, X) moves the dot past the symbol X in all items in I in which the dot is immediately to the left of X.

81 The Closure Function
closure(I) =
    repeat
        for any item A → α • X β in I
            for any production X → γ
                I = I ∪ { X → • γ }
    until I does not change
    return I
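
The closure computation can be sketched directly in Python. The item encoding as a (head, right-hand side, dot position) tuple and the small augmented grammar S' → E, E → E + T | T, T → id are assumptions chosen for illustration.

```python
# Hypothetical augmented grammar: S' -> E, E -> E + T | T, T -> id
GRAMMAR = {
    "S'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("id",)],
}


def closure(items):
    """Add X -> . gamma for every nonterminal X that appears right after a dot."""
    items = set(items)
    changed = True
    while changed:                                   # repeat ... until I does not change
        changed = False
        for head, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for gamma in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], gamma, 0)      # X -> . gamma
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)


start_state = closure({("S'", ("E",), 0)})
for head, rhs, dot in sorted(start_state):
    print(head, "->", " ".join(rhs[:dot]), "•", " ".join(rhs[dot:]))
```

Starting from the single item S' → • E, the closure contains four items: S' → • E, E → • E + T, E → • T, and T → • id. Returning a frozenset lets a state be used as a dictionary key or a set member later on.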

85 The Subset Construction Function
subset-construction(cfg) =
    initialize T to { closure({ S' → • S }) }
    repeat
        for each state I in T and each symbol X
            let J be goto(I, X)
            if J is not empty and not in T then
                T = T ∪ { J }
    until T does not change
    return T
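
A runnable sketch of goto and the subset construction, under the same assumptions as before: items are (head, right-hand side, dot position) tuples, and the augmented grammar S' → E, E → E + T | T, T → id is a hypothetical example. The closure function is repeated so that the block runs on its own.

```python
# Hypothetical augmented grammar: S' -> E, E -> E + T | T, T -> id
GRAMMAR = {
    "S'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("id",)],
}


def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for gamma in GRAMMAR[rhs[dot]]:
                    if (rhs[dot], gamma, 0) not in items:
                        items.add((rhs[dot], gamma, 0))
                        changed = True
    return frozenset(items)


def goto(items, symbol):
    """Move the dot past `symbol` wherever it sits right after the dot, then close."""
    moved = {(head, rhs, dot + 1) for head, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved)


def subset_construction():
    symbols = {sym for alts in GRAMMAR.values() for rhs in alts for sym in rhs}
    states = {closure({("S'", GRAMMAR["S'"][0], 0)})}
    changed = True
    while changed:                                   # repeat ... until T does not change
        changed = False
        for state in list(states):
            for symbol in symbols:
                target = goto(state, symbol)
                if target and target not in states:  # J is not empty and not in T
                    states.add(target)
                    changed = True
    return states


STATES = subset_construction()
print(len(STATES))  # 6
```

For this grammar the construction yields six states. A worklist that visits each state once would avoid rescanning, but the loop above mirrors the repeat-until form of the pseudocode.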

90 SLR(1) Parsing Table Generation
SLR(cfg) =
    for each state I in subset-construction(cfg)
        if A → α • a β in I and goto(I, a) = J for a terminal a then
            action[I, a] = "shift J"
        if A → α • in I and A ≠ S' then
            action[I, a] = "reduce A → α" for all a in Follow(A)
        if S' → S • in I then
            action[I, $] = "accept"
        if A → α • X β in I and goto(I, X) = J for a nonterminal X then
            goto[I, X] = J
    all other entries in action and goto are made error
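
The table generation can be sketched end to end, together with a small shift-reduce driver that uses the table. Everything specific here is an assumption for illustration: the augmented grammar S' → E, E → E + T | T, T → id, the hand-computed FOLLOW sets, the tuple encoding of items and actions, and the function names. States are built with a worklist rather than a repeat-until loop, and a missing table entry stands for error.

```python
END = "$"
# Hypothetical augmented grammar: S' -> E, E -> E + T | T, T -> id
GRAMMAR = {
    "S'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("id",)],
}
FOLLOW = {"E": {"+", END}, "T": {"+", END}}  # worked out by hand for this grammar


def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for gamma in GRAMMAR[rhs[dot]]:
                    if (rhs[dot], gamma, 0) not in items:
                        items.add((rhs[dot], gamma, 0))
                        changed = True
    return frozenset(items)


def goto(items, symbol):
    return closure({(h, rhs, dot + 1) for h, rhs, dot in items
                    if dot < len(rhs) and rhs[dot] == symbol})


START_STATE = closure({("S'", ("E",), 0)})


def lr0_states():
    symbols = {s for alts in GRAMMAR.values() for rhs in alts for s in rhs}
    states, pending = {START_STATE}, [START_STATE]
    while pending:
        state = pending.pop()
        for symbol in symbols:
            target = goto(state, symbol)
            if target and target not in states:
                states.add(target)
                pending.append(target)
    return states


def slr_table():
    action, goto_table = {}, {}

    def set_action(state, terminal, entry):
        if action.setdefault((state, terminal), entry) != entry:
            raise ValueError("conflict: grammar is not SLR(1)")

    for state in lr0_states():
        for head, rhs, dot in state:
            if dot < len(rhs):                       # dot before a symbol
                symbol, target = rhs[dot], goto(state, rhs[dot])
                if symbol in GRAMMAR:
                    goto_table[state, symbol] = target
                else:
                    set_action(state, symbol, ("shift", target))
            elif head == "S'":                       # S' -> S .
                set_action(state, END, ("accept",))
            else:                                    # A -> alpha .
                for terminal in FOLLOW[head]:
                    set_action(state, terminal, ("reduce", head, rhs))
    return action, goto_table


def slr_parse(tokens):
    action, goto_table = slr_table()
    stack, pos = [START_STATE], 0
    tokens = list(tokens) + [END]
    while True:
        entry = action.get((stack[-1], tokens[pos]))
        if entry is None:                            # undefined entry: error
            return False
        if entry[0] == "shift":
            stack.append(entry[1])
            pos += 1
        elif entry[0] == "reduce":
            _, head, rhs = entry
            del stack[len(stack) - len(rhs):]        # pop one state per right-hand-side symbol
            stack.append(goto_table[stack[-1], head])
        else:
            return True                              # accept


print(slr_parse("id + id + id".split()))  # True
print(slr_parse("id +".split()))          # False
```

Raising an error when two different actions compete for the same cell is one simple way to report that a grammar is not SLR(1); a generator such as Bison would instead apply its conflict resolution rules.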

93 LR(1) Items
An LR(1) item of a grammar G is a pair, (A → α • β, a), of an LR(0) item A → α • β and a lookahead symbol a.
The lookahead has no effect in an LR(1) item of the form (A → α • β, a), where β is not ε.
An LR(1) item of the form (A → α •, a) calls for a reduction by A → α only if the next input symbol is a.

98 The Subset Construction Function
subset-construction(cfg) =
    initialize T to { closure({ (S' → • S, $) }) }
    repeat
        for each state I in T and each symbol X
            let J be goto(I, X)
            if J is not empty and not in T then
                T = T ∪ { J }
    until T does not change
    return T