3. Parsing

1
3. Parsing
  • Syntax: the way in which words are put together
    to form phrases, clauses, or sentences.

2
R.E.s Are Not Sufficient
  • The form 28+301+9 can be defined by the R.E.
  • digits = [0-9]+
  • sum = (digits "+")* digits
  • The expressions (109+23), 61, (1+(250+3)) can be
    defined by
  • digits = [0-9]+
  • sum = expr "+" expr
  • expr = ( sum ) | digits
  • But it is impossible for an FA to recognize
    balanced parentheses!

3
R.E.s Are Not Sufficient
  • The additional expressive power gained by
    recursion is just what we need for parsing.
  • What we are left with is a very simple notation,
    called context-free grammars (CFGs).
  • Just as REs can be used to define lexical
    structure in a static, declarative way, CFGs
    define syntactic structure declaratively.
  • But we will need something more powerful than an
    FA to parse languages described by grammars.

4
Context-Free Grammars
  • A language is a set of strings; each string is a
    finite sequence of symbols taken from a finite
    alphabet.
  • For parsing,
  • the strings are source programs,
  • the symbols are lexical tokens, and
  • the alphabet is the set of token types returned
    by the lexical analyzer.

5
Context-Free Grammars
  • A context-free grammar describes a language.
  • A grammar has a set of productions of the form
  • symbol → symbol symbol … symbol
  • Each symbol is either
  • a terminal, or
  • a nonterminal
  • A nonterminal is distinguished as the start
    symbol of the grammar.

6
Context-Free Grammars
Terminal symbols: id print num , + ( ) := ;
An example of a grammar:
1 S → S ; S    2 S → id := E    3 S → print ( L )
4 E → id    5 E → num    6 E → E + E    7 E → ( S , E )
8 L → E    9 L → L , E
Nonterminal symbols: S E L
One sentence in the language of the grammar:
id := num ; id := id + ( id := num + num , id )
The source text might have been:
a := 7 ; b := c + ( d := 5 + 6 , d )
7
Context-Free Grammars
  • Derivation
  • start with the start symbol, then repeatedly
    replace any nonterminal by one of its RHS.

S → S ; S
  → S ; id := E
  → id := E ; id := E
  → id := num ; id := E
  → id := num ; id := E + E
  → id := num ; id := E + ( S , E )
  → id := num ; id := id + ( S , E )
  → id := num ; id := id + ( id := E , E )
  → id := num ; id := id + ( id := E + E , E )
  → id := num ; id := id + ( id := E + E , id )
  → id := num ; id := id + ( id := num + E , id )
  → id := num ; id := id + ( id := num + num , id )
1 S → S ; S    2 S → id := E    3 S → print ( L )
4 E → id    5 E → num    6 E → E + E    7 E → ( S , E )
8 L → E    9 L → L , E
8
Context-Free Grammars
  • There are many different derivations of the same
    sentence.
  • A leftmost derivation is one in which the
    leftmost nonterminal symbol is always the one
    expanded
  • in a rightmost derivation, the rightmost
    nonterminal is always next to be expanded.
  • The previous derivation is neither leftmost nor
    rightmost.

9
Parse Trees
  • A parse tree is made by connecting each symbol in
    a derivation to the one from which it was derived.

Two different derivations can have the same
parse tree.
10
Ambiguous Grammars
  • A grammar is ambiguous if it can derive a
    sentence with two different parse trees.

E → id   E → num   E → E * E   E → E / E
E → E + E   E → E - E   E → ( E )
Ambiguous !
11
Ambiguous Grammars
  • Ambiguous grammars are problematic for compiling.
  • Let us find an unambiguous grammar that accepts
    the same language as the previous grammar, where
  • * binds tighter than + or -, i.e., has higher
    precedence, and
  • each operator associates to the left.

12
Ambiguous Grammars
This grammar can never produce the following
parse trees:
E → E + T   E → E - T   E → T
T → T * F   T → T / F   T → F
F → id   F → num   F → ( E )
[Figure: two parse trees that this grammar cannot
produce.]
13
End-Of-File Marker
  • Parsers must read not only terminal symbols, but
    also the end-of-file marker, denoted $.
  • To indicate that $ must come after a complete
    S-phrase, we augment the grammar with a new start
    symbol S′ and a new production S′ → S $

14
Predictive Parsing
A recursive-descent parser:

final int IF=1, THEN=2, ELSE=3, BEGIN=4, END=5,
          PRINT=6, SEMI=7, NUM=8, EQ=9;
int tok = getToken();
void advance() { tok = getToken(); }
void eat(int t) { if (tok == t) advance(); else error(); }
void S() { switch (tok) {
  case IF:    eat(IF); E(); eat(THEN); S();
              eat(ELSE); S(); break;
  case BEGIN: eat(BEGIN); S(); L(); break;
  case PRINT: eat(PRINT); E(); break;
  default:    error();
}}
void L() { switch (tok) {
  case END:  eat(END); break;
  case SEMI: eat(SEMI); S(); L(); break;
  default:   error();
}}
void E() { eat(NUM); eat(EQ); eat(NUM); }

S → if E then S else S    S → begin S L    S → print E
L → end    L → ; S L    E → num = num
15
Predictive Parsing
A conflict here:

void S() { E(); eat(EOF); }
void E() { switch (tok) {
  case ?: E(); eat(PLUS); T(); break;
  case ?: E(); eat(MINUS); T(); break;
  case ?: T(); break;
  default: error();
}}
void T() { switch (tok) {
  case ?: T(); eat(TIMES); F(); break;
  case ?: T(); eat(DIV); F(); break;
  case ?: F(); break;
  default: error();
}}

S → E $    E → E + T    E → E - T    E → T
T → T * F    T → T / F    T → F    F → id
F → num    F → ( E )
16
Predictive Parsing
  • Recursive-descent, or predictive, parsing works
    only on grammars where the first terminal symbol
    of each subexpression provides enough information
    to choose which production to use.
  • We introduce the notion of FIRST sets to resolve
    the conflict problem.

17
FIRST and FOLLOW Sets
  • Given a string γ of terminal and nonterminal
    symbols, FIRST(γ) is the set of all terminal
    symbols that can begin any string derived from γ.
  • For example, let γ = T * F.
  • Any string of terminal symbols derived from γ
    must start with id, num, or (.
  • Thus FIRST(T * F) = {id, num, (}.

18
FIRST and FOLLOW Sets
  • If two different productions X → γ1 and X → γ2
    have the same left-hand side symbol and their
    right-hand sides have overlapping FIRST sets, then
    the grammar cannot be parsed using predictive
    parsing.

19
FIRST and FOLLOW Sets
  • With respect to a particular grammar, given a
    string γ of terminals and nonterminals:
  • nullable(X) is true if X can derive the empty
    string.
  • FIRST(γ) is the set of terminals that can begin
    strings derived from γ.
  • FOLLOW(X) is the set of terminals that can
    immediately follow X.

20
FIRST and FOLLOW Sets
  • A precise definition of FIRST, FOLLOW, and
    nullable is that they are the smallest sets for
    which these properties hold:

For each terminal symbol Z, FIRST(Z) = {Z}.
For each production X → Y1 Y2 … Yk,
for each i from 1 to k, each j from i+1 to k:
  if all the Yi are nullable
    then nullable(X) = true
  if Y1 … Yi-1 are all nullable
    then FIRST(X) = FIRST(X) ∪ FIRST(Yi)
  if Yi+1 … Yk are all nullable
    then FOLLOW(Yi) = FOLLOW(Yi) ∪ FOLLOW(X)
  if Yi+1 … Yj-1 are all nullable
    then FOLLOW(Yi) = FOLLOW(Yi) ∪ FIRST(Yj)
21
FIRST and FOLLOW Sets
  • Algorithm to compute FIRST, FOLLOW and nullable

for each terminal symbol Z
  FIRST(Z) ← {Z}
repeat
  for each production X → Y1 Y2 … Yk
    for each i from 1 to k, each j from i+1 to k:
      if all the Yi are nullable
        then nullable(X) ← true
      if Y1 … Yi-1 are all nullable
        then FIRST(X) ← FIRST(X) ∪ FIRST(Yi)
      if Yi+1 … Yk are all nullable
        then FOLLOW(Yi) ← FOLLOW(Yi) ∪ FOLLOW(X)
      if Yi+1 … Yj-1 are all nullable
        then FOLLOW(Yi) ← FOLLOW(Yi) ∪ FIRST(Yj)
until FIRST, FOLLOW, and nullable did not change
in this iteration.
22
FIRST and FOLLOW Sets
Z → d   Z → X Y Z   Y → ε   Y → c   X → Y   X → a
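The iterative algorithm above can be run on this grammar. Below is a minimal Java sketch (class and variable names are my own, not from the slides); it computes nullable = {X, Y}, FIRST(Z) = {a, c, d}, and FOLLOW(X) = FOLLOW(Y) = {a, c, d}:

```java
import java.util.*;

// A sketch of the fixed-point computation of nullable, FIRST and FOLLOW
// for the grammar  Z -> d | X Y Z,  Y -> (empty) | c,  X -> Y | a.
// Terminals are lowercase, nonterminals uppercase; all names are my own.
public class FirstFollow {
    // each production is the LHS followed by its RHS symbols (empty RHS = epsilon)
    static String[][] prods = {
        {"Z", "d"}, {"Z", "X", "Y", "Z"},
        {"Y"}, {"Y", "c"},
        {"X", "Y"}, {"X", "a"}
    };
    static Set<String> nullable = new HashSet<>();
    static Map<String, Set<String>> first = new HashMap<>();
    static Map<String, Set<String>> follow = new HashMap<>();

    static boolean isTerminal(String s) { return Character.isLowerCase(s.charAt(0)); }
    static Set<String> get(Map<String, Set<String>> m, String s) {
        return m.computeIfAbsent(s, k -> new HashSet<>());
    }

    public static void compute() {
        for (String[] p : prods)                  // FIRST(Z) = {Z} for each terminal Z
            for (int i = 1; i < p.length; i++)
                if (isTerminal(p[i])) get(first, p[i]).add(p[i]);
        boolean changed = true;
        while (changed) {                         // iterate to a fixed point
            changed = false;
            for (String[] p : prods) {
                String X = p[0];
                int k = p.length - 1;             // number of RHS symbols
                boolean all = true;
                for (int i = 1; i <= k; i++) if (!nullable.contains(p[i])) all = false;
                if (all && nullable.add(X)) changed = true;
                for (int i = 1; i <= k; i++) {
                    boolean pre = true;           // Y1..Yi-1 all nullable?
                    for (int m = 1; m < i; m++) if (!nullable.contains(p[m])) pre = false;
                    if (pre && get(first, X).addAll(get(first, p[i]))) changed = true;
                    boolean suf = true;           // Yi+1..Yk all nullable?
                    for (int m = i + 1; m <= k; m++) if (!nullable.contains(p[m])) suf = false;
                    if (suf && get(follow, p[i]).addAll(get(follow, X))) changed = true;
                    for (int j = i + 1; j <= k; j++) {
                        boolean mid = true;       // Yi+1..Yj-1 all nullable?
                        for (int m = i + 1; m < j; m++) if (!nullable.contains(p[m])) mid = false;
                        if (mid && get(follow, p[i]).addAll(get(first, p[j]))) changed = true;
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        compute();
        System.out.println("nullable = " + nullable);
        System.out.println("FIRST(Z) = " + first.get("Z") + ", FOLLOW(X) = " + follow.get("X"));
    }
}
```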
23
Constructing a Predictive Parser
  • If we can choose the right production for each
    (X,T), where X is some nonterminal and T is the
    next token of the input, then we can write the
    recursive-descent parser.
  • What we need is a 2D table of productions,
    indexed by nonterminals X and terminals T.
  • This is called a predictive parsing table.

24
Constructing a Predictive Parser
  • To construct a predictive parsing table, enter
    production X → γ in row X, column T of the table
    for each T ∈ FIRST(γ).
  • Also, if γ is nullable, enter the production in
    row X, column T for each T ∈ FOLLOW(X).
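As a concrete companion to these two rules, here is a hedged Java sketch (names are my own) that fills the table for the grammar Z → d | X Y Z, Y → ε | c, X → Y | a, with the nullable/FIRST/FOLLOW values hardcoded as computed by the earlier algorithm; a cell holding more than one production is a duplicate entry:

```java
import java.util.*;

// Sketch: fill a predictive parsing table for
//   Z -> d | X Y Z,  Y -> (empty) | c,  X -> Y | a.
// nullable/FIRST/FOLLOW are hardcoded here for brevity.
public class LL1Table {
    static String[][] prods = {
        {"Z", "d"}, {"Z", "X", "Y", "Z"}, {"Y"}, {"Y", "c"}, {"X", "Y"}, {"X", "a"}
    };
    static Set<String> nullable = Set.of("X", "Y");
    static Map<String, Set<String>> FIRST = Map.of(
        "X", Set.of("a", "c"), "Y", Set.of("c"), "Z", Set.of("a", "c", "d"),
        "a", Set.of("a"), "c", Set.of("c"), "d", Set.of("d"));
    static Map<String, Set<String>> FOLLOW = Map.of(
        "X", Set.of("a", "c", "d"), "Y", Set.of("a", "c", "d"), "Z", Set.of());

    // FIRST of the whole RHS of production p
    static Set<String> firstOf(String[] p) {
        Set<String> s = new HashSet<>();
        for (int i = 1; i < p.length; i++) {
            s.addAll(FIRST.get(p[i]));
            if (!nullable.contains(p[i])) return s;   // stop at first non-nullable symbol
        }
        return s;
    }
    static boolean nullableRhs(String[] p) {
        for (int i = 1; i < p.length; i++) if (!nullable.contains(p[i])) return false;
        return true;
    }
    // table maps "nonterminal,token" to the set of applicable production numbers;
    // more than one number in a cell means the grammar is not LL(1)
    public static Map<String, Set<Integer>> build() {
        Map<String, Set<Integer>> table = new HashMap<>();
        for (int n = 0; n < prods.length; n++) {
            Set<String> tokens = new HashSet<>(firstOf(prods[n]));      // rule 1: FIRST(rhs)
            if (nullableRhs(prods[n])) tokens.addAll(FOLLOW.get(prods[n][0])); // rule 2: FOLLOW(lhs)
            for (String t : tokens)
                table.computeIfAbsent(prods[n][0] + "," + t, k -> new HashSet<>()).add(n);
        }
        return table;
    }
}
```

Running build() on this grammar puts two productions in the cells (Z, d), (Y, c), and (X, a), the duplicate-entry situation discussed on the next slide.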

25
Constructing a Predictive Parser
A predictive parser:
Z → d   Z → X Y Z   Y → ε   Y → c   X → Y   X → a
The presence of duplicate entries means that
predictive parsing will not work on this grammar.
26
Constructing a Predictive Parser
  • An ambiguous grammar will always lead to
    duplicate entries in a predictive parsing table.
  • If we need to use the language of the previous
    grammar as a programming language, we will need
    to find an unambiguous grammar.
  • Grammars whose predictive parsing tables contain
    no duplicate entries are called LL(1)
    (left-to-right, leftmost-derivation, 1-symbol
    lookahead).

27
Eliminating Left Recursion
  • The two productions
  • E → E + T
  • E → T
  • are certain to cause duplicate entries in the
    LL(1) parsing table.
  • The problem is that E appears as the first RHS
    symbol in an E-production.
  • This is called left-recursion. Grammars with left
    recursion cannot be LL(1).

28
Eliminating Left Recursion
  • To eliminate left recursion, we will rewrite the
    grammar using right recursion.

E → T E′    E′ → + T E′    E′ → ε
E → E + T    E → T
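After this transformation the grammar is LL(1), so a recursive-descent parser can choose each E′ production by lookahead alone. A minimal sketch, assuming single-character tokens and a digit in place of T's full expansion (a simplification of mine, not from the slides):

```java
// Recursive-descent parser for the right-recursive grammar
//   E -> T E',   E' -> + T E' | epsilon,
// with T simplified to a single digit so the example stays self-contained.
public class RightRec {
    static String input;
    static int pos;

    static char tok() { return pos < input.length() ? input.charAt(pos) : '$'; }
    static void eat(char c) {
        if (tok() == c) pos++; else throw new RuntimeException("syntax error");
    }
    static void E() { T(); Eprime(); }              // E  -> T E'
    static void Eprime() {                          // E' -> + T E' | epsilon
        if (tok() == '+') { eat('+'); T(); Eprime(); } // lookahead '+' selects the first production
        // otherwise choose E' -> epsilon (tok is then expected to be in FOLLOW(E'))
    }
    static void T() {                               // T simplified to one digit
        char c = tok();
        if (Character.isDigit(c)) eat(c); else throw new RuntimeException("syntax error");
    }

    public static boolean parse(String s) {
        input = s; pos = 0;
        try { E(); return tok() == '$'; } catch (RuntimeException e) { return false; }
    }
}
```

Note that the parser never needs to backtrack: each call consumes input or fails, which is exactly the property left recursion destroyed.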
29
Eliminating Left Recursion
  • In general, whenever we have productions X → X γ
    and X → α, where α does not start with X, we know
    that X derives strings of the form α γ*, an α
    followed by zero or more γ.

30
Left Factoring
  • We can left-factor the grammar

S → if E then S else S    S → if E then S
S → if E then S X    X → ε    X → else S
31
Error Recovery
  • pp. 55-56

32
LR Parsing
  • The weakness of the LL(k) parsing technique is
    that it must predict which production to use,
    having seen only the first k tokens of the
    right-hand side.
  • A more powerful technique, LR(k) parsing, is able
    to postpone the decision until it has seen input
    tokens corresponding to the entire right-hand
    side of the production in question.

Left-to-right parse, Rightmost-derivation, k-token
lookahead
33
[Table: the shift-reduce parse of
a := 7 ; b := c + ( d := 5 + 6 , d ). Each row
shows the stack (grammar symbols interleaved with
state numbers) and the remaining input as the
parser shifts and reduces.]
1 S → S ; S    2 S → id := E    3 S → print ( L )
4 E → id    5 E → num    6 E → E + E    7 E → ( S , E )
8 L → E    9 L → L , E
34
LR Parsing
  • How does the LR parser know when to shift and
    when to reduce?
  • By using a DFA!
  • The DFA is not applied to the input (finite
    automata are too weak to parse context-free
    grammars) but to the stack.
  • The edges of the DFA are labeled by the symbols
    (terminals and nonterminals) that can appear on
    the stack.
  • Transition table for Grammar 3.1.

35
[Figure: transition table for Grammar 3.1 (not
transcribed).]
36
LR Parsing
  • To use this table in parsing, treat the shift and
    goto actions as edges of a DFA, and scan the
    stack.
  • For example, if the stack is id := E, then the
    DFA goes from state 1 to 4 to 6 to 11.
  • If the next input token is a semicolon, then the
    column in state 11 says to reduce by rule 2,
  • so the top three tokens are popped and S is
    pushed.

37
LR Parsing
  • Rather than rescan the stack for each token, the
    parser can instead remember the state reached for
    each stack element. The parsing algorithm is then:

Look up top stack state and input symbol to get
action.
If action is
  Shift(n): Advance input one token; push n on
    stack.
  Reduce(k): Pop stack as many times as the number
    of symbols on the RHS of rule k; let X be the
    LHS symbol of rule k; in the state now on top
    of the stack, look up X to get "goto n"; push n
    on top of the stack.
  Accept: Stop parsing, report success.
  Error: Stop parsing, report failure.
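The loop above can be sketched as a table-driven driver. The following Java sketch runs it over a hand-built LR(0) table for the toy grammar S → ( S ) | x, augmented with S′ → S $; the state numbers and table entries are my own construction, not taken from the slides:

```java
import java.util.*;

// Table-driven LR driver for the toy grammar
//   1: S -> ( S )    2: S -> x    (augmented with S' -> S $).
// Table encoding: "sN" = shift to state N, "rN" = reduce by rule N, "a" = accept.
public class LrDriver {
    static Map<Character, String> row(Object... kv) {
        Map<Character, String> m = new HashMap<>();
        for (int i = 0; i < kv.length; i += 2) m.put((Character) kv[i], (String) kv[i + 1]);
        return m;
    }
    static List<Map<Character, String>> action = Arrays.asList(
        null,                                            // states are numbered from 1
        row('(', "s3", 'x', "s4"),                       // 1: start state
        row('$', "a"),                                   // 2: S' -> S . $
        row('(', "s3", 'x', "s4"),                       // 3: S -> ( . S )
        row('(', "r2", 'x', "r2", ')', "r2", '$', "r2"), // 4: S -> x .
        row(')', "s6"),                                  // 5: S -> ( S . )
        row('(', "r1", 'x', "r1", ')', "r1", '$', "r1")  // 6: S -> ( S ) .
    );
    static int[] rhsLen = {0, 3, 1};             // RHS lengths of rules 1 and 2
    static int[] gotoS  = {0, 2, 0, 5, 0, 0, 0}; // goto on nonterminal S, per state

    public static boolean parse(String s) {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(1);
        int pos = 0;
        while (true) {
            char tok = pos < s.length() ? s.charAt(pos) : '$';
            String act = action.get(stack.peek()).get(tok);
            if (act == null) return false;               // Error: stop parsing
            if (act.equals("a")) return true;            // Accept
            int n = act.charAt(1) - '0';
            if (act.charAt(0) == 's') {                  // Shift(n)
                stack.push(n); pos++;
            } else {                                     // Reduce(n)
                for (int i = 0; i < rhsLen[n]; i++) stack.pop();
                stack.push(gotoS[stack.peek()]);         // goto on the LHS symbol S
            }
        }
    }
}
```

Note that the stack holds only state numbers; the grammar symbols themselves never need to be stored, exactly as in the algorithm above.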
38
LR(0) Parser Generation
  • An LR(k) parser uses the content of its stack and
    the next k tokens of the input to decide which
    action to take.
  • The previous table shows the use of one symbol of
    lookahead.
  • For k = 2, the table has columns for every
    two-token sequence, etc.; in practice, k > 1 is
    not used for compilation because of
  • the huge tables, and
  • the fact that LR(1) grammars suffice for most
    reasonable programming languages.

39
LR(0) Parser Generation
  • Initially, the stack is empty, and the input will
    be a complete S-sentence followed by $.
  • We indicate this as S′ → .S$, where the dot
    indicates the current position of the parser.
  • In this state, the input begins with any possible
    RHS of an S-production.

S → .( L )    An example of an LR(0) item
S′ → S $    S → ( L )    S → x    L → S    L → L , S
40
LR(0) Parser Generation
  • Shift actions
  • In state 1, if we shift an x, then the top of
    stack will have an x.
  • We indicate that by

2
S → x .

  • Consider shifting a left parenthesis: the dot is
    likewise moved one position right.

41
LR(0) Parser Generation
  • Goto actions
  • In state 1, consider parsing past some string of
    tokens derived from the S nonterminal.

42
LR(0) Parser Generation
  • Reduce actions
  • In state 2, the dot is at the end of an item,
    which means that on top of the stack there must
    be a complete RHS of the corresponding
    production, ready for reduction.

43
LR(0) Parser Generation
  • The basic operations we have been performing on
    states are Closure(I) and Goto(I, X), where I is
    a set of items and X is a grammar symbol.
  • Closure(I) adds more items to a set of items when
    there is a dot to the left of a nonterminal
  • goto moves the dot past the symbol X in all items.

44
LR(0) Parser Generation
Closure(I) =
  repeat
    for any item A → α.Xβ in I
      for any production X → γ
        I ← I ∪ {X → .γ}
  until I does not change
  return I

Goto(I, X) =
  set J to the empty set
  for any item A → α.Xβ in I
    add A → αX.β to J
  return Closure(J)
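Both operations can be sketched directly in Java for Grammar 3.19 (S′ → S $, S → ( L ), S → x, L → S, L → L , S); encoding an item as a "production.dotPosition" string is my own choice, not from the slides:

```java
import java.util.*;

// Closure(I) and Goto(I, X) over LR(0) items for Grammar 3.19:
//   0: S' -> S $   1: S -> ( L )   2: S -> x   3: L -> S   4: L -> L , S
// An item "p.d" is production p with the dot before RHS symbol number d.
public class Items {
    static String[][] prods = {
        {"S'", "S", "$"}, {"S", "(", "L", ")"}, {"S", "x"},
        {"L", "S"}, {"L", "L", ",", "S"}
    };

    // the symbol right after the dot, or null if the dot is at the end
    static String sym(String item) {
        String[] parts = item.split("\\.");
        int p = Integer.parseInt(parts[0]), d = Integer.parseInt(parts[1]);
        return d + 1 < prods[p].length ? prods[p][d + 1] : null;
    }

    static Set<String> closure(Set<String> I) {
        Set<String> out = new HashSet<>(I);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String item : new ArrayList<>(out)) {
                String X = sym(item);
                if (X == null) continue;
                for (int p = 0; p < prods.length; p++)   // add X -> .gamma for each X-production
                    if (prods[p][0].equals(X) && out.add(p + ".0")) changed = true;
            }
        }
        return out;
    }

    static Set<String> gotoSet(Set<String> I, String X) {
        Set<String> J = new HashSet<>();
        for (String item : I)
            if (X.equals(sym(item))) {                   // move the dot past X
                String[] parts = item.split("\\.");
                J.add(parts[0] + "." + (Integer.parseInt(parts[1]) + 1));
            }
        return closure(J);
    }
}
```

For example, the start state closure({S′ → .S$}) adds the two S-productions, and Goto on "(" moves the dot into S → (.L) and then closes over the L- and S-productions.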
45
LR(0) Parser Generation
  • The algorithm for LR(0) parser construction:
  • First, augment the grammar with an auxiliary
    start production S′ → S $.
  • Let T be the set of states seen so far, and
  • E the set of (shift or goto) edges found so far.

46
LR(0) Parser Generation
Initialize T to {Closure({S′ → .S$})}
Initialize E to empty.
repeat
  for each state I in T
    for each item A → α.Xβ in I
      let J be Goto(I, X)
      T ← T ∪ {J}
      E ← E ∪ {(I, X, J)}
until E and T did not change in this iteration

For the symbol $ we do not compute Goto(I, $);
instead we will make an accept action.
47
LR(0) Parser Generation
[Figure: the LR(0) states for Grammar 3.19, with
shift and goto edges labeled x, (, ), comma, S,
and L; e.g. state 2 contains S → x. and state 7
contains L → S.]
LR(0) states for Grammar 3.19.
48
LR(0) Parser Generation
  • Compute set R of LR(0) reduce actions

R ← {}
for each state I in T
  for each item A → α. in I
    R ← R ∪ {(I, A → α)}
49
LR(0) Parser Generation
  • Construct a parsing table:
  • For each edge (I, X, J) where X is a terminal, we
    put a shift J action at position (I, X) of the
    table; if X is a nonterminal, we put a goto J
    action at position (I, X).
  • For each state I containing an item S′ → S.$ we
    put an accept action at (I, $).
  • For a state containing an item A → γ.
    (production n with the dot at the end), we put a
    reduce n action at (I, Y) for every token Y.
50
LR(0) Parser Generation
LR(0) parsing table for Grammar 3.19.
51
LR(0) Parser Generation
  • In principle, since LR(0) needs no lookahead, we
    just need a single action for each state: a
    state will shift or reduce, but not both.
  • In practice, since we need to know what state to
    shift into, we have rows headed by state numbers
    and columns headed by grammar symbols.

52
SLR Parser Generation
0 S → E $    1 E → T + E    2 E → T    3 T → x
  • In state 3, on symbol +, there is a duplicate
    entry.
  • This is a conflict and indicates that the grammar
    is not LR(0).
  • We need a more powerful parsing algorithm.

53
SLR Parser Generation
  • A simple way of constructing a better-than-LR(0)
    parser is called SLR.
  • The parser constructed for SLR is almost
    identical to that for LR(0), except that we put
    reduce actions into the table only where
    indicated by the FOLLOW set.

R ← {}
for each state I in T
  for each item A → α. in I
    for each token X in FOLLOW(A)
      R ← R ∪ {(I, X, A → α)}

In state I, on lookahead symbol X, the parser
will reduce by rule A → α.
54
SLR Parser Generation
  • The SLR states and parsing table become:

0 S → E $    1 E → T + E    2 E → T    3 T → x

[Figure: the SLR states; e.g. state 1 contains
S → .E$, E → .T+E, E → .T, T → .x, with edges
on E, T, x, and +.]
55
LR(1)
  • Even more powerful than SLR is the LR(1) parsing
    algorithm.
  • Most programming languages whose syntax is
    describable by a context-free grammar have an
    LR(1) grammar.

56
LR(1)
  • The algorithm for constructing an LR(1) parsing
    table is similar to that for LR(0), but the
    notion of an item is more sophisticated.
  • An LR(1) item consists of a grammar production, a
    right-hand-side position, and a lookahead symbol.
  • The idea is that an item (A → α.β, x) indicates
    that the sequence α is on top of the stack, and
    at the head of the input is a string derivable
    from βx.

57
LR(1)
  • An LR(1) state is a set of LR(1) items, and there
    are closure and goto operations for LR(1)

Closure(I) =
  repeat
    for any item (A → α.Xβ, z) in I
      for any production X → γ
        for any w ∈ FIRST(βz)
          I ← I ∪ {(X → .γ, w)}
  until I does not change
  return I

Goto(I, X) =
  set J to the empty set
  for any item (A → α.Xβ, z) in I
    add (A → αX.β, z) to J
  return Closure(J)
58
LR(1)
  • The start state is the closure of the item
    (S′ → .S$, ?), where the lookahead symbol ? will
    not matter.
  • The reduce actions are chosen by this algorithm:

R ← {}
for each state I in T
  for each item (A → α., z) in I
    R ← R ∪ {(I, z, A → α)}
59
LR(1)
0 S → E $    1 E → T + E    2 E → T    3 T → x

[Figure: the LR(1) states for this grammar; e.g.
state 3 contains the items E → T.+E and E → T.,
each paired with its lookahead symbol.]
60
LALR(1)
  • The LR(1) parsing table can be very large, with
    many states.
  • A smaller table can be made by merging two states
    whose items are identical except for lookahead
    sets.
  • For example, the items in state 5 of the previous
    LR(1) parser are identical if the lookahead sets
    are ignored.
  • The resulting parser is called an LALR(1) parser.
  • Merging states 5 and 7 gives the parsing table
    the same as the SLR one, but this is not always
    the case.

61
Hierarchy of Grammar Classes
  • Fig. 3.26, p. 67.

62
LR Parsing of Ambiguous Grammars
  • Many programming languages have grammar rules
    such as
  • S → if E then S else S
  • S → if E then S
  • For example, the statement
  • if a then if b then s1 else s2
  • could be understood in two ways:
  • (1) if a then {if b then s1 else s2}
  • (2) if a then {if b then s1} else s2
  • In the LR parsing table, there will be a
    shift-reduce conflict

S → if E then S . else S    (else: shift)
S → if E then S .           (any: reduce)
63
LR Parsing of Ambiguous Grammars
  • It is often possible to use an ambiguous grammar
    by resolving each SR conflict in favor of S or
    R, as appropriate.
  • But it is best to use this technique sparingly,
    and only in cases that are well understood.

64
Using Parser Generators
  • CUP (Construction of Useful Parsers) is an LR
    parser-generator tool, modeled on the classic
    Yacc (Yet Another Compiler-Compiler).
  • A CUP specification has a preamble, which
    declares lists of terminal symbols, nonterminals,
    etc.
  • The preamble also specifies how the parser is to
    be attached to a lexical analyzer and other such
    details.

65
Using Parser Generators
  • The grammar rules are productions of the form
  • exp ::= exp PLUS exp {: semantic action :}
  • where exp is a nonterminal whose RHS here is
    exp PLUS exp, and PLUS is a terminal symbol
    (token).
  • The semantic action is written in ordinary Java
    and will be executed whenever the parser reduces
    using this rule.

66
Using Parser Generators
terminal ID, WHILE, BEGIN, END, DO, IF,
  THEN, ELSE, SEMI, ASSIGN;
non terminal prog, stm, stmlist;
start with prog;

prog ::= stmlist;
stm ::= ID ASSIGN ID
      | WHILE ID DO stm
      | BEGIN stmlist END
      | IF ID THEN stm
      | IF ID THEN stm ELSE stm;
stmlist ::= stm
      | stmlist SEMI stm;

1 P → L    2 S → id := id    3 S → while id do S
4 S → begin L end    5 S → if id then S
6 S → if id then S else S    7 L → S    8 L → L ; S
67
Using Parser Generators
  • CUP reports shift-reduce and reduce-reduce
    conflicts.
  • By default, CUP resolves SR conflicts by
    shifting, and RR conflicts by using the rule that
    appears earlier in the grammar.

68
Precedence Directives
  • Ambiguous grammars can still be useful if we can
    find ways to resolve the conflicts.

E → id   E → num   E → E * E   E → E / E
E → E + E   E → E - E   E → ( E )
Ambiguous!

E → E + T   E → E - T   E → T
T → T * F   T → T / F   T → F
F → id   F → num   F → ( E )
Ambiguities are resolved by introducing T and F.

However, we can avoid introducing the T and F
symbols and their associated trivial
reductions E → T and T → F.
69
Precedence Directives
  • For example, in state 13 (see Table 3.30, p. 71)
    with lookahead + we find a conflict between shift
    into state 8 and reduce by rule 3.
  • Two of the items in state 13 are:
  • In this state, the top of stack is …E * E.
  • Shifting will eventually lead to …E * E + E.
  • Reducing will lead to the stack …E, and then the
    + will be shifted.

E → E * E .    (any)
E → E . + E    (+: shift)
70
Precedence Directives
  • The parse trees obtained by shifting and reducing
    are

[Figure: the two parse trees for …E * E + E.
Shifting yields E * (E + E); reducing yields
(E * E) + E.]
If we wish * to bind tighter than +, we should
reduce instead of shift. So we fill the (13, +)
entry in the table with r3 and discard s8.
71
Precedence Directives
  • CUP has precedence directives to indicate the
    resolution of this class of shift-reduce
    conflicts.

precedence nonassoc EQ, NEQ;
precedence left PLUS, MINUS;
precedence left TIMES, DIV;
precedence right EXP;
72
Syntax Versus Semantics
  • For a PL with arithmetic expressions and boolean
    expressions, arithmetic operators bind tighter
    than the boolean operators.
  • There are arithmetic variables and boolean
    variables.

stm ::= ID ASSIGN ae
      | ID ASSIGN be;
be ::= be OR be
      | be AND be
      | ae EQUAL ae
      | ID;
ae ::= ae PLUS ae
      | ID;

terminal ID, ASSIGN, PLUS, MINUS,
  AND, EQUAL;
non terminal stm, be, ae;
start with stm;
precedence left OR;
precedence left AND;
precedence left PLUS;
73
Syntax Versus Semantics
  • The grammar has a reduce-reduce conflict, as
    shown in Fig. 3.34, p. 76.
  • How should we rewrite the grammar to eliminate
    this conflict?

74
Syntax Versus Semantics
  • The problem is that when the parser sees an
    identifier, it has no way of knowing whether it
    is an arithmetic variable or a boolean variable;
    syntactically they look identical.
  • The solution is to defer this analysis to the
    semantic phase of the compiler; it is not a
    problem that can be handled naturally with
    context-free grammars.