Title: 3. Parsing
1 Parsing
- Syntax: the way in which words are put together
to form phrases, clauses, or sentences.
2 R.E.s Are Not Sufficient
- A form such as 28+301+9 can be defined by the R.E.s
  - digits = [0-9]+
  - sum = (digits "+")* digits
- Expressions such as (109+23)+61 and (1+(250+3)) can be
defined by
  - digits = [0-9]+
  - sum = expr "+" expr
  - expr = "(" sum ")" | digits
- But it is impossible for an FA to recognize
balanced parentheses!
3 R.E.s Are Not Sufficient
- The additional expressive power gained by
recursion is just what we need for parsing.
- What we are left with is a very simple notation, called
context-free grammars (CFGs).
- Just as R.E.s can be used to define lexical
structure in a static, declarative way, CFGs
define syntactic structure declaratively.
- But we will need something more powerful than FAs to
parse the languages described by grammars.
4 Context-Free Grammars
- A language is a set of strings; each string is a
finite sequence of symbols taken from a finite
alphabet.
- For parsing,
  - the strings are source programs,
  - the symbols are lexical tokens, and
  - the alphabet is the set of token types returned
by the lexical analyzer.
5 Context-Free Grammars
- A context-free grammar describes a language.
- A grammar has a set of productions of the form
  - symbol → symbol symbol … symbol
- Each symbol is either
  - a terminal, or
  - a nonterminal.
- One nonterminal is distinguished as the start
symbol of the grammar.
6Context-Free Grammars
Terminal symbols id print num , ( )
An example of grammar 1 S?SS 2 S?idE 3
S?print(L) 4 E?id 5 E?num 6 E?EE 7 E?(S,E) 8
L?E 9 L?L,E
Nonterminal symbols S E L
One sentence in the language of the
grammar idnum idid(idnumnum,id)
The source text might have been a7 bc(d5
6,d)
7 Context-Free Grammars
- Derivation
  - Start with the start symbol, then repeatedly
replace any nonterminal by one of its right-hand sides.
S
S ; S
S ; id := E
id := E ; id := E
id := num ; id := E
id := num ; id := E + E
id := num ; id := E + ( S , E )
id := num ; id := id + ( S , E )
id := num ; id := id + ( id := E , E )
id := num ; id := id + ( id := E + E , E )
id := num ; id := id + ( id := E + E , id )
id := num ; id := id + ( id := num + E , id )
id := num ; id := id + ( id := num + num , id )
1 S → S ; S   2 S → id := E   3 S → print ( L )   4 E → id   5 E → num
6 E → E + E   7 E → ( S , E )   8 L → E   9 L → L , E
8 Context-Free Grammars
- There are many different derivations of the same
sentence.
- A leftmost derivation is one in which the
leftmost nonterminal symbol is always the one
expanded.
- In a rightmost derivation, the rightmost
nonterminal is always the next to be expanded.
- The previous derivation is neither leftmost nor
rightmost.
9 Parse Trees
- A parse tree is made by connecting each symbol in
a derivation to the one from which it was derived.
- Two different derivations can have the same
parse tree.
10 Ambiguous Grammars
- A grammar is ambiguous if it can derive a
sentence with two different parse trees.
E → id   E → num   E → E * E   E → E / E   E → E + E   E → E - E   E → ( E )
Ambiguous!
11 Ambiguous Grammars
- Ambiguous grammars are problematic for compiling.
- Let us find an unambiguous grammar that accepts
the same language as the previous grammar:
  - * and / bind tighter than + and -, i.e., have higher precedence;
  - each operator associates to the left.
12 Ambiguous Grammars
This grammar can never produce the following
parse trees:
E → E + T   E → E - T   E → T   T → T * F   T → T / F   T → F   F → id   F → num   F → ( E )
(Figure of the excluded parse trees omitted.)
13 End-Of-File Marker
- Parsers must read not only terminal symbols, but
also the end-of-file marker $.
- To indicate that $ (end-of-file) must come after a
complete S-phrase, we augment the grammar with a
new start symbol S' and a new production S' → S $.
14 Predictive Parsing
A recursive-descent parser:

final int IF = 1, THEN = 2, ELSE = 3, BEGIN = 4, END = 5,
          PRINT = 6, SEMI = 7, NUM = 8, EQ = 9;
int tok = getToken();
void advance() { tok = getToken(); }
void eat(int t) { if (tok == t) advance(); else error(); }
void S() {
    switch (tok) {
        case IF:    eat(IF); E(); eat(THEN); S(); eat(ELSE); S(); break;
        case BEGIN: eat(BEGIN); S(); L(); break;
        case PRINT: eat(PRINT); E(); break;
        default:    error();
    }
}
void L() {
    switch (tok) {
        case END:  eat(END); break;
        case SEMI: eat(SEMI); S(); L(); break;
        default:   error();
    }
}
void E() { eat(NUM); eat(EQ); eat(NUM); }

S → if E then S else S   S → begin S L   S → print E
L → end   L → ; S L   E → num = num
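The fragment above can be made self-contained. A runnable sketch, under the assumption that tokens arrive as an int array walked by an index (rather than from getToken()) and that error() is replaced by an exception:

```java
// Runnable sketch of the recursive-descent parser above.
// Assumption: tokens are given as an int array; error() throws.
class RDParser {
    static final int IF = 1, THEN = 2, ELSE = 3, BEGIN = 4, END = 5,
                     PRINT = 6, SEMI = 7, NUM = 8, EQ = 9;
    final int[] toks;
    int pos = 0, tok;
    RDParser(int[] toks) { this.toks = toks; tok = toks.length > 0 ? toks[0] : -1; }
    void advance() { pos++; tok = pos < toks.length ? toks[pos] : -1; }
    void eat(int t) { if (tok == t) advance(); else throw new RuntimeException("syntax error"); }
    // S -> if E then S else S | begin S L | print E
    void S() {
        switch (tok) {
            case IF:    eat(IF); E(); eat(THEN); S(); eat(ELSE); S(); break;
            case BEGIN: eat(BEGIN); S(); L(); break;
            case PRINT: eat(PRINT); E(); break;
            default:    throw new RuntimeException("syntax error");
        }
    }
    // L -> end | ; S L
    void L() {
        switch (tok) {
            case END:  eat(END); break;
            case SEMI: eat(SEMI); S(); L(); break;
            default:   throw new RuntimeException("syntax error");
        }
    }
    // E -> num = num
    void E() { eat(NUM); eat(EQ); eat(NUM); }
    static boolean accepts(int[] toks) {
        try { RDParser p = new RDParser(toks); p.S(); return p.pos == toks.length; }
        catch (RuntimeException e) { return false; }
    }
}
```

Each case is selected by the first token of the corresponding production, which is exactly the property the next slides formalize with FIRST sets.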
15 Predictive Parsing
A conflict here (the ?'s mark cases for which no suitable token can be chosen):

void S() { E(); eat(EOF); }
void E() {
    switch (tok) {
        case ?: E(); eat(PLUS); T(); break;
        case ?: E(); eat(MINUS); T(); break;
        case ?: T(); break;
        default: error();
    }
}
void T() {
    switch (tok) {
        case ?: T(); eat(TIMES); F(); break;
        case ?: T(); eat(DIV); F(); break;
        case ?: F(); break;
        default: error();
    }
}

S → E $   E → E + T   E → E - T   E → T   T → T * F   T → T / F   T → F   F → id   F → num   F → ( E )
16 Predictive Parsing
- Recursive-descent, or predictive, parsing works
only on grammars where the first terminal symbol
of each subexpression provides enough information
to choose which production to use.
- We introduce the notion of FIRST sets to resolve
this conflict.
17 FIRST and FOLLOW Sets
- Given a string γ of terminal and nonterminal
symbols, FIRST(γ) is the set of all terminal
symbols that can begin any string derived from γ.
- For example, let γ = T * F.
  - Any string of terminal symbols derived from γ
must start with id, num, or (.
  - Thus FIRST(T * F) = {id, num, (}.
18 FIRST and FOLLOW Sets
- If two different productions X → γ1 and X → γ2 have
the same left-hand-side symbol and their right-hand
sides have overlapping FIRST sets, then the
grammar cannot be parsed using predictive parsing.
19 FIRST and FOLLOW Sets
- With respect to a particular grammar, given a
string γ of terminals and nonterminals:
  - nullable(X) is true if X can derive the empty
string.
  - FIRST(γ) is the set of terminals that can begin
strings derived from γ.
  - FOLLOW(X) is the set of terminals that can
immediately follow X.
20 FIRST and FOLLOW Sets
- A precise definition of FIRST, FOLLOW, and
nullable is that they are the smallest sets for
which these properties hold:

for each terminal symbol Z, FIRST[Z] = {Z}
for each production X → Y1 Y2 … Yk
    for each i from 1 to k, each j from i+1 to k,
        if all the Yi are nullable
            then nullable[X] = true
        if Y1 … Yi-1 are all nullable
            then FIRST[X] = FIRST[X] ∪ FIRST[Yi]
        if Yi+1 … Yk are all nullable
            then FOLLOW[Yi] = FOLLOW[Yi] ∪ FOLLOW[X]
        if Yi+1 … Yj-1 are all nullable
            then FOLLOW[Yi] = FOLLOW[Yi] ∪ FIRST[Yj]
21 FIRST and FOLLOW Sets
- Algorithm to compute FIRST, FOLLOW, and nullable:

for each terminal symbol Z
    FIRST[Z] ← {Z}
repeat
    for each production X → Y1 Y2 … Yk
        for each i from 1 to k, each j from i+1 to k,
            if all the Yi are nullable
                then nullable[X] ← true
            if Y1 … Yi-1 are all nullable
                then FIRST[X] ← FIRST[X] ∪ FIRST[Yi]
            if Yi+1 … Yk are all nullable
                then FOLLOW[Yi] ← FOLLOW[Yi] ∪ FOLLOW[X]
            if Yi+1 … Yj-1 are all nullable
                then FOLLOW[Yi] ← FOLLOW[Yi] ∪ FIRST[Yj]
until FIRST, FOLLOW, and nullable did not change
in this iteration
22 FIRST and FOLLOW Sets
Z → d   Z → X Y Z   Y → ε   Y → c   X → Y   X → a
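As a concrete check, the iterative algorithm on the previous slide can be run on this grammar. The sketch below is one possible Java transcription (not the book's code), encoding nonterminals as uppercase characters and terminals as lowercase ones:

```java
import java.util.*;

// Fixpoint computation of nullable, FIRST, and FOLLOW for the grammar
// Z -> d, Z -> XYZ, Y -> ε, Y -> c, X -> Y, X -> a.
// Uppercase = nonterminal, lowercase = terminal, "" = ε.
class FirstFollow {
    static final String[][] PRODS = {
        {"Z", "d"}, {"Z", "XYZ"}, {"Y", ""}, {"Y", "c"}, {"X", "Y"}, {"X", "a"}};
    static Map<Character, Set<Character>> first = new HashMap<>();
    static Map<Character, Set<Character>> follow = new HashMap<>();
    static Set<Character> nullable = new HashSet<>();

    // Are all symbols of s in positions [from, to) nullable?
    static boolean allNullable(String s, int from, int to) {
        for (int i = from; i < to; i++)
            if (!nullable.contains(s.charAt(i))) return false;
        return true;
    }

    static void compute() {
        for (String[] p : PRODS)
            for (char c : (p[0] + p[1]).toCharArray()) {
                first.putIfAbsent(c, new HashSet<>());
                follow.putIfAbsent(c, new HashSet<>());
                if (Character.isLowerCase(c)) first.get(c).add(c); // FIRST[Z] = {Z} for terminals
            }
        boolean changed = true;
        while (changed) {                      // repeat ... until nothing changed
            changed = false;
            for (String[] p : PRODS) {
                char X = p[0].charAt(0);
                String Y = p[1];
                int k = Y.length();
                if (allNullable(Y, 0, k)) changed |= nullable.add(X);
                for (int i = 0; i < k; i++) {
                    if (allNullable(Y, 0, i))
                        changed |= first.get(X).addAll(first.get(Y.charAt(i)));
                    if (allNullable(Y, i + 1, k))
                        changed |= follow.get(Y.charAt(i)).addAll(follow.get(X));
                    for (int j = i + 1; j < k; j++)
                        if (allNullable(Y, i + 1, j))
                            changed |= follow.get(Y.charAt(i)).addAll(first.get(Y.charAt(j)));
                }
            }
        }
    }
}
```

Running compute() should yield nullable = {X, Y}, FIRST(X) = {a, c}, FIRST(Y) = {c}, FIRST(Z) = {a, c, d}, FOLLOW(X) = FOLLOW(Y) = {a, c, d}, and FOLLOW(Z) = {}.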
23 Constructing a Predictive Parser
- If we can choose the right production for each
pair (X, T), where X is some nonterminal and T is the
next token of the input, then we can write the
recursive-descent parser.
- What we need is a two-dimensional table of productions,
indexed by nonterminals X and terminals T.
- This is called a predictive parsing table.
24 Constructing a Predictive Parser
- To construct a predictive parsing table, enter
production X → γ in row X, column T of the table
for each T ∈ FIRST(γ).
- Also, if γ is nullable, enter the production in
row X, column T for each T ∈ FOLLOW(X).
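A sketch of this table-construction rule, run on the grammar Z → d, Z → XYZ, Y → ε, Y → c, X → Y, X → a. The FIRST/FOLLOW/nullable values are hardcoded here from the algorithm's output, and the cell keys such as "Z,d" are an arbitrary encoding chosen for this sketch:

```java
import java.util.*;

// Fill the predictive parsing table for Z -> d | XYZ, Y -> ε | c, X -> Y | a.
// Cells with more than one production are conflicts (grammar not LL(1)).
class PredTable {
    static final String[][] PRODS = {
        {"Z", "d"}, {"Z", "XYZ"}, {"Y", ""}, {"Y", "c"}, {"X", "Y"}, {"X", "a"}};
    // Results of the FIRST/FOLLOW/nullable computation (hardcoded assumption).
    static final Set<Character> NULLABLE = Set.of('X', 'Y');
    static final Map<Character, Set<Character>> FIRST = Map.of(
        'X', Set.of('a', 'c'), 'Y', Set.of('c'), 'Z', Set.of('a', 'c', 'd'),
        'a', Set.of('a'), 'c', Set.of('c'), 'd', Set.of('d'));
    static final Map<Character, Set<Character>> FOLLOW = Map.of(
        'X', Set.of('a', 'c', 'd'), 'Y', Set.of('a', 'c', 'd'), 'Z', Set.of());

    // FIRST of a string γ of grammar symbols.
    static Set<Character> firstOf(String g) {
        Set<Character> out = new HashSet<>();
        for (char c : g.toCharArray()) {
            out.addAll(FIRST.get(c));
            if (!NULLABLE.contains(c)) return out;
        }
        return out;
    }
    static boolean nullableStr(String g) {
        for (char c : g.toCharArray()) if (!NULLABLE.contains(c)) return false;
        return true;
    }
    // Table cell "X,T" -> list of productions entered there.
    static Map<String, List<String>> table() {
        Map<String, List<String>> t = new TreeMap<>();
        for (String[] p : PRODS) {
            Set<Character> cols = new HashSet<>(firstOf(p[1]));        // T in FIRST(γ)
            if (nullableStr(p[1])) cols.addAll(FOLLOW.get(p[0].charAt(0))); // T in FOLLOW(X)
            for (char T : cols)
                t.computeIfAbsent(p[0] + "," + T, k -> new ArrayList<>())
                 .add(p[0] + "->" + p[1]);
        }
        return t;
    }
}
```

For this grammar the cells (Z, d), (Y, c), and (X, a) each receive two productions, anticipating the duplicate entries shown on the next slide.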
25 Constructing a Predictive Parser
A predictive parser:
Z → d   Z → X Y Z   Y → ε   Y → c   X → Y   X → a
The presence of duplicate entries means that
predictive parsing will not work on this grammar.
26 Constructing a Predictive Parser
- An ambiguous grammar will always lead to
duplicate entries in a predictive parsing table.
- If we need to use the language of the previous
grammar as a programming language, we will need
to find an unambiguous grammar.
- Grammars whose predictive parsing tables contain
no duplicate entries are called LL(1)
(Left-to-right parse, Leftmost derivation, 1-symbol
lookahead).
27 Eliminating Left Recursion
- The two productions
  - E → E + T
  - E → T
- are certain to cause duplicate entries in the LL(1)
parsing table.
- The problem is that E appears as the first RHS
symbol in an E-production.
- This is called left recursion. Grammars with left
recursion cannot be LL(1).
28 Eliminating Left Recursion
- To eliminate left recursion, we rewrite it
using right recursion.
E → E + T        E → T E'
E → T            E' → + T E'
                 E' → ε
29 Eliminating Left Recursion
- In general, whenever we have productions X → X γ
and X → α, where α does not start with X, we know
that X derives strings of the form α γ*, an α
followed by zero or more γ.
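This rewrite can be mechanized. A minimal sketch, assuming productions are given as space-separated RHS strings and the fresh nonterminal is named X':

```java
import java.util.*;

// Eliminate immediate left recursion for one nonterminal X:
//   X -> X g1 | ... | X gn | a1 | ... | am
// becomes
//   X  -> a1 X' | ... | am X'
//   X' -> g1 X' | ... | gn X' | ε      ("" stands for ε below)
class LeftRec {
    static Map<String, List<String>> eliminate(String x, List<String> rhss) {
        String xp = x + "'";                              // fresh nonterminal X'
        List<String> base = new ArrayList<>(), rec = new ArrayList<>();
        for (String rhs : rhss) {
            if (rhs.startsWith(x + " ")) rec.add(rhs.substring(x.length() + 1));
            else base.add(rhs);
        }
        List<String> xRules = new ArrayList<>(), xpRules = new ArrayList<>();
        for (String a : base) xRules.add(a + " " + xp);   // X  -> a X'
        for (String g : rec)  xpRules.add(g + " " + xp);  // X' -> g X'
        xpRules.add("");                                  // X' -> ε
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put(x, xRules);
        out.put(xp, xpRules);
        return out;
    }
}
```

Applied to E → E + T, E → E - T, E → T, it produces E → T E' and E' → + T E' | - T E' | ε, matching the rewrite on the previous slide.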
30 Left Factoring
- We can left-factor the grammar
S → if E then S else S        S → if E then S X
S → if E then S               X → ε
                              X → else S
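A similarly minimal sketch of left factoring for two productions, assuming space-separated RHS strings and a fresh nonterminal named X (the empty string plays the role of ε):

```java
import java.util.*;

// Left-factor two productions lhs -> rhs1 and lhs -> rhs2 that share a
// common prefix, introducing a fresh nonterminal X for the differing tails.
class LeftFactor {
    static Map<String, List<String>> factor(String lhs, String rhs1, String rhs2) {
        String[] a = rhs1.split(" "), b = rhs2.split(" ");
        int n = 0;
        while (n < a.length && n < b.length && a[n].equals(b[n])) n++;  // common prefix
        String prefix = String.join(" ", Arrays.copyOfRange(a, 0, n));
        String tail1  = String.join(" ", Arrays.copyOfRange(a, n, a.length));
        String tail2  = String.join(" ", Arrays.copyOfRange(b, n, b.length));
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put(lhs, List.of(prefix + " X"));   // lhs -> prefix X
        out.put("X", List.of(tail1, tail2));    // X -> tail1 | tail2 ("" = ε)
        return out;
    }
}
```

Factoring "if E then S" against "if E then S else S" yields exactly the slide's result: S → if E then S X with X → ε | else S.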
31 Error Recovery
32 LR Parsing
- The weakness of the LL(k) parsing technique is that
it must predict which production to use, having
seen only the first k tokens of the right-hand
side.
- A more powerful technique, LR(k) parsing, is able
to postpone the decision until it has seen input
tokens corresponding to the entire right-hand
side of the production in question.
- LR(k): Left-to-right parse, Rightmost derivation, k-token
lookahead.
33 LR Parsing
(Shift-reduce trace omitted: the stack, holding interleaved
grammar symbols and DFA states such as 1 id 4 := 6 num 10,
grows and shrinks as the input
a := 7 ; b := c + ( d := 5 + 6 , d )
is consumed token by token.)
1 S → S ; S   2 S → id := E   3 S → print ( L )   4 E → id   5 E → num
6 E → E + E   7 E → ( S , E )   8 L → E   9 L → L , E
34 LR Parsing
- How does the LR parser know when to shift and
when to reduce?
- By using a DFA!
- The DFA is applied not to the input (finite
automata are too weak to parse context-free
grammars) but to the stack.
- The edges of the DFA are labeled by the symbols
(terminals and nonterminals) that can appear on
the stack.
- Transition table for Grammar 3.1.
35 (Transition table for Grammar 3.1; figure only, no transcript.)
36 LR Parsing
- To use this table in parsing, treat the shift and
goto actions as edges of a DFA, and scan the
stack.
- For example, if the stack is id := E, then the DFA
goes from state 1 to 4 to 6 to 11.
- If the next input token is a semicolon, then the
column in state 11 says to reduce by rule 2.
- So the top three tokens are popped and S is pushed.
37 LR Parsing
- Rather than rescan the stack for each token, the
parser can remember the state reached for
each stack element. The parsing algorithm is then:

Look up the top stack state and the input symbol to get an action.
If the action is
    Shift(n):  advance the input one token; push n on the stack.
    Reduce(k): pop the stack as many times as the number of symbols
               on the RHS of rule k;
               let X be the LHS symbol of rule k;
               in the state now on top of the stack, look up X to
               get "goto n";
               push n on top of the stack.
    Accept:    stop parsing, report success.
    Error:     stop parsing, report failure.
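This driver loop can be sketched directly. The tables below are hand-encoded for the small grammar S' → S $, 1: S → ( L ), 2: S → x, 3: L → S, 4: L → L , S; the state numbers follow the usual LR(0) construction and are an assumption of this sketch:

```java
import java.util.*;

// Table-driven LR driver for: S' -> S $, 1: S -> (L), 2: S -> x,
// 3: L -> S, 4: L -> L,S. States 2, 6, 7, 9 are reduce-only (LR(0));
// state 4 accepts on $.
class LRDriver {
    static final int[]  RHS_LEN = {0, 3, 1, 1, 3};      // #RHS symbols of rules 1..4
    static final char[] LHS     = {' ', 'S', 'S', 'L', 'L'};
    static boolean parse(String input) {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(1);                                   // start state
        int i = 0;
        while (true) {
            int s = stack.peek();
            char t = i < input.length() ? input.charAt(i) : '$';
            int rule = switch (s) { case 2 -> 2; case 6 -> 1; case 7 -> 3; case 9 -> 4; default -> 0; };
            if (rule != 0) {                             // Reduce(rule)
                for (int k = 0; k < RHS_LEN[rule]; k++) stack.pop();
                stack.push(gotoState(stack.peek(), LHS[rule]));
                continue;
            }
            if (s == 4) return t == '$';                 // Accept
            int next = shift(s, t);                      // Shift(next) or Error
            if (next == 0) return false;
            stack.push(next);
            i++;
        }
    }
    static int shift(int s, char t) {
        if ((s == 1 || s == 3 || s == 8) && t == '(') return 3;
        if ((s == 1 || s == 3 || s == 8) && t == 'x') return 2;
        if (s == 5 && t == ',') return 8;
        if (s == 5 && t == ')') return 6;
        return 0;                                        // error
    }
    static int gotoState(int s, char nt) {
        if (nt == 'S') return s == 1 ? 4 : s == 3 ? 7 : 9;
        return 5;                                        // goto on L (only from state 3)
    }
}
```

For example, parse("(x,x)") shifts (, x, reduces S → x and L → S, shifts the comma and the second x, reduces L → L , S, shifts ), reduces S → ( L ), and accepts.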
38 LR(0) Parser Generation
- An LR(k) parser uses the contents of its stack and
the next k tokens of the input to decide which
action to take.
- The previous table shows the use of one symbol of
lookahead.
- For k = 2, the table would have a column for every two-token
sequence, etc.; in practice, k > 1 is not used for
compilation, because
  - the tables are huge, and
  - LR(1) is sufficient for most reasonable PLs.
39 LR(0) Parser Generation
- Initially, the stack is empty, and the input will
be a complete S-sentence followed by $.
- We indicate this as S' → .S $, where the dot
indicates the current position of the parser.
- In this state, the input begins with any possible RHS of
an S-production.

S → .( L )   (an example of an LR(0) item)

S' → S $   S → ( L )   S → x   L → S   L → L , S
40 LR(0) Parser Generation
- Shift actions
  - In state 1, if we shift an x, then the top of the
stack will have an x.
  - We indicate that by state 2, containing the item
S → x.
  - If we instead shift a left parenthesis, the dot is
likewise moved one position right.
41 LR(0) Parser Generation
- Goto actions
  - In state 1, consider parsing past some string
of tokens derived from the S nonterminal.
42 LR(0) Parser Generation
- Reduce actions
  - In state 2, the dot is at the end of the item, which
means that on top of the stack there must be a
complete RHS of the corresponding production,
ready for reduction.
43 LR(0) Parser Generation
- The basic operations we have been performing on
states are Closure(I) and Goto(I, X), where I is
a set of items and X is a grammar symbol.
- Closure(I) adds more items to a set of items when
there is a dot to the left of a nonterminal.
- Goto(I, X) moves the dot past the symbol X in all items.
44LR(0) Parser Generation
Closure(I) repeat for any item A??.X? in I
for any production X?? I?I?X?.? until I
does not change. Return I
Goto(I,X) set J to the empty set for any
item A??.X? in I add A??X.? to J return
Closure(J)
45 LR(0) Parser Generation
- The algorithm for LR(0) parser construction:
  - First, augment the grammar with an auxiliary
start production S' → S $.
  - Let T be the set of states seen so far, and
  - E the set of (shift or goto) edges found so far.
46LR(0) Parser Generation
Initialize T to Closure(S?S) Initialize E
to empty. repeat for each state I in T
for each item A??.X? in I let J be
goto(I,X) T?T?J
E?E?I?J until E and T did not change in this
iteration
X
For the symbol we do not compute goto(I,)
instead we will make an accept action.
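The construction above can be run on the grammar S' → S $, S → ( L ), S → x, L → S, L → L , S. The sketch below encodes items as strings with a dot, uses Z for the auxiliary start symbol S', and should produce the familiar nine LR(0) states:

```java
import java.util.*;

// LR(0) Closure, Goto, and state construction for:
//   Z -> S$ (Z plays the role of S'), S -> (L), S -> x, L -> S, L -> L,S
// Items are strings "A->alpha.beta"; uppercase letters are nonterminals.
class LR0States {
    static final Map<Character, List<String>> PRODS = Map.of(
        'S', List.of("(L)", "x"),
        'L', List.of("S", "L,S"));

    static Set<String> closure(Set<String> items) {
        Set<String> I = new TreeSet<>(items);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String item : new ArrayList<>(I)) {
                int dot = item.indexOf('.');
                if (dot + 1 < item.length()) {
                    char X = item.charAt(dot + 1);         // symbol after the dot
                    if (PRODS.containsKey(X))              // nonterminal: add X -> .gamma
                        for (String rhs : PRODS.get(X))
                            changed |= I.add(X + "->." + rhs);
                }
            }
        }
        return I;
    }
    static Set<String> goTo(Set<String> I, char X) {       // "goto" is reserved in Java
        Set<String> J = new TreeSet<>();
        for (String item : I) {
            int dot = item.indexOf('.');
            if (dot + 1 < item.length() && item.charAt(dot + 1) == X)
                J.add(item.substring(0, dot) + X + "." + item.substring(dot + 2));
        }
        return closure(J);
    }
    static Set<Set<String>> states() {
        Set<Set<String>> T = new LinkedHashSet<>();
        T.add(closure(Set.of("Z->.S$")));
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Set<String> I : new ArrayList<>(T))
                for (char X : "SLx(),".toCharArray()) {    // never goto on $
                    Set<String> J = goTo(I, X);
                    if (!J.isEmpty()) changed |= T.add(J);
                }
        }
        return T;
    }
}
```

states() should return nine distinct item sets, matching the state diagram on the next slide.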
47 LR(0) Parser Generation
(State diagram omitted; e.g., state 2 is { S → x. } and
state 7 is { L → S. }, with edges labeled x, (, ), comma, S, and L.)
LR(0) states for Grammar 3.19.
48 LR(0) Parser Generation
- Compute the set R of LR(0) reduce actions:
R ← {}
for each state I in T
    for each item A → γ. in I
        R ← R ∪ {(I, A → γ)}
49 LR(0) Parser Generation
- Construct a parsing table:
  - For each edge I →X J, where X is a terminal, we put a
shift J at position (I, X) of the table; if X is a
nonterminal, we put a goto J at position (I, X).
  - For each state I containing an item S' → S.$, we
put an accept action at (I, $).
  - For a state containing an item A → γ. (production n
with the dot at the end), we put a reduce n
action at (I, Y) for every token Y.
50 LR(0) Parser Generation
LR(0) parsing table for Grammar 3.19.
51 LR(0) Parser Generation
- In principle, since LR(0) needs no lookahead, we
just need a single action for each state: a state
will shift or reduce, but not both.
- In practice, since we need to know what state to
shift into, we have rows headed by state numbers
and columns headed by grammar symbols.
52 SLR Parser Generation
0 S → E $   1 E → T + E   2 E → T   3 T → x
- In state 3, on symbol +, there is a duplicate
entry.
- This is a conflict and indicates that the grammar
is not LR(0).
- We need a more powerful parsing algorithm.
53 SLR Parser Generation
- A simple way of constructing a better-than-LR(0)
parser is called SLR.
- The SLR construction is almost identical to
that for LR(0), except that we put reduce actions
into the table only where indicated by the FOLLOW
set:

R ← {}
for each state I in T
    for each item A → γ. in I
        for each token X in FOLLOW(A)
            R ← R ∪ {(I, X, A → γ)}

In state I, on lookahead symbol X, the parser
will reduce by rule A → γ.
54 SLR Parser Generation
- The SLR states and parsing table become:
0 S → E $   1 E → T + E   2 E → T   3 T → x
(State diagram omitted; e.g., state 1 is
{ S → .E $, E → .T + E, E → .T, T → .x }, with goto edges
on E and T and a shift edge on x.)
55 LR(1)
- Even more powerful than SLR is the LR(1) parsing
algorithm.
- Most programming languages whose syntax is
describable by a context-free grammar have an LR(1)
grammar.
56 LR(1)
- The algorithm for constructing an LR(1) parsing
table is similar to that for LR(0), but the
notion of an item is more sophisticated.
- An LR(1) item consists of a grammar production, a
right-hand-side position, and a lookahead symbol.
- The idea is that an item (A → α.β, x) indicates that
the sequence α is on top of the stack, and at the
head of the input is a string derivable from βx.
57 LR(1)
- An LR(1) state is a set of LR(1) items, and there
are Closure and Goto operations for LR(1):

Closure(I) =
    repeat
        for any item (A → α.Xβ, z) in I
            for any production X → γ
                for any w ∈ FIRST(βz)
                    I ← I ∪ {(X → .γ, w)}
    until I does not change
    return I

Goto(I, X) =
    set J to the empty set
    for any item (A → α.Xβ, z) in I
        add (A → αX.β, z) to J
    return Closure(J)
58 LR(1)
- The start state is the closure of the item
(S' → .S $, ?), where the lookahead symbol ? will not matter.
- The reduce actions are chosen by this algorithm:

R ← {}
for each state I in T
    for each item (A → γ., z) in I
        R ← R ∪ {(I, z, A → γ)}
59 LR(1)
0 S → E $   1 E → T + E   2 E → T   3 T → x
(LR(1) state diagram omitted; e.g., one state contains the
items E → T.+ E and E → T., each carrying its own
lookahead set.)
60 LALR(1)
- An LR(1) parsing table can be very large, with many
states.
- A smaller table can be made by merging any two states
whose items are identical except for their lookahead
sets.
- For example, the items in states 5 and 7 of the previous
LR(1) parser are identical if the lookahead sets
are ignored.
- The resulting parser is called an LALR(1) parser.
- Merging states 5 and 7 gives a parsing table
the same as the SLR one, but this is not always
the case.
61 Hierarchy of Grammar Classes
62 LR Parsing of Ambiguous Grammars
- Many programming languages have grammar rules
such as
  - S → if E then S else S
  - S → if E then S
- For example, the statement
  - if a then if b then s1 else s2
- could be understood in two ways:
  - (1) if a then { if b then s1 else s2 }
  - (2) if a then { if b then s1 } else s2
- In the LR parsing table there will be a
shift-reduce conflict:

S → if E then S . else S    (shift on else)
S → if E then S .           (reduce on any)
63 LR Parsing of Ambiguous Grammars
- It is often possible to use an ambiguous grammar by
resolving each shift-reduce conflict in favor of shifting or
reducing, as appropriate.
- But it is best to use this technique sparingly,
and only in cases that are well understood.
64 Using Parser Generators
- CUP (Construction of Useful Parsers) is an LR
parser-generator tool, modeled on the classic
Yacc (Yet Another Compiler-Compiler).
- A CUP specification has a preamble, which
declares lists of terminal symbols, nonterminals,
etc.
- The preamble also specifies how the parser is to
be attached to a lexical analyzer, and other such
details.
65 Using Parser Generators
- The grammar rules are productions of the form
  - exp ::= exp PLUS exp {: semantic action :}
- where exp is a nonterminal producing an RHS of
exp + exp, and PLUS is a terminal symbol (token).
- The semantic action is written in ordinary Java
and will be executed whenever the parser reduces
using this rule.
66 Using Parser Generators

terminal ID, WHILE, BEGIN, END, DO, IF,
         THEN, ELSE, SEMI, ASSIGN;
non terminal prog, stm, stmlist;
start with prog;
prog    ::= stmlist;
stm     ::= ID ASSIGN ID
          | WHILE ID DO stm
          | BEGIN stmlist END
          | IF ID THEN stm
          | IF ID THEN stm ELSE stm;
stmlist ::= stm
          | stmlist SEMI stm;

1 P → L   2 S → id := id   3 S → while id do S   4 S → begin L end
5 S → if id then S   6 S → if id then S else S   7 L → S   8 L → L ; S
67 Using Parser Generators
- CUP reports shift-reduce and reduce-reduce
conflicts.
- By default, CUP resolves shift-reduce conflicts by
shifting, and reduce-reduce conflicts by using the rule that
appears earlier in the grammar.
68 Precedence Directives
- Ambiguous grammars can still be useful if we can
find ways to resolve the conflicts.

Unambiguous: E → E + T   E → E - T   E → T   T → T * F   T → T / F   T → F   F → id   F → num   F → ( E )
Ambiguous!:  E → id   E → num   E → E * E   E → E / E   E → E + E   E → E - E   E → ( E )

The ambiguities are resolved by introducing T and F;
however, we would like to avoid those extra
symbols and their associated trivial
reductions E → T and T → F.
69 Precedence Directives
- For example, in state 13 (see Table 3.30, p. 71),
with lookahead + we find a conflict between shift
into state 8 and reduce by rule 3.
- Two of the items in state 13 are:

E → E * E .    (any)
E → E . + E

- In this state, the top of the stack is ⊢ E * E.
- Shifting will eventually lead to ⊢ E * E + E.
- Reducing will lead to the stack ⊢ E, and then the +
will be shifted.
70 Precedence Directives
- The parse trees obtained by shifting and by reducing
are:
(Figures omitted: shifting yields the tree for E * ( E + E );
reducing yields the tree for ( E * E ) + E.)
If we wish * to bind tighter than +, we should
reduce instead of shift. So we will fill the (13, +)
entry in the table with r3 and discard the s8.
71 Precedence Directives
- CUP has precedence directives to indicate the
resolution of this class of shift-reduce
conflicts:

precedence nonassoc EQ, NEQ;
precedence left PLUS, MINUS;
precedence left TIMES, DIV;
precedence right EXP;
72 Syntax Versus Semantics
- For a PL with arithmetic expressions and boolean
expressions, the arithmetic operators bind tighter
than the boolean operators.
- There are arithmetic variables and boolean
variables.

terminal ID, ASSIGN, PLUS, MINUS, AND, EQUAL;
non terminal stm, be, ae;
start with stm;
precedence left OR;
precedence left AND;
precedence left PLUS;
stm ::= ID ASSIGN ae
      | ID ASSIGN be;
be  ::= be OR be
      | be AND be
      | ae EQUAL ae
      | ID;
ae  ::= ae PLUS ae
      | ID;
73 Syntax Versus Semantics
- The grammar has a reduce-reduce conflict, as
shown in Fig. 3.34, p. 76.
- How should we rewrite the grammar to eliminate
this conflict?
74 Syntax Versus Semantics
- The problem is that when the parser sees an
identifier, it has no way of knowing whether it
is an arithmetic variable or a boolean variable;
syntactically they look identical.
- The solution is to defer this analysis until the
semantic phase of the compiler; it is not a
problem that can be handled naturally with
context-free grammars.