Title: 3. Parsing
1 Parsing
- Syntax: the way in which words are put together
to form phrases, clauses, or sentences.
2 R.E.s Are Not Sufficient
- A form such as 28+301+9 can be defined by the R.E.s
  - digits = [0-9]+
  - sum = (digits "+")* digits
- Expressions such as (109+23)+61 and (1+(250+3)) can be
defined by
  - digits = [0-9]+
  - sum = expr "+" expr
  - expr = "(" sum ")" | digits
- But it is impossible for an FA to recognize
balanced parentheses!
3 R.E.s Are Not Sufficient
- The additional expressive power gained by
recursion is just what we need for parsing.
- What we are left with is a very simple notation, called
context-free grammars (CFGs).
- Just as R.E.s can be used to define lexical
structure in a static, declarative way, CFGs
define syntactic structure declaratively.
- But we will need something more powerful than FAs to
parse the languages described by grammars.
4 Context-Free Grammars
- A language is a set of strings; each string is a
finite sequence of symbols taken from a finite
alphabet.
- For parsing,
  - the strings are source programs,
  - the symbols are lexical tokens, and
  - the alphabet is the set of token types returned
by the lexical analyzer.
5 Context-Free Grammars
- A context-free grammar describes a language.
- A grammar has a set of productions of the form
  - symbol → symbol symbol … symbol
- Each symbol is either
  - a terminal, or
  - a nonterminal.
- One nonterminal is distinguished as the start
symbol of the grammar.
6Context-Free Grammars
Terminal symbols id print num , ( )
An example of grammar 1 S?SS 2 S?idE 3
S?print(L) 4 E?id 5 E?num 6 E?EE 7 E?(S,E) 8
L?E 9 L?L,E
Nonterminal symbols S E L
One sentence in the language of the
grammar idnum idid(idnumnum,id)
The source text might have been a7 bc(d5
6,d)
7 Context-Free Grammars
- Derivation
  - Start with the start symbol, then repeatedly
replace any nonterminal by one of its right-hand sides.
S
S ; S
S ; id := E
id := E ; id := E
id := num ; id := E
id := num ; id := E + E
id := num ; id := E + ( S , E )
id := num ; id := id + ( S , E )
id := num ; id := id + ( id := E , E )
id := num ; id := id + ( id := E + E , E )
id := num ; id := id + ( id := E + E , id )
id := num ; id := id + ( id := num + E , id )
id := num ; id := id + ( id := num + num , id )
1 S → S ; S   2 S → id := E   3 S → print ( L )   4 E → id   5 E → num
6 E → E + E   7 E → ( S , E )   8 L → E   9 L → L , E
8 Context-Free Grammars
- There are many different derivations of the same
sentence.
- A leftmost derivation is one in which the
leftmost nonterminal symbol is always the one
expanded.
- In a rightmost derivation, the rightmost
nonterminal is always the next to be expanded.
- The previous derivation is neither leftmost nor
rightmost.
9 Parse Trees
- A parse tree is made by connecting each symbol in
a derivation to the one from which it was derived.
- Two different derivations can have the same
parse tree.
10 Ambiguous Grammars
- A grammar is ambiguous if it can derive a
sentence with two different parse trees.
E → id   E → num   E → E * E   E → E / E   E → E + E   E → E - E   E → ( E )
Ambiguous!
11 Ambiguous Grammars
- Ambiguous grammars are problematic for compiling.
- Let us find an unambiguous grammar that accepts
the same language as the previous grammar:
  - * and / bind tighter than + and -, i.e., have higher precedence;
  - each operator associates to the left.
12 Ambiguous Grammars
This grammar can never produce the following
parse trees:
E → E + T   E → E - T   E → T   T → T * F   T → T / F   T → F   F → id   F → num   F → ( E )
(Figure of the excluded parse trees omitted.)
13 End-Of-File Marker
- Parsers must read not only terminal symbols, but
also the end-of-file marker $.
- To indicate that $ (end-of-file) must come after a
complete S-phrase, we augment the grammar with a
new start symbol S' and a new production S' → S $.
14 Predictive Parsing
A recursive-descent parser:

final int IF = 1, THEN = 2, ELSE = 3, BEGIN = 4, END = 5,
          PRINT = 6, SEMI = 7, NUM = 8, EQ = 9;
int tok = getToken();
void advance() { tok = getToken(); }
void eat(int t) { if (tok == t) advance(); else error(); }
void S() {
    switch (tok) {
        case IF:    eat(IF); E(); eat(THEN); S(); eat(ELSE); S(); break;
        case BEGIN: eat(BEGIN); S(); L(); break;
        case PRINT: eat(PRINT); E(); break;
        default:    error();
    }
}
void L() {
    switch (tok) {
        case END:  eat(END); break;
        case SEMI: eat(SEMI); S(); L(); break;
        default:   error();
    }
}
void E() { eat(NUM); eat(EQ); eat(NUM); }

S → if E then S else S   S → begin S L   S → print E
L → end   L → ; S L   E → num = num
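The fragment above can be made self-contained. A runnable sketch, under the assumption that tokens arrive as an int array walked by an index (rather than from getToken()) and that error() is replaced by an exception:

```java
// Runnable sketch of the recursive-descent parser above.
// Assumption: tokens are given as an int array; error() throws.
class RDParser {
    static final int IF = 1, THEN = 2, ELSE = 3, BEGIN = 4, END = 5,
                     PRINT = 6, SEMI = 7, NUM = 8, EQ = 9;
    final int[] toks;
    int pos = 0, tok;
    RDParser(int[] toks) { this.toks = toks; tok = toks.length > 0 ? toks[0] : -1; }
    void advance() { pos++; tok = pos < toks.length ? toks[pos] : -1; }
    void eat(int t) { if (tok == t) advance(); else throw new RuntimeException("syntax error"); }
    // S -> if E then S else S | begin S L | print E
    void S() {
        switch (tok) {
            case IF:    eat(IF); E(); eat(THEN); S(); eat(ELSE); S(); break;
            case BEGIN: eat(BEGIN); S(); L(); break;
            case PRINT: eat(PRINT); E(); break;
            default:    throw new RuntimeException("syntax error");
        }
    }
    // L -> end | ; S L
    void L() {
        switch (tok) {
            case END:  eat(END); break;
            case SEMI: eat(SEMI); S(); L(); break;
            default:   throw new RuntimeException("syntax error");
        }
    }
    // E -> num = num
    void E() { eat(NUM); eat(EQ); eat(NUM); }
    static boolean accepts(int[] toks) {
        try { RDParser p = new RDParser(toks); p.S(); return p.pos == toks.length; }
        catch (RuntimeException e) { return false; }
    }
}
```

Each case is selected by the first token of the corresponding production, which is exactly the property the next slides formalize with FIRST sets.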
15 Predictive Parsing
A conflict here (the ?'s mark cases for which no suitable token can be chosen):

void S() { E(); eat(EOF); }
void E() {
    switch (tok) {
        case ?: E(); eat(PLUS); T(); break;
        case ?: E(); eat(MINUS); T(); break;
        case ?: T(); break;
        default: error();
    }
}
void T() {
    switch (tok) {
        case ?: T(); eat(TIMES); F(); break;
        case ?: T(); eat(DIV); F(); break;
        case ?: F(); break;
        default: error();
    }
}

S → E $   E → E + T   E → E - T   E → T   T → T * F   T → T / F   T → F   F → id   F → num   F → ( E )
16 Predictive Parsing
- Recursive-descent, or predictive, parsing works
only on grammars where the first terminal symbol
of each subexpression provides enough information
to choose which production to use.
- We introduce the notion of FIRST sets to resolve
this conflict.
17 FIRST and FOLLOW Sets
- Given a string γ of terminal and nonterminal
symbols, FIRST(γ) is the set of all terminal
symbols that can begin any string derived from γ.
- For example, let γ = T * F.
  - Any string of terminal symbols derived from γ
must start with id, num, or (.
  - Thus FIRST(T * F) = {id, num, (}.
18 FIRST and FOLLOW Sets
- If two different productions X → γ1 and X → γ2 have
the same left-hand-side symbol and their right-hand
sides have overlapping FIRST sets, then the
grammar cannot be parsed using predictive parsing.
19 FIRST and FOLLOW Sets
- With respect to a particular grammar, given a
string γ of terminals and nonterminals:
  - nullable(X) is true if X can derive the empty
string.
  - FIRST(γ) is the set of terminals that can begin
strings derived from γ.
  - FOLLOW(X) is the set of terminals that can
immediately follow X.
20 FIRST and FOLLOW Sets
- A precise definition of FIRST, FOLLOW, and
nullable is that they are the smallest sets for
which these properties hold:

for each terminal symbol Z, FIRST[Z] = {Z}
for each production X → Y1 Y2 … Yk
    for each i from 1 to k, each j from i+1 to k,
        if all the Yi are nullable
            then nullable[X] = true
        if Y1 … Yi-1 are all nullable
            then FIRST[X] = FIRST[X] ∪ FIRST[Yi]
        if Yi+1 … Yk are all nullable
            then FOLLOW[Yi] = FOLLOW[Yi] ∪ FOLLOW[X]
        if Yi+1 … Yj-1 are all nullable
            then FOLLOW[Yi] = FOLLOW[Yi] ∪ FIRST[Yj]
21 FIRST and FOLLOW Sets
- Algorithm to compute FIRST, FOLLOW, and nullable:

for each terminal symbol Z
    FIRST[Z] ← {Z}
repeat
    for each production X → Y1 Y2 … Yk
        for each i from 1 to k, each j from i+1 to k,
            if all the Yi are nullable
                then nullable[X] ← true
            if Y1 … Yi-1 are all nullable
                then FIRST[X] ← FIRST[X] ∪ FIRST[Yi]
            if Yi+1 … Yk are all nullable
                then FOLLOW[Yi] ← FOLLOW[Yi] ∪ FOLLOW[X]
            if Yi+1 … Yj-1 are all nullable
                then FOLLOW[Yi] ← FOLLOW[Yi] ∪ FIRST[Yj]
until FIRST, FOLLOW, and nullable did not change
in this iteration
22 FIRST and FOLLOW Sets
Z → d   Z → X Y Z   Y → ε   Y → c   X → Y   X → a
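As a concrete check, the iterative algorithm on the previous slide can be run on this grammar. The sketch below is one possible Java transcription (not the book's code), encoding nonterminals as uppercase characters and terminals as lowercase ones:

```java
import java.util.*;

// Fixpoint computation of nullable, FIRST, and FOLLOW for the grammar
// Z -> d, Z -> XYZ, Y -> ε, Y -> c, X -> Y, X -> a.
// Uppercase = nonterminal, lowercase = terminal, "" = ε.
class FirstFollow {
    static final String[][] PRODS = {
        {"Z", "d"}, {"Z", "XYZ"}, {"Y", ""}, {"Y", "c"}, {"X", "Y"}, {"X", "a"}};
    static Map<Character, Set<Character>> first = new HashMap<>();
    static Map<Character, Set<Character>> follow = new HashMap<>();
    static Set<Character> nullable = new HashSet<>();

    // Are all symbols of s in positions [from, to) nullable?
    static boolean allNullable(String s, int from, int to) {
        for (int i = from; i < to; i++)
            if (!nullable.contains(s.charAt(i))) return false;
        return true;
    }

    static void compute() {
        for (String[] p : PRODS)
            for (char c : (p[0] + p[1]).toCharArray()) {
                first.putIfAbsent(c, new HashSet<>());
                follow.putIfAbsent(c, new HashSet<>());
                if (Character.isLowerCase(c)) first.get(c).add(c); // FIRST[Z] = {Z} for terminals
            }
        boolean changed = true;
        while (changed) {                      // repeat ... until nothing changed
            changed = false;
            for (String[] p : PRODS) {
                char X = p[0].charAt(0);
                String Y = p[1];
                int k = Y.length();
                if (allNullable(Y, 0, k)) changed |= nullable.add(X);
                for (int i = 0; i < k; i++) {
                    if (allNullable(Y, 0, i))
                        changed |= first.get(X).addAll(first.get(Y.charAt(i)));
                    if (allNullable(Y, i + 1, k))
                        changed |= follow.get(Y.charAt(i)).addAll(follow.get(X));
                    for (int j = i + 1; j < k; j++)
                        if (allNullable(Y, i + 1, j))
                            changed |= follow.get(Y.charAt(i)).addAll(first.get(Y.charAt(j)));
                }
            }
        }
    }
}
```

Running compute() should yield nullable = {X, Y}, FIRST(X) = {a, c}, FIRST(Y) = {c}, FIRST(Z) = {a, c, d}, FOLLOW(X) = FOLLOW(Y) = {a, c, d}, and FOLLOW(Z) = {}.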
23 Constructing a Predictive Parser
- If we can choose the right production for each
pair (X, T), where X is some nonterminal and T is the
next token of the input, then we can write the
recursive-descent parser.
- What we need is a two-dimensional table of productions,
indexed by nonterminals X and terminals T.
- This is called a predictive parsing table.
24 Constructing a Predictive Parser
- To construct a predictive parsing table, enter
production X → γ in row X, column T of the table
for each T ∈ FIRST(γ).
- Also, if γ is nullable, enter the production in
row X, column T for each T ∈ FOLLOW(X).
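A sketch of this table-construction rule, run on the grammar Z → d, Z → XYZ, Y → ε, Y → c, X → Y, X → a. The FIRST/FOLLOW/nullable values are hardcoded here from the algorithm's output, and the cell keys such as "Z,d" are an arbitrary encoding chosen for this sketch:

```java
import java.util.*;

// Fill the predictive parsing table for Z -> d | XYZ, Y -> ε | c, X -> Y | a.
// Cells with more than one production are conflicts (grammar not LL(1)).
class PredTable {
    static final String[][] PRODS = {
        {"Z", "d"}, {"Z", "XYZ"}, {"Y", ""}, {"Y", "c"}, {"X", "Y"}, {"X", "a"}};
    // Results of the FIRST/FOLLOW/nullable computation (hardcoded assumption).
    static final Set<Character> NULLABLE = Set.of('X', 'Y');
    static final Map<Character, Set<Character>> FIRST = Map.of(
        'X', Set.of('a', 'c'), 'Y', Set.of('c'), 'Z', Set.of('a', 'c', 'd'),
        'a', Set.of('a'), 'c', Set.of('c'), 'd', Set.of('d'));
    static final Map<Character, Set<Character>> FOLLOW = Map.of(
        'X', Set.of('a', 'c', 'd'), 'Y', Set.of('a', 'c', 'd'), 'Z', Set.of());

    // FIRST of a string γ of grammar symbols.
    static Set<Character> firstOf(String g) {
        Set<Character> out = new HashSet<>();
        for (char c : g.toCharArray()) {
            out.addAll(FIRST.get(c));
            if (!NULLABLE.contains(c)) return out;
        }
        return out;
    }
    static boolean nullableStr(String g) {
        for (char c : g.toCharArray()) if (!NULLABLE.contains(c)) return false;
        return true;
    }
    // Table cell "X,T" -> list of productions entered there.
    static Map<String, List<String>> table() {
        Map<String, List<String>> t = new TreeMap<>();
        for (String[] p : PRODS) {
            Set<Character> cols = new HashSet<>(firstOf(p[1]));        // T in FIRST(γ)
            if (nullableStr(p[1])) cols.addAll(FOLLOW.get(p[0].charAt(0))); // T in FOLLOW(X)
            for (char T : cols)
                t.computeIfAbsent(p[0] + "," + T, k -> new ArrayList<>())
                 .add(p[0] + "->" + p[1]);
        }
        return t;
    }
}
```

For this grammar the cells (Z, d), (Y, c), and (X, a) each receive two productions, anticipating the duplicate entries shown on the next slide.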
25 Constructing a Predictive Parser
A predictive parser:
Z → d   Z → X Y Z   Y → ε   Y → c   X → Y   X → a
The presence of duplicate entries means that
predictive parsing will not work on this grammar.
26 Constructing a Predictive Parser
- An ambiguous grammar will always lead to
duplicate entries in a predictive parsing table.
- If we need to use the language of the previous
grammar as a programming language, we will need
to find an unambiguous grammar.
- Grammars whose predictive parsing tables contain
no duplicate entries are called LL(1)
(Left-to-right parse, Leftmost derivation, 1-symbol
lookahead).
27 Eliminating Left Recursion
- The two productions
  - E → E + T
  - E → T
- are certain to cause duplicate entries in the LL(1)
parsing table.
- The problem is that E appears as the first RHS
symbol in an E-production.
- This is called left recursion. Grammars with left
recursion cannot be LL(1).
28 Eliminating Left Recursion
- To eliminate left recursion, we rewrite it
using right recursion.
E → E + T        E → T E'
E → T            E' → + T E'
                 E' → ε
29 Eliminating Left Recursion
- In general, whenever we have productions X → X γ
and X → α, where α does not start with X, we know
that X derives strings of the form α γ*, an α
followed by zero or more γ.
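This rewrite can be mechanized. A minimal sketch, assuming productions are given as space-separated RHS strings and the fresh nonterminal is named X':

```java
import java.util.*;

// Eliminate immediate left recursion for one nonterminal X:
//   X -> X g1 | ... | X gn | a1 | ... | am
// becomes
//   X  -> a1 X' | ... | am X'
//   X' -> g1 X' | ... | gn X' | ε      ("" stands for ε below)
class LeftRec {
    static Map<String, List<String>> eliminate(String x, List<String> rhss) {
        String xp = x + "'";                              // fresh nonterminal X'
        List<String> base = new ArrayList<>(), rec = new ArrayList<>();
        for (String rhs : rhss) {
            if (rhs.startsWith(x + " ")) rec.add(rhs.substring(x.length() + 1));
            else base.add(rhs);
        }
        List<String> xRules = new ArrayList<>(), xpRules = new ArrayList<>();
        for (String a : base) xRules.add(a + " " + xp);   // X  -> a X'
        for (String g : rec)  xpRules.add(g + " " + xp);  // X' -> g X'
        xpRules.add("");                                  // X' -> ε
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put(x, xRules);
        out.put(xp, xpRules);
        return out;
    }
}
```

Applied to E → E + T, E → E - T, E → T, it produces E → T E' and E' → + T E' | - T E' | ε, matching the rewrite on the previous slide.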
30 Left Factoring
- We can left-factor the grammar
S → if E then S else S        S → if E then S X
S → if E then S               X → ε
                              X → else S
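A similarly minimal sketch of left factoring for two productions, assuming space-separated RHS strings and a fresh nonterminal named X (the empty string plays the role of ε):

```java
import java.util.*;

// Left-factor two productions lhs -> rhs1 and lhs -> rhs2 that share a
// common prefix, introducing a fresh nonterminal X for the differing tails.
class LeftFactor {
    static Map<String, List<String>> factor(String lhs, String rhs1, String rhs2) {
        String[] a = rhs1.split(" "), b = rhs2.split(" ");
        int n = 0;
        while (n < a.length && n < b.length && a[n].equals(b[n])) n++;  // common prefix
        String prefix = String.join(" ", Arrays.copyOfRange(a, 0, n));
        String tail1  = String.join(" ", Arrays.copyOfRange(a, n, a.length));
        String tail2  = String.join(" ", Arrays.copyOfRange(b, n, b.length));
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put(lhs, List.of(prefix + " X"));   // lhs -> prefix X
        out.put("X", List.of(tail1, tail2));    // X -> tail1 | tail2 ("" = ε)
        return out;
    }
}
```

Factoring "if E then S" against "if E then S else S" yields exactly the slide's result: S → if E then S X with X → ε | else S.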
31 Error Recovery
32 LR Parsing
- The weakness of the LL(k) parsing technique is that
it must predict which production to use, having
seen only the first k tokens of the right-hand
side.
- A more powerful technique, LR(k) parsing, is able
to postpone the decision until it has seen input
tokens corresponding to the entire right-hand
side of the production in question.
- LR(k): Left-to-right parse, Rightmost derivation, k-token
lookahead.
33 LR Parsing
(Shift-reduce trace omitted: the stack, holding interleaved
grammar symbols and DFA states such as 1 id 4 := 6 num 10,
grows and shrinks as the input
a := 7 ; b := c + ( d := 5 + 6 , d )
is consumed token by token.)
1 S → S ; S   2 S → id := E   3 S → print ( L )   4 E → id   5 E → num
6 E → E + E   7 E → ( S , E )   8 L → E   9 L → L , E
34 LR Parsing
- How does the LR parser know when to shift and
when to reduce?
- By using a DFA!
- The DFA is applied not to the input (finite
automata are too weak to parse context-free
grammars) but to the stack.
- The edges of the DFA are labeled by the symbols
(terminals and nonterminals) that can appear on
the stack.
- Transition table for Grammar 3.1.
35 (Transition table for Grammar 3.1; figure only, no transcript.)
36 LR Parsing
- To use this table in parsing, treat the shift and
goto actions as edges of a DFA, and scan the
stack.
- For example, if the stack is id := E, then the DFA
goes from state 1 to 4 to 6 to 11.
- If the next input token is a semicolon, then the
column in state 11 says to reduce by rule 2.
- So the top three tokens are popped and S is pushed.
37 LR Parsing
- Rather than rescan the stack for each token, the
parser can remember the state reached for
each stack element. The parsing algorithm is then:

Look up the top stack state and the input symbol to get an action.
If the action is
    Shift(n):  advance the input one token; push n on the stack.
    Reduce(k): pop the stack as many times as the number of symbols
               on the RHS of rule k;
               let X be the LHS symbol of rule k;
               in the state now on top of the stack, look up X to
               get "goto n";
               push n on top of the stack.
    Accept:    stop parsing, report success.
    Error:     stop parsing, report failure.
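This driver loop can be sketched directly. The tables below are hand-encoded for the small grammar S' → S $, 1: S → ( L ), 2: S → x, 3: L → S, 4: L → L , S; the state numbers follow the usual LR(0) construction and are an assumption of this sketch:

```java
import java.util.*;

// Table-driven LR driver for: S' -> S $, 1: S -> (L), 2: S -> x,
// 3: L -> S, 4: L -> L,S. States 2, 6, 7, 9 are reduce-only (LR(0));
// state 4 accepts on $.
class LRDriver {
    static final int[]  RHS_LEN = {0, 3, 1, 1, 3};      // #RHS symbols of rules 1..4
    static final char[] LHS     = {' ', 'S', 'S', 'L', 'L'};
    static boolean parse(String input) {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(1);                                   // start state
        int i = 0;
        while (true) {
            int s = stack.peek();
            char t = i < input.length() ? input.charAt(i) : '$';
            int rule = switch (s) { case 2 -> 2; case 6 -> 1; case 7 -> 3; case 9 -> 4; default -> 0; };
            if (rule != 0) {                             // Reduce(rule)
                for (int k = 0; k < RHS_LEN[rule]; k++) stack.pop();
                stack.push(gotoState(stack.peek(), LHS[rule]));
                continue;
            }
            if (s == 4) return t == '$';                 // Accept
            int next = shift(s, t);                      // Shift(next) or Error
            if (next == 0) return false;
            stack.push(next);
            i++;
        }
    }
    static int shift(int s, char t) {
        if ((s == 1 || s == 3 || s == 8) && t == '(') return 3;
        if ((s == 1 || s == 3 || s == 8) && t == 'x') return 2;
        if (s == 5 && t == ',') return 8;
        if (s == 5 && t == ')') return 6;
        return 0;                                        // error
    }
    static int gotoState(int s, char nt) {
        if (nt == 'S') return s == 1 ? 4 : s == 3 ? 7 : 9;
        return 5;                                        // goto on L (only from state 3)
    }
}
```

For example, parse("(x,x)") shifts (, x, reduces S → x and L → S, shifts the comma and the second x, reduces L → L , S, shifts ), reduces S → ( L ), and accepts.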
38 LR(0) Parser Generation
- An LR(k) parser uses the contents of its stack and
the next k tokens of the input to decide which
action to take.
- The previous table shows the use of one symbol of
lookahead.
- For k = 2, the table would have a column for every two-token
sequence, etc.; in practice, k > 1 is not used for
compilation, because
  - the tables are huge, and
  - LR(1) is sufficient for most reasonable PLs.
39 LR(0) Parser Generation
- Initially, the stack is empty, and the input will
be a complete S-sentence followed by $.
- We indicate this as S' → .S $, where the dot
indicates the current position of the parser.
- In this state, the input begins with any possible RHS of
an S-production.

S → .( L )   (an example of an LR(0) item)

S' → S $   S → ( L )   S → x   L → S   L → L , S
40 LR(0) Parser Generation
- Shift actions
  - In state 1, if we shift an x, then the top of the
stack will have an x.
  - We indicate that by state 2, containing the item
S → x.
  - If we instead shift a left parenthesis, the dot is
likewise moved one position right.
41 LR(0) Parser Generation
- Goto actions
  - In state 1, consider parsing past some string
of tokens derived from the S nonterminal.
42 LR(0) Parser Generation
- Reduce actions
  - In state 2, the dot is at the end of the item, which
means that on top of the stack there must be a
complete RHS of the corresponding production,
ready for reduction.
43 LR(0) Parser Generation
- The basic operations we have been performing on
states are Closure(I) and Goto(I, X), where I is
a set of items and X is a grammar symbol.
- Closure(I) adds more items to a set of items when
there is a dot to the left of a nonterminal.
- Goto(I, X) moves the dot past the symbol X in all items.
44LR(0) Parser Generation
Closure(I) repeat for any item A??.X? in I
for any production X?? I?I?X?.? until I
does not change. Return I
Goto(I,X) set J to the empty set for any
item A??.X? in I add A??X.? to J return
Closure(J)
45 LR(0) Parser Generation
- The algorithm for LR(0) parser construction:
  - First, augment the grammar with an auxiliary
start production S' → S $.
  - Let T be the set of states seen so far, and
  - E the set of (shift or goto) edges found so far.
46LR(0) Parser Generation
Initialize T to Closure(S?S) Initialize E
to empty. repeat for each state I in T
for each item A??.X? in I let J be
goto(I,X) T?T?J
E?E?I?J until E and T did not change in this
iteration
X
For the symbol we do not compute goto(I,)
instead we will make an accept action.
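The construction above can be run on the grammar S' → S $, S → ( L ), S → x, L → S, L → L , S. The sketch below encodes items as strings with a dot, uses Z for the auxiliary start symbol S', and should produce the familiar nine LR(0) states:

```java
import java.util.*;

// LR(0) Closure, Goto, and state construction for:
//   Z -> S$ (Z plays the role of S'), S -> (L), S -> x, L -> S, L -> L,S
// Items are strings "A->alpha.beta"; uppercase letters are nonterminals.
class LR0States {
    static final Map<Character, List<String>> PRODS = Map.of(
        'S', List.of("(L)", "x"),
        'L', List.of("S", "L,S"));

    static Set<String> closure(Set<String> items) {
        Set<String> I = new TreeSet<>(items);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String item : new ArrayList<>(I)) {
                int dot = item.indexOf('.');
                if (dot + 1 < item.length()) {
                    char X = item.charAt(dot + 1);         // symbol after the dot
                    if (PRODS.containsKey(X))              // nonterminal: add X -> .gamma
                        for (String rhs : PRODS.get(X))
                            changed |= I.add(X + "->." + rhs);
                }
            }
        }
        return I;
    }
    static Set<String> goTo(Set<String> I, char X) {       // "goto" is reserved in Java
        Set<String> J = new TreeSet<>();
        for (String item : I) {
            int dot = item.indexOf('.');
            if (dot + 1 < item.length() && item.charAt(dot + 1) == X)
                J.add(item.substring(0, dot) + X + "." + item.substring(dot + 2));
        }
        return closure(J);
    }
    static Set<Set<String>> states() {
        Set<Set<String>> T = new LinkedHashSet<>();
        T.add(closure(Set.of("Z->.S$")));
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Set<String> I : new ArrayList<>(T))
                for (char X : "SLx(),".toCharArray()) {    // never goto on $
                    Set<String> J = goTo(I, X);
                    if (!J.isEmpty()) changed |= T.add(J);
                }
        }
        return T;
    }
}
```

states() should return nine distinct item sets, matching the state diagram on the next slide.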
47 LR(0) Parser Generation
(State diagram omitted; e.g., state 2 is { S → x. } and
state 7 is { L → S. }, with edges labeled x, (, ), comma, S, and L.)
LR(0) states for Grammar 3.19.
48 LR(0) Parser Generation
- Compute the set R of LR(0) reduce actions:
R ← {}
for each state I in T
    for each item A → γ. in I
        R ← R ∪ {(I, A → γ)}
49 LR(0) Parser Generation
- Construct a parsing table:
  - For each edge I →X J, where X is a terminal, we put a
shift J at position (I, X) of the table; if X is a
nonterminal, we put a goto J at position (I, X).
  - For each state I containing an item S' → S.$, we
put an accept action at (I, $).
  - For a state containing an item A → γ. (production n
with the dot at the end), we put a reduce n
action at (I, Y) for every token Y.
50 LR(0) Parser Generation
LR(0) parsing table for Grammar 3.19.
51 LR(0) Parser Generation
- In principle, since LR(0) needs no lookahead, we
just need a single action for each state: a state
will shift or reduce, but not both.
- In practice, since we need to know what state to
shift into, we have rows headed by state numbers
and columns headed by grammar symbols.
52 SLR Parser Generation
0 S → E $   1 E → T + E   2 E → T   3 T → x
- In state 3, on symbol +, there is a duplicate
entry.
- This is a conflict and indicates that the grammar
is not LR(0).
- We need a more powerful parsing algorithm.
53 SLR Parser Generation
- A simple way of constructing a better-than-LR(0)
parser is called SLR.
- The SLR construction is almost identical to
that for LR(0), except that we put reduce actions
into the table only where indicated by the FOLLOW
set:

R ← {}
for each state I in T
    for each item A → γ. in I
        for each token X in FOLLOW(A)
            R ← R ∪ {(I, X, A → γ)}

In state I, on lookahead symbol X, the parser
will reduce by rule A → γ.
54 SLR Parser Generation
- The SLR states and parsing table become:
0 S → E $   1 E → T + E   2 E → T   3 T → x
(State diagram omitted; e.g., state 1 is
{ S → .E $, E → .T + E, E → .T, T → .x }, with goto edges
on E and T and a shift edge on x.)
55 LR(1)
- Even more powerful than SLR is the LR(1) parsing
algorithm.
- Most programming languages whose syntax is
describable by a context-free grammar have an LR(1)
grammar.
56 LR(1)
- The algorithm for constructing an LR(1) parsing
table is similar to that for LR(0), but the
notion of an item is more sophisticated.
- An LR(1) item consists of a grammar production, a
right-hand-side position, and a lookahead symbol.
- The idea is that an item (A → α.β, x) indicates that
the sequence α is on top of the stack, and at the
head of the input is a string derivable from βx.
57 LR(1)
- An LR(1) state is a set of LR(1) items, and there
are Closure and Goto operations for LR(1):

Closure(I) =
    repeat
        for any item (A → α.Xβ, z) in I
            for any production X → γ
                for any w ∈ FIRST(βz)
                    I ← I ∪ {(X → .γ, w)}
    until I does not change
    return I

Goto(I, X) =
    set J to the empty set
    for any item (A → α.Xβ, z) in I
        add (A → αX.β, z) to J
    return Closure(J)
58 LR(1)
- The start state is the closure of the item
(S' → .S $, ?), where the lookahead symbol ? will not matter.
- The reduce actions are chosen by this algorithm:

R ← {}
for each state I in T
    for each item (A → γ., z) in I
        R ← R ∪ {(I, z, A → γ)}
59 LR(1)
0 S → E $   1 E → T + E   2 E → T   3 T → x
(LR(1) state diagram omitted; e.g., one state contains the
items E → T.+ E and E → T., each carrying its own
lookahead set.)
60 LALR(1)
- An LR(1) parsing table can be very large, with many
states.
- A smaller table can be made by merging any two states
whose items are identical except for their lookahead
sets.
- For example, the items in states 5 and 7 of the previous
LR(1) parser are identical if the lookahead sets
are ignored.
- The resulting parser is called an LALR(1) parser.
- Merging states 5 and 7 gives a parsing table
the same as the SLR one, but this is not always
the case.
61 Hierarchy of Grammar Classes
62 LR Parsing of Ambiguous Grammars
- Many programming languages have grammar rules
such as
  - S → if E then S else S
  - S → if E then S
- For example, the statement
  - if a then if b then s1 else s2
- could be understood in two ways:
  - (1) if a then { if b then s1 else s2 }
  - (2) if a then { if b then s1 } else s2
- In the LR parsing table there will be a
shift-reduce conflict:

S → if E then S . else S    (shift on else)
S → if E then S .           (reduce on any)
63 LR Parsing of Ambiguous Grammars
- It is often possible to use an ambiguous grammar by
resolving each shift-reduce conflict in favor of shifting or
reducing, as appropriate.
- But it is best to use this technique sparingly,
and only in cases that are well understood.
64 Using Parser Generators
- CUP (Construction of Useful Parsers) is an LR
parser-generator tool, modeled on the classic
Yacc (Yet Another Compiler-Compiler).
- A CUP specification has a preamble, which
declares lists of terminal symbols, nonterminals,
etc.
- The preamble also specifies how the parser is to
be attached to a lexical analyzer, and other such
details.
65 Using Parser Generators
- The grammar rules are productions of the form
  - exp ::= exp PLUS exp {: semantic action :}
- where exp is a nonterminal producing an RHS of
exp + exp, and PLUS is a terminal symbol (token).
- The semantic action is written in ordinary Java
and will be executed whenever the parser reduces
using this rule.
66 Using Parser Generators

terminal ID, WHILE, BEGIN, END, DO, IF,
         THEN, ELSE, SEMI, ASSIGN;
non terminal prog, stm, stmlist;
start with prog;
prog    ::= stmlist;
stm     ::= ID ASSIGN ID
          | WHILE ID DO stm
          | BEGIN stmlist END
          | IF ID THEN stm
          | IF ID THEN stm ELSE stm;
stmlist ::= stm
          | stmlist SEMI stm;

1 P → L   2 S → id := id   3 S → while id do S   4 S → begin L end
5 S → if id then S   6 S → if id then S else S   7 L → S   8 L → L ; S
67 Using Parser Generators
- CUP reports shift-reduce and reduce-reduce
conflicts.
- By default, CUP resolves shift-reduce conflicts by
shifting, and reduce-reduce conflicts by using the rule that
appears earlier in the grammar.
68 Precedence Directives
- Ambiguous grammars can still be useful if we can
find ways to resolve the conflicts.

Unambiguous: E → E + T   E → E - T   E → T   T → T * F   T → T / F   T → F   F → id   F → num   F → ( E )
Ambiguous!:  E → id   E → num   E → E * E   E → E / E   E → E + E   E → E - E   E → ( E )

The ambiguities are resolved by introducing T and F;
however, we would like to avoid those extra
symbols and their associated trivial
reductions E → T and T → F.
69 Precedence Directives
- For example, in state 13 (see Table 3.30, p. 71),
with lookahead + we find a conflict between shift
into state 8 and reduce by rule 3.
- Two of the items in state 13 are:

E → E * E .    (any)
E → E . + E

- In this state, the top of the stack is ⊢ E * E.
- Shifting will eventually lead to ⊢ E * E + E.
- Reducing will lead to the stack ⊢ E, and then the +
will be shifted.
70 Precedence Directives
- The parse trees obtained by shifting and by reducing
are:
(Figures omitted: shifting yields the tree for E * ( E + E );
reducing yields the tree for ( E * E ) + E.)
If we wish * to bind tighter than +, we should
reduce instead of shift. So we will fill the (13, +)
entry in the table with r3 and discard the s8.
71 Precedence Directives
- CUP has precedence directives to indicate the
resolution of this class of shift-reduce
conflicts:

precedence nonassoc EQ, NEQ;
precedence left PLUS, MINUS;
precedence left TIMES, DIV;
precedence right EXP;
72 Syntax Versus Semantics
- For a PL with arithmetic expressions and boolean
expressions, the arithmetic operators bind tighter
than the boolean operators.
- There are arithmetic variables and boolean
variables.

terminal ID, ASSIGN, PLUS, MINUS, AND, EQUAL;
non terminal stm, be, ae;
start with stm;
precedence left OR;
precedence left AND;
precedence left PLUS;
stm ::= ID ASSIGN ae
      | ID ASSIGN be;
be  ::= be OR be
      | be AND be
      | ae EQUAL ae
      | ID;
ae  ::= ae PLUS ae
      | ID;
73 Syntax Versus Semantics
- The grammar has a reduce-reduce conflict, as
shown in Fig. 3.34, p. 76.
- How should we rewrite the grammar to eliminate
this conflict?
74 Syntax Versus Semantics
- The problem is that when the parser sees an
identifier, it has no way of knowing whether it
is an arithmetic variable or a boolean variable;
syntactically they look identical.
- The solution is to defer this analysis until the
semantic phase of the compiler; it is not a
problem that can be handled naturally with
context-free grammars.