Parsing — transcript of a PowerPoint presentation (source: http://www.cs.rpi.edu)
1
Parsing

For a given CFG G, parsing a string w means checking whether w ∈ L(G) and, if it is, finding a sequence of production rules that derives w. Since, for a given language L, there are many grammars that generate the same language L, parsing must be done with respect to a grammar G, not its language L(G). Consider the following two context-free grammars G1 and G2, which generate the same language {a^i b^i | i > 0}.

G1: S → aSb | ab
G2: S → aA,  A → Sb | b

Clearly, the following PDA recognizes this language. However, this PDA does not provide any information for identifying the grammar, nor a way to generate a given string with one of the grammars.
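The language {a^i b^i | i > 0} is easy to recognize with a stack. As an illustrative sketch (a hand-rolled stack machine, not the formal PDA tuple from the slides):

```python
def accepts(w):
    """Recognize {a^i b^i | i > 0} with an explicit stack, PDA-style.

    Push a marker for each leading 'a', then pop one marker per 'b'.
    Accept iff the stack empties exactly at the end of the input and i > 0.
    """
    stack = []
    pos = 0
    # Phase 1: push one marker per leading 'a'.
    while pos < len(w) and w[pos] == "a":
        stack.append("A")
        pos += 1
    if not stack:                      # i > 0 is required
        return False
    # Phase 2: pop one marker per 'b'.
    while pos < len(w) and w[pos] == "b" and stack:
        stack.pop()
        pos += 1
    return pos == len(w) and not stack
```

Note that this machine only decides membership; as the slide points out, it says nothing about which grammar, or which rule sequence, produced the string.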
2
Two Derivation Rules
Recall that if a context-free grammar G is unambiguous, then for each string x in L(G) there is a unique parse tree that yields x. So a parse tree could be a good output form for parsing. However, it is not practical to output parse trees in two-dimensional form. How about representing them in one-dimensional form, i.e., as the sequence of production rules applied? There is a problem with this approach: in general there can be more than one sequence of production rules that generates the same string x. This is true even when the grammar is unambiguous. Recall that a string x is in L(G) if x can be derived by applying a sequence of production rules (one rule at a time) in G. Suppose that aBAbD is derived in the middle of such a sequence. The final result is unaffected by which nonterminal symbol in aBAbD is chosen to derive the next sentential form (i.e., string of terminals and nonterminals). We should choose one from such multiple sequences of production rules, and the chosen sequence should be uniquely identifiable and effective to work with. The two derivation disciplines defined below are uniquely identifiable.
Leftmost (rightmost) derivation: A string is derived by iteratively applying a production rule to the leftmost (rightmost) nonterminal symbol of the current sentential form.
3
  • For the following grammar G, the leftmost and
    rightmost derivations are as shown below.

G: S → ABC,  A → aa,  B → a,  C → cC | c

Leftmost derivation:  S ⇒ ABC ⇒ aaBC ⇒ aaaC ⇒ aaacC ⇒ aaacc
Rightmost derivation: S ⇒ ABC ⇒ ABcC ⇒ ABcc ⇒ Aacc ⇒ aaacc

Notice that the sequence of productions applied with the leftmost derivation rule corresponds to the top-down left-to-right traversal of the parse tree, and the reverse of the sequence applied with the rightmost derivation rule corresponds to the bottom-up left-to-right traversal of the parse tree.
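The leftmost-derivation discipline can be mechanized directly: always rewrite the leftmost nonterminal. A small Python sketch (the convention that nonterminals are single uppercase letters is an assumption for this grammar, not from the slides):

```python
# Grammar G from the slide: S -> ABC, A -> aa, B -> a, C -> cC | c
RULES = {"S": ["ABC"], "A": ["aa"], "B": ["a"], "C": ["cC", "c"]}

def leftmost_step(form, rhs):
    """Replace the leftmost nonterminal (uppercase letter) in `form` with `rhs`."""
    for i, sym in enumerate(form):
        if sym.isupper():
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no nonterminal left to expand")

def derive_leftmost(choices):
    """Apply a sequence of right-hand sides leftmost, starting from S,
    returning every sentential form along the way."""
    forms = ["S"]
    for rhs in choices:
        forms.append(leftmost_step(forms[-1], rhs))
    return forms
```

For example, `derive_leftmost(["ABC", "aa", "a", "cC", "c"])` reproduces the leftmost derivation of aaacc shown above.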
4
The Basic Strategy for LL(k) Parsing
Now we investigate how we can use a DPDA for parsing. Consider the following CFG G, which generates the language {a^10 x | x = b^i or x = c^i, i > 0}. (For convenience, when we refer to a rule of G, we shall use the rule number shown beside the rule.)

(1) S → AB
(2) S → AC
(3) A → aaaaaaaaaa
(4) B → bB
(5) B → b
(6) C → cC
(7) C → c

We want to design a DPDA which, given a string x over the terminal alphabet on the input tape, outputs a sequence of production rules that generates x, if x ∈ L(G). We assume that the machine has an output port as shown in the figure below, and that the grammar is stored in memory as a lookup table. Let's first try a simple greedy strategy of generating a string in the stack that matches the string x appearing on the input tape. Since any string in L(G) must be generated starting from the start symbol S, the machine initially pushes S onto the stack, entering a working state q1, and examines the input to choose a proper production rule for S. Recall that the conventional PDA sees the stack top, which is S, and decides whether or not it will read the next input symbol.

5
Without reading the input, the machine has no information available for choosing rule (1) or (2) for S. So we let the machine read the input. Suppose that the symbol read is a, as shown in the figure below. This information does not help, because both rules (1) and (2) generate the same leading a's (actually 10 a's). The b's located after the a's in the input string indicate that the first production rule to apply to generate the input string is rule (1), S → AB (which, after expanding A, gives aaaaaaaaaaB). Using the conventional DPDA, it is impossible to correctly choose this production rule.
To overcome this problem we equip the DPDA with the capability of looking ahead in the input string by some constant k cells. For the current grammar the look-ahead length k must be at least 11, because the first symbol b appears 11 cells away from the current input position. (Notice that the count includes the current cell under the read head.) This symbol b is the nearest information in the input string that helps in choosing the correct production rule for S, rule (1) in this example.
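The 11-cell look-ahead decision for S can be sketched as follows (the function name and the `None`-for-reject convention are illustrative, not from the slides):

```python
def choose_rule_for_S(input_str, pos):
    """Decide between S -> AB (rule 1) and S -> AC (rule 2) by looking
    ahead 11 symbols from `pos`.  Looking ahead moves no read head;
    it only inspects the tape."""
    lookahead = input_str[pos:pos + 11]
    if len(lookahead) == 11 and lookahead.endswith("b"):
        return 1   # the 11th symbol is b, so the tail must come from B
    if len(lookahead) == 11 and lookahead.endswith("c"):
        return 2   # the 11th symbol is c, so the tail must come from C
    return None    # fewer than 11 symbols, or neither b nor c: reject
```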
6
Now, by looking 11 symbols ahead, the machine knows that the input string, if it is a string generated by grammar G, must be derived by applying production rule (1) first. So the machine replaces the S at the stack top with the right-hand side of rule (1) and outputs rule number (1), as shown in the figure below. (Notice that looking ahead does not involve any move of the read head.) Whenever a terminal symbol appears at the stack top, the machine reads the input symbol, compares it with the stack top, and pops it if they match. Otherwise, the input is rejected.
[Figure: the DPDA in state q1, with input tape aaaaaaaaaabbb, grammar G in memory, stack contents A B Z0, and rule number (1) emitted on the output port.]
For convenience, let (q, α, β) denote a configuration of the machine with current state q (with grammar G in memory), the input portion α remaining to be read, and current stack contents β. From now on we shall use this triple for the machine configuration instead of a diagram.
7
The initial configuration (q0, aaaaaaaaaabbb, Z0) is routinely changed to the ready configuration (q1, aaaaaaaaaabbb, SZ0). Based on the information obtained by looking ahead 11 positions, this configuration is changed by applying rule (1) as shown below. Then, seeing A at the stack top, the machine replaces A with the right-hand side of rule (3). For this operation the machine does not need to look ahead, because there is only one production rule for A. Now the first 10 a's of the input can be successfully matched, one by one, with the 10 a's appearing at the stack top as follows. (The numbers above the arrows refer to the production rules applied.)

Now the symbol B appears at the stack top. Which of the production rules B → bB | b should have been applied to generate the next input symbol b? Since more than one b remains, the next input b must be generated by rule B → bB. To see whether there is more than one b, the machine needs to look ahead 2 cells. Thus, the machine applies rule B → bB whenever it sees two b's ahead, and applies rule B → b when it sees a single b. This way the machine successfully parses the remaining input, as the following slide shows. The last configuration (q1, ε, Z0), with the stack empty (except for Z0) and no input left to parse, implies that the parsing has completed successfully. The output is the sequence of production rules applied whenever a nonterminal symbol appeared at the stack top.
8
The sequence of productions applied by this machine is shown below; it follows exactly the order of the leftmost derivation.
We can easily see that the machine, given a string x on the input tape, correctly generates the sequence of production rules in the order applied for the leftmost derivation of x if and only if x is in L(G). This machine parses the input string reading left-to-right, looking ahead at most 11 cells, and generates the sequence of production rules applied according to the leftmost derivation. We call this machine an LL(11) parser. Conventionally, an LL(k) parser is represented by a table that shows, depending on the nonterminal symbol appearing at the stack top and the look-ahead contents, which production rule should be applied. The operations of reading input symbols to match stack-top terminal symbols and popping them are usually omitted for convenience.
9
The parse table for the above example is shown below, where blank entries are for the cases not defined (i.e., the input is rejected), and x in the look-ahead contents is a don't-care (wild card) symbol.
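Assuming the grammar and rule numbering of slide 4, the parse table and its driver loop might be sketched in Python like this (the `pick` function plays the role of the table; names are illustrative):

```python
# Rule numbers as on slide 4:
# (1) S -> AB   (2) S -> AC   (3) A -> a^10
# (4) B -> bB   (5) B -> b    (6) C -> cC   (7) C -> c
RHS = {1: "AB", 2: "AC", 3: "a" * 10, 4: "bB", 5: "b", 6: "cC", 7: "c"}

def pick(nonterminal, look):
    """Parse-table lookup: choose a rule number from the stack-top
    nonterminal and the look-ahead string (up to 11 symbols)."""
    if nonterminal == "S":
        if len(look) >= 11 and look[10] == "b": return 1
        if len(look) >= 11 and look[10] == "c": return 2
    if nonterminal == "A":
        return 3                       # only one rule for A; no look-ahead needed
    if nonterminal == "B":
        return 4 if look[:2] == "bb" else (5 if look[:1] == "b" else None)
    if nonterminal == "C":
        return 6 if look[:2] == "cc" else (7 if look[:1] == "c" else None)
    return None

def parse(w):
    """Return the rule sequence of the leftmost derivation of w, or None."""
    stack, pos, output = ["S"], 0, []
    while stack:
        top = stack.pop()
        if top.isupper():              # nonterminal: expand via the table
            rule = pick(top, w[pos:pos + 11])
            if rule is None:
                return None
            output.append(rule)
            stack.extend(reversed(RHS[rule]))
        else:                          # terminal: match against input and pop
            if pos >= len(w) or w[pos] != top:
                return None
            pos += 1
    return output if pos == len(w) else None
```

For instance, `parse("aaaaaaaaaabbb")` yields the rule sequence of the leftmost derivation traced on the preceding slides.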
10
Example 1. Construct an LL(k) parser for the following CFG with minimum k.

(1) S → aSb
(2) S → aabbb

This grammar generates the language {a^i aabbb b^i | i ≥ 0}. Consider the string aaaabbbbb and its leftmost derivation.
Notice that the aabbb at the center of this string is generated by rule S → aabbb. If we let our parser look ahead 3 cells, it can select the correct production rule that generates the next input symbol as follows. If it sees aaa, then the first a in this look-ahead contents must have been generated by rule S → aSb. If it sees aab, then this string aab, together with the two succeeding b's, if any, must have been generated by production rule S → aabbb. Based on this observation, our LL(3) parser parses the string aaaaabbbbbb as follows. First the parser gets ready by pushing S onto the stack.
(q0, aaaaabbbbbb, Z0) ⇒ (q1, aaaaabbbbbb, SZ0) ⇒ …
11
Our parser, looking ahead aaa, applies rule (1) S → aSb and, seeing a at the stack top, pops it while reading a from the input tape. Thus, the configuration changes as follows.
Again, looking ahead aaa, the parser applies rule (1) S → aSb two more times as follows.
Now our parser looks ahead aab, applies rule (2) S → aabbb, and then matches the remaining input symbols with the ones appearing at the stack top as follows.
(q1, aabbbbbb, SbbbZ0) ⇒ (q1, aabbbbbb, aabbbbbbZ0) ⇒ … ⇒ (q1, ε, Z0)
12
The sequence of productions applied, (1) (1) (1) (2), is exactly the one applied in the leftmost derivation of aaaaabbbbbb. In effect, the parser derived the string in the stack according to the leftmost derivation rule. Clearly, this parser operates according to the following parsing table.
13
Example 2. Construct an LL(k) parser with minimum k for the following grammar.

(1) S → abA
(2) S → ε
(3) A → Saa
(4) A → b

We will build an LL(2) parser by examining how it can parse the string ababaaaa by deriving it in the stack according to its leftmost derivation.
Following the routine initialization we have S at the stack top as follows.
(q0, ababaaaa, Z0) ⇒ (q1, ababaaaa, SZ0) ⇒ …
By looking ahead two cells on the tape, the parser sees ab. Definitely the next rule applied in the leftmost derivation is rule (1), which is the only rule producing ab at the left. So the parser applies rule (1) and the configuration changes as follows.
14
(q1, ababaaaa, SZ0) ⇒(1) (q1, ababaaaa, abAZ0) ⇒ … ⇒ (q1, abaaaa, AZ0) ⇒ …
For convenience the grammar is repeated here.
If rule (4) were applied for A, the terminal symbol appearing in the next cell would be b, not a. The next input symbol must be generated by S. So rule (3) must have been applied next to generate the input string. Thus the configuration is changed as follows.
Now, the parser looks ahead two cells and sees ab. Rule (1) must have been applied to derive the input string; if rule (2) were applied, the two-cell look-ahead contents would be either ε (for the case of a null input string) or aa (generated by rule (3)). Thus the parser applies rule (1) for the S on the stack top, and changes its configuration as follows.

Seeing ab ahead, the parser applies S → abA and A → Saa in sequence, by observations (c) and (d), respectively.
⇒ (q1, aaaa, AaaZ0) ⇒ (q1, aaaa, SaaaaZ0) ⇒ …
Seeing aa ahead with S at the stack top, it applies S → ε by observation (b) above.
⇒ (q1, aaaa, aaaaZ0) ⇒ … ⇒ (q1, ε, Z0)
15
Again, since the next input symbol is a, the next rule applied cannot be rule (4); it must be rule (3). Thus the configuration changes as follows.
Now the parser looks ahead and sees aa, which cannot be generated by rule (1) or (2); it must have been generated previously by rule (3). Thus the parser applies rule (2) as follows, and then matches the remaining input with the string in the stack.
The sequence of rules applied by the parser at the stack top is exactly the same as the sequence applied in the leftmost derivation of ababaaaa.
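Assuming Example 2's grammar is (1) S → abA, (2) S → ε, (3) A → Saa, (4) A → b, as the trace suggests, the LL(2) decisions can be sketched in Python (a minimal table-driven loop; names are illustrative):

```python
def parse(w):
    """LL(2) parse for (1) S -> abA  (2) S -> epsilon  (3) A -> Saa  (4) A -> b.
    Returns the leftmost-derivation rule sequence, or None on rejection."""
    stack, pos, out = ["S"], 0, []
    while stack:
        top = stack.pop()
        look = w[pos:pos + 2]          # 2-cell look-ahead
        if top == "S":
            if look == "ab":
                out.append(1)
                stack.extend(["A", "b", "a"])   # S -> abA (push reversed)
            else:
                out.append(2)                    # S -> epsilon
                # valid only inside A -> Saa (look aa) or at end of input
                if look not in ("aa", ""):
                    return None
        elif top == "A":
            if look[:1] == "b":
                out.append(4)
                stack.append("b")                # A -> b
            elif look[:1] == "a":
                out.append(3)
                stack.extend(["a", "a", "S"])    # A -> Saa (push reversed)
            else:
                return None
        else:                          # terminal: match against input
            if pos >= len(w) or w[pos] != top:
                return None
            pos += 1
    return out if pos == len(w) else None
```

On ababaaaa this reproduces the rule sequence of the leftmost derivation traced above.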
16
Clearly, this parser can parse the language with the following parsing table (B = blank, X = don't care; blank entries mean reject).

              2 look-ahead
            ab     aa     bX     BB
Stack top
    S      abA     ε              ε
    A      Saa    Saa     b
17
Example 3. The grammar below is not an LL(k) grammar for any fixed integer k.

S → A | B
A → aA | 0
B → aB | 1

Notice that the language of this grammar is {a^n x | n ≥ 0, x ∈ {0, 1}}. The strings in this language can have an arbitrarily large number of a's followed by either 0 or 1, depending on whether the string is generated by rule S → A or S → B, respectively. With a finite look-ahead range k it is impossible to see the crucial indicator (0 or 1) that is needed to decide which production rule must be applied to generate the input string. For this grammar, there is no LL(k) parser for any finite k.
It is easy to see that for the following grammar, which generates the same language, we can construct an LL(1) parser.

S → aS | D
D → 0 | 1
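For the rewritten grammar a single symbol of look-ahead suffices: a selects S → aS, while 0 or 1 selects S → D. A minimal recursive-descent sketch (function names are illustrative):

```python
def parse_ll1(w):
    """LL(1) recognizer for S -> aS | D,  D -> 0 | 1.
    One symbol of look-ahead decides every rule choice."""
    pos = 0

    def S():
        nonlocal pos
        while pos < len(w) and w[pos] == "a":   # S -> aS, applied repeatedly
            pos += 1
        return D()                               # S -> D

    def D():
        nonlocal pos
        if pos < len(w) and w[pos] in "01":      # D -> 0 | 1
            pos += 1
            return True
        return False                             # expected 0 or 1: reject

    return S() and pos == len(w)
```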
18
Formal Definition of LL(k) Grammars
  • Notation: Let (k)α denote the prefix of length k of string α. If |α| < k, then (k)α = α.
  • For example, (2)ababaa = ab and (3)ab = ab.
  • Definition (LL(k) grammar). Let G = (VT, VN, P, S) be a CFG. Grammar G is an LL(k) grammar for some fixed integer k if it has the following property: For any two leftmost derivations
    S ⇒ … ⇒ wAα ⇒ wβα ⇒ … ⇒ wy and
    S ⇒ … ⇒ wAα ⇒ wγα ⇒ … ⇒ wx,
    where α, β, γ ∈ (VT ∪ VN)* and w, x, y ∈ VT*, if (k)x = (k)y, then β = γ.
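In Python the prefix operator (k)α is just a slice, since slicing already returns the whole string when |α| < k:

```python
def prefix_k(alpha, k):
    """(k)alpha: the prefix of alpha of length k; alpha itself if |alpha| < k."""
    return alpha[:k]
```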

If a CFG G has this property, then, for every x ∈ L(G), we can determine the leftmost derivation that generates x by scanning x left to right, looking ahead at most k symbols. (If you are interested in the proof of this claim, see The Theory of Computation by D. Wood, or a book on compiler construction.)