Title: Top-Down Parsing
1. Top-Down Parsing
- Top-Down Parsing by Recursive-Descent
- LL(1) Parsing
- First and Follow Sets
- Error Recovery in Top-Down Parsers
2. Top-Down Parsing
- A top-down parsing algorithm parses an input string of tokens by tracing out the steps in a leftmost derivation
- This is called top-down because the implied traversal of the parse tree is in preorder, and thus occurs from the root to the leaves
3. Top-Down Parsing
- Top-down parsers come in two forms
- Backtracking parsers
- Predictive parsers
- A predictive parser attempts to predict the next construction in the input string using one or more lookahead tokens
- A backtracking parser will try different possibilities for a parse, backing up an arbitrary amount when it finds that it is mistaken
4. Top-Down Parsing
- Backtracking parsers generally are more powerful than predictive ones
- But, they're also considerably slower: they can require exponential time to complete a parse
- This means that backtracking parsers are unsuitable for production-grade compilers
5. Top-Down Parsing
- We'll study the two most common forms of top-down, predictive parsing
- Recursive-descent parsing
- LL(1) parsing
- Recursive-descent parsing is very versatile, easy to implement, and is suitable for generating a parser by hand
6. Top-Down Parsing
- LL(1) parsing is no longer used in practice, but it serves as a good introduction to the notions we'll need later: those involving bottom-up rather than top-down parsing
7. LL(1) Parsing
- The LL(1) parsing method gets its name from
- First L: process the input from left to right (some early parsing techniques processed from right to left; this is no longer done today)
- Second L: use a leftmost derivation for the input string
- The 1 indicates that only one token of input is used to predict the direction of the parse
8. Lookahead Sets
- Both recursive-descent and LL(1) parsing generally require the computation of sets called First and Follow
- But, a simple top-down parser can be constructed without calculating these sets, so we'll examine this case first
9. Top-Down Parsing by Recursive-Descent
- The basic idea of recursive-descent is simplicity itself!
- We view a grammar rule for a nonterminal A as a definition for a (recursive) procedure that will recognize A
- The RHS of the rule specifies precisely what must be done in order to recognize A
10. Top-Down Parsing by Recursive-Descent
- In other words, the rules of the grammar are the specification of a program for recognizing the sentences of the language!
11. Top-Down Parsing by Recursive-Descent
- For example
- Start → able Baker charlie
- Baker → delta
- These two productions define
- Start()
-   Match(able)
-   Baker()
-   Match(charlie)
- Baker()
-   Match(delta)
12. Top-Down Parsing by Recursive-Descent
- Assume for a moment that Match() is a primitive function that calls the scanner
- It returns normally if it is successful
- It throws an exception if it is unsuccessful
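The Start/Baker example can be sketched as runnable code. This is a minimal sketch in Python rather than the slides' pseudocode; the list-of-strings token stream and the ParseError exception are assumptions made for the example.

```python
class ParseError(Exception):
    pass

def parse(tokens):
    pos = 0  # index of the current lookahead token

    def match(expected):
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == expected:
            pos += 1          # consume the token and advance
        else:
            raise ParseError(f"expected {expected!r} at position {pos}")

    def start():              # Start -> able Baker charlie
        match("able")
        baker()
        match("charlie")

    def baker():              # Baker -> delta
        match("delta")

    start()
    if pos != len(tokens):    # all input must be consumed
        raise ParseError("trailing input")

parse(["able", "delta", "charlie"])   # succeeds silently
```

Each nonterminal becomes a procedure, and Match plays the role of the primitive scanner call described above.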
13. Top-Down Parsing by Recursive-Descent
- Clearly, to make this approach work we need to be able to handle
- Concatenation (done!)
- Alternation (either/or)
- Repetition (Kleene * and +)
- Multiple rules with the same LHS
- I.e., we need to be able to handle BNF and EBNF
- Some kind of error recovery would be nice
14. Top-Down Parsing by Recursive-Descent
- Consider the expression grammar from the previous chapter
- exp → exp addop term | term
- addop → + | -
- term → term mulop factor | factor
- mulop → *
- factor → ( exp ) | number
- Consider the rule for factor
15. Top-Down Parsing by Recursive-Descent
- Here's some pseudocode for factor
- procedure factor
- begin
-   case token of
-     ( : match( ( )
-         exp
-         match( ) )
-     number :
-         match( number )
-     else error
-   end case
- end factor
16. Top-Down Parsing by Recursive-Descent
- It is assumed that there is a variable token that holds the current next token in the input (so this example uses one symbol of lookahead)
- We also assume a match procedure that matches the current next token with its parameter. It advances the input if it succeeds, and declares an error if it fails
17. Top-Down Parsing by Recursive-Descent
- Pseudocode for match
- procedure match( expectedToken )
- begin
-   if token = expectedToken then
-     getToken
-   else
-     error
- end match
18. Top-Down Parsing by Recursive-Descent
- Each reference to a nonterminal on the RHS becomes a call to a procedure by that name
- Each reference to a terminal on the RHS becomes a call to match with the terminal as argument
- So far things are relatively simple and straightforward
- Things are about to change
19. Repetition and Choice: EBNF
- Consider the simplified BNF syntax for an if-statement
- ifStmt → if ( exp ) statement
-        | if ( exp ) statement else statement
20. Repetition and Choice: EBNF
- This can be translated into
- proc ifStmt ()
- begin
-   match( if )
-   match( ( )
-   exp()
-   match( ) )
-   statement()
-   if token = else then
-     match( else )
-     statement()
-   end if
- end ifStmt
21. EBNF vs. BNF
- This procedure demonstrates the fact that we cannot distinguish which of the two forms of if-statement we have until we encounter (or don't) the else
- It corresponds far more precisely to the EBNF
- ifStmt → if ( exp ) stmt [ else stmt ]
22. EBNF vs. BNF
- EBNF notation is designed to mirror the actual code that one would produce in a recursive-descent parser!
- So, it's excellent for our purposes
23. EBNF vs. BNF
- Consider the BNF syntax
- exp → exp addop term | term
- In recursive-descent pseudocode you can see that you'd wind up with infinite recursion
- But, if you rephrase this using EBNF
- exp → term { addop term }
- there is no difficulty
24. EBNF vs. BNF
- The resulting pseudocode looks like this
- proc exp ()
- begin
-   term()
-   while token = + or token = - do
-     match( token )
-     term()
-   end while
- end exp
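The EBNF-driven procedures can be collected into a complete recognizer for the expression grammar. This is a sketch in Python rather than pseudocode; the token representation (the literal strings "+", "-", "*", "(", ")", and "number", with "$" as an end-of-input sentinel) is an assumption of the example.

```python
class ParseError(Exception):
    pass

def recognize(tokens):
    tokens = tokens + ["$"]            # end-of-input sentinel
    pos = 0

    def token():
        return tokens[pos]             # the one lookahead token

    def match(expected):
        nonlocal pos
        if token() == expected:
            pos += 1                   # advance the input
        else:
            raise ParseError(f"expected {expected!r}, got {token()!r}")

    def exp():                         # exp -> term { addop term }
        term()
        while token() in ("+", "-"):
            match(token())
            term()

    def term():                        # term -> factor { mulop factor }
        factor()
        while token() == "*":
            match("*")
            factor()

    def factor():                      # factor -> ( exp ) | number
        if token() == "(":
            match("(")
            exp()
            match(")")
        else:
            match("number")            # anything else must be a number

    exp()
    match("$")                         # all input must be consumed

recognize(["number", "+", "number", "*", "number"])   # accepted
```

Note how each `while` loop in the code corresponds directly to a `{ ... }` repetition in the EBNF.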
25. Extending to Semantics
- We need to be able to extend the syntax to include semantics
- And, we want to be certain that arithmetic operations are left-associative, as expected
- We'll not handle the syntax portion just now
- But, we can extend the pseudocode as follows
26. Extending to Semantics
- function exp() : integer
- var tmp : integer
- begin
-   tmp := term()
-   while token = + or token = - do
-     case token of
-       + : match( + )
-           tmp := tmp + term()
-       - : match( - )
-           tmp := tmp - term()
-     end case
-   end while
-   return tmp
- end exp
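The value-returning pseudocode can be sketched as a tiny working interpreter. This Python sketch extends the same pattern down through term and factor; representing numbers as Python ints in the token list is an assumption made for the example.

```python
class ParseError(Exception):
    pass

def evaluate(tokens):
    tokens = tokens + ["$"]          # end-of-input sentinel
    pos = 0

    def token():
        return tokens[pos]

    def match(expected):
        nonlocal pos
        if token() == expected:
            pos += 1
        else:
            raise ParseError(f"expected {expected!r}, got {token()!r}")

    def exp():                       # exp -> term { addop term }
        tmp = term()
        while token() in ("+", "-"): # left-associative by construction:
            op = token()             # each iteration folds the next term
            match(op)                # into the running result
            if op == "+":
                tmp = tmp + term()
            else:
                tmp = tmp - term()
        return tmp

    def term():                      # term -> factor { mulop factor }
        tmp = factor()
        while token() == "*":
            match("*")
            tmp = tmp * factor()
        return tmp

    def factor():                    # factor -> ( exp ) | number
        if token() == "(":
            match("(")
            val = exp()
            match(")")
            return val
        if isinstance(token(), int):
            val = token()
            match(val)
            return val
        raise ParseError(f"unexpected token {token()!r}")

    result = exp()
    match("$")
    return result

evaluate([8, "-", 3, "-", 2])   # left-associative: (8 - 3) - 2 = 3
```

The running `tmp` variable is what makes subtraction group to the left, exactly as the slide's pseudocode intends.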
27. Extending to Semantics
- This method of turning an EBNF grammar into code is very powerful
- One can use it to create complete compilers or complete interpreters
28. Extending to Semantics
- One must be careful to set up a collection of conventions regarding keeping token current, what match() really does, how getToken() performs, etc.
- But, there are no significant challenges or obstacles to using this approach
- Moreover, one can use this approach to create a syntax tree
29. Building a Syntax Tree
- Consider the syntax tree for 3 + 4 + 5
-        +
-       / \
-      +   5
-     / \
-    3   4
- The node representing the sum of 3 and 4 must be created before the node representing its sum with 5
30. Building a Syntax Tree
- We could use the following pseudocode
- function exp () : syntaxTree
- var tmp, newTmp : syntaxTree
- begin
-   tmp := term()
-   while token = + or token = - do
-     newTmp := makeOpNode( token )
-     match( token )
-     leftChild( newTmp ) := tmp
-     rightChild( newTmp ) := term()
-     tmp := newTmp
-   end while
-   return tmp
- end exp
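The makeOpNode idea can be sketched in Python. The Node class and the simplification that a term is a single number are assumptions made so the example stays small; the loop body mirrors the pseudocode line for line.

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data          # operator string or numeric value
        self.left = left
        self.right = right

def exp_tree(tokens):
    """Build a syntax tree for number { (+|-) number }."""
    pos = 0

    def term():                       # simplified: a term is one number
        nonlocal pos
        node = Node(tokens[pos])
        pos += 1
        return node

    tmp = term()
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        new_tmp = Node(tokens[pos])   # newTmp := makeOpNode( token )
        pos += 1                      # match( token )
        new_tmp.left = tmp            # leftChild( newTmp ) := tmp
        new_tmp.right = term()        # rightChild( newTmp ) := term()
        tmp = new_tmp
    return tmp

tree = exp_tree([3, "+", 4, "+", 5])
# the root's left child is the earlier "+" node: ((3 + 4) + 5)
```

Because the old tree always becomes the left child of the new operator node, the tree comes out left-associative.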
31. Building a Syntax Tree
- We've introduced a new function makeOpNode that creates a new node (for an operator)
- Nodes are assumed to be binary tree nodes, with room for one piece of data, a left child, and a right child
- The data can be an operator or a value (so there'd likely be a tag to distinguish these cases)
32. Building a Syntax Tree
- Note that the pseudocode does, indeed, produce a syntax tree and not a parse tree
- The flexibility of the recursive-descent method that we've described makes it the method of choice for hand-generated parsers (compilers, interpreters)
33. Some Problems (1)
- First, it may be difficult to translate a BNF grammar into an equivalent EBNF grammar
- You must be certain that the original and final grammars do, indeed, describe identical languages
34. Some Problems (2)
- What if you have a production like
- A → α | β
- where α and β both begin with nonterminals?
- How can you tell which production is the right one to use?
- The answer to this question requires the computation of the First sets of α and β: the set of tokens that can legally begin each string
35. Some Problems (3)
- What happens if we have an ε-production?
- In this case it may be necessary to know what tokens can legally come after a nonterminal
- This requires the computation of the Follow set of the nonterminal
36. Some Problems (4)
- What about error detection?
- We want to detect incorrect syntax as early as possible
- We'd like to be able to recover from an error and continue to parse
- Further, we may want to attempt to correct an error if it's possible to do so
37. Basic LL(1) Parsing
- LL(1) parsing uses an explicit stack rather than recursive calls to perform a parse
- It's helpful to visualize this stack in a standard way so that the actions of the LL(1) parser can be seen and discussed
38. Basic LL(1) Parsing
- We'll use this very simple grammar to illustrate things
- S → ( S ) S | ε
- This grammar produces strings of balanced parentheses
- L(S) = { ε, (), ()(), (()), … }
39. Basic LL(1) Parsing
- Input: ( )
- [The slide's step-by-step table of stack contents and remaining input is not reproduced here; $ marks the bottom of the stack and the end of the input]
40. Basic LL(1) Parsing
- The general pattern is
- We start with
-   $ StartSymbol      InputString $
-   ...
-   $                  $    Accept!
- A top-down parser parses by replacing a nonterminal at the top of the stack by one of the choices provided by the grammar rules
41. Basic LL(1) Parsing
- It selects the correct rule by examining the next input symbol (the front of the remaining input, which can itself be viewed as a stack)
- There are two actions
- Replace a nonterminal A at the top of the stack by a string α, using a rule A → α
- Match a token on the top of the stack with the next input token, and remove them both
42. Basic LL(1) Parsing
- If we want to construct a parse tree as the parse proceeds, we can add node construction actions as each nonterminal or terminal is pushed onto the stack
- If we want, we can construct a syntax tree instead of a parse tree
43. LL(1) Parsing Table Algorithm
- Using this parsing method, when a nonterminal A is at the top of the parsing stack, a decision must be made, based on the current input token (the lookahead token), about which grammar rule choice for A to use when replacing A on the stack
44. LL(1) Parsing Table Algorithm
- If a (terminal) token is at the top of the stack, no decision is necessary: either it is identical to the input token and a match occurs, or it isn't identical and an error occurs (because the input is incorrect)
45. LL(1) Parsing Table Algorithm
- We can express these two choices in tabular form by constructing an LL(1) parsing table
- This table is a 2-D array indexed by nonterminals and terminals, and contains the production choices to use at the appropriate parsing step
- We'll call this table M[N, T]
46. LL(1) Parsing Table Algorithm
- N is the set of nonterminals
- T is the set of terminals (tokens)
- M is a table of moves or actions to take in order to perform a parse
- We'll construct the entries for M in a moment
- Any entries that remain empty constitute error conditions (i.e., indications of bad input)
47. Constructing M[N, T]
- We add entries to M as follows
- If A → α is a production choice, and there is a derivation α ⇒* a β, where a is a token, then add A → α to the table at location M[A, a]
- If A → α is a production choice, and there are derivations α ⇒* ε and S $ ⇒* β A a γ, where S is the start symbol and a is a token (or $), then add A → α to the table at location M[A, a]
48. Constructing M[N, T]
- The ideas behind these rules
- Given a token a in the input, we wish to select a rule A → α if α can produce an a for matching
- If A derives the empty string (via A → α and α ⇒* ε), and if a is a token that can legally come after A in a derivation, then we want to select A → α to make A disappear
49. Constructing M[N, T]
- These rules are a bit difficult to carry out by hand
- But, they're simplified by the construction of the First and Follow sets that we mentioned earlier (but have yet to really define)
50. Definition
- An LL(1) grammar is one for which the associated LL(1) parsing table has at most one production in each table entry
- Note that such a grammar is unambiguous
51. Example M[N, T] for ifStmt
52. Parsing an ifStmt using M[N, T]
- Let's watch the parsing process proceed using the string
- if ( 0 ) if ( 1 ) other else other
- We'll use some abbreviations
- statement = S
- ifStmt = I
- elsePart = L
- exp = E
- if = i
- else = e
- other = o
53. Parsing an ifStmt using M[N, T]
54. Left Recursion and Left Factoring
- Repetition and choice in LL(1) parsing suffer from problems similar to those occurring in recursive-descent parsing
- We solved these problems for recursive-descent parsing by moving to EBNF notation
- We can't use the same technique here; we must rewrite within BNF
55. Left Recursion and Left Factoring
- The two standard techniques for solving these problems are
- Left recursion removal
- Left factoring
- Note that there is no guarantee that using these techniques will result in an LL(1) grammar!
- (Similarly, there was no guarantee about using EBNF to solve the problems)
56. Left Recursion and Left Factoring
- But, in practice, these two techniques are very useful, because they're very often successful
- And, they can be automated
57. Left Recursion Removal
- Why is there a problem?
- Because left recursion often is used to make operations left associative
- For example
- exp → exp addop term | term
- and, expanded
- exp → exp + term
-     | exp - term
-     | term
58. Left Recursion Removal
- These are both examples of direct left recursion (or, immediate left recursion)
- A more difficult case occurs when one has indirect left recursion
- A → B c
- B → A d
59. Removing Immediate Left Recursion
- In the case of immediate left recursion we have
- A → A α | β
- where α and β are strings of terminals and nonterminals, and β does not begin with A
- We rewrite this as a pair of rules
- A → β A'
- A' → α A' | ε
60. Removing Immediate Left Recursion
- For example
- exp → exp addop term | term
- becomes
- exp → term exp'
- exp' → addop term exp' | ε
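The rewrite is mechanical enough to sketch in code. In this Python sketch, a grammar is a dict mapping a nonterminal to a list of right-hand sides (each a list of symbols); that representation, the function name, and the "ε" marker are assumptions made for the example.

```python
EPSILON = "ε"

def remove_immediate_left_recursion(nt, productions):
    """Rewrite A -> A α | β as A -> β A' ; A' -> α A' | ε."""
    recursive = [rhs[1:] for rhs in productions if rhs[:1] == [nt]]   # the α's
    others    = [rhs for rhs in productions if rhs[:1] != [nt]]       # the β's
    if not recursive:
        return {nt: productions}       # no immediate left recursion
    new_nt = nt + "'"
    return {
        nt:     [beta + [new_nt] for beta in others],       # A  -> β A'
        new_nt: [alpha + [new_nt] for alpha in recursive]   # A' -> α A'
                + [[EPSILON]],                              # A' -> ε
    }

g = remove_immediate_left_recursion("exp",
                                    [["exp", "addop", "term"], ["term"]])
# g["exp"]  == [["term", "exp'"]]
# g["exp'"] == [["addop", "term", "exp'"], ["ε"]]
```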
61. Left Recursion Removal
- The text describes a more general algorithm which will handle grammars having no ε-productions and no cycles
- In practice, no grammars for programming languages have cycles
- But, they may well have ε-productions
- (Usually, the ε-productions occur in restricted cases which can be dealt with)
62. Left Recursion Removal
- Left recursion removal does not change the language being recognized, but since the grammar is changed, the resulting parse trees also are changed
- This may cause complications for the parser designer and for the resulting compiler
63. Left Recursion Removal
- In particular, since the new grammar is no longer left recursive, creating a corresponding left-associative parse tree becomes somewhat of a challenge
64. Left Recursion Removal
- The challenge is met by passing information from one portion of the parser to another using parameters
65. Left Factoring
- Left factoring is required when two or more productions share a common prefix string
- For example: A → α β | α γ
- Here's a concrete example
- stmtSeq → stmt ; stmtSeq | stmt
- stmt → s
66. Left Factoring
- Another concrete example
- ifStmt → if ( exp ) stmt
-        | if ( exp ) stmt else stmt
- An LL(1) parser cannot distinguish between the alternatives
- So, a simple alternative is to factor the α out on the left and to rewrite the rule as two rules
67. Left Factoring
- So, A → α β | α γ becomes
- A → α A'
- A' → β | γ
- If we allowed parentheses in BNF, we could rewrite this as
- A → α ( β | γ )
- That's exactly how algebraic factoring appears in arithmetic
68. Left Factoring
- Consider our ifStmt example
- ifStmt → if ( exp ) stmt
-        | if ( exp ) stmt else stmt
- The left-factored form is
- ifStmt → if ( exp ) stmt elsePart
- elsePart → else stmt | ε
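One round of left factoring can also be sketched in code. This Python sketch groups productions by their first symbol and factors out the longest common prefix of each group; the dict-of-lists grammar representation is an assumption, and the sketch assumes at most one group needs factoring (enough for the ifStmt example).

```python
from collections import defaultdict

EPSILON = "ε"

def common_prefix(seqs):
    """Longest common prefix of a list of symbol lists."""
    prefix = []
    for symbols in zip(*seqs):
        if len(set(symbols)) == 1:
            prefix.append(symbols[0])
        else:
            break
    return prefix

def left_factor(nt, productions):
    """One round of left factoring; assumes nonempty right-hand sides."""
    groups = defaultdict(list)
    for rhs in productions:
        groups[rhs[0]].append(rhs)      # group by shared first symbol
    result = {nt: []}
    for group in groups.values():
        if len(group) == 1:
            result[nt].append(group[0])
            continue
        p = common_prefix(group)
        new_nt = nt + "'"
        result[nt].append(p + [new_nt])                                # A  -> α A'
        result[new_nt] = [rhs[len(p):] or [EPSILON] for rhs in group]  # A' -> β | γ
    return result

g = left_factor("ifStmt", [
    ["if", "(", "exp", ")", "stmt"],
    ["if", "(", "exp", ")", "stmt", "else", "stmt"],
])
# g["ifStmt"]  == [["if", "(", "exp", ")", "stmt", "ifStmt'"]]
# g["ifStmt'"] == [["ε"], ["else", "stmt"]]
```

The factored nonterminal ifStmt' plays exactly the role of elsePart above.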
69. LL(1) Problem
- Here's a typical case where a grammar for a programming language fails to be LL(1), because both assignments and procedure calls begin with an identifier
- stmt → assignStmt | callStmt | other
- assignStmt → identifier := exp
- callStmt → identifier ( expList )
70. LL(1) Problem
- The grammar is not LL(1) because identifier is shared as the first token of both assignStmt and callStmt, and thus could be the lookahead token for either
- Worse: the grammar is not in a form that can be left factored
- The text shows a solution, but it's ugly
71. First and Follow Sets
- In order to complete the discussion of LL(1) parsing, we must develop an algorithm that constructs the LL(1) parsing table
- This involves (finally) computing the First and Follow sets
72. First Sets
- Let X be a grammar symbol (terminal or nonterminal) or ε. Then the set First(X) consists of terminals (and possibly ε) as follows
- If X is a terminal or ε, then First(X) = { X }
- If X is a nonterminal, then First(X) is computed as follows
73. First Sets
- If X is a nonterminal, then for each production choice X → X1 X2 … Xn, First(X) contains First(X1) − { ε }. Also, if for some i < n all the sets First(X1), …, First(Xi) contain ε, then First(X) contains First(Xi+1) − { ε }. If all the sets First(X1), …, First(Xn) contain ε, then First(X) also contains ε
74. First Sets
- We can extend the definition to First(α), where α is any string of terminals and nonterminals
75. First Sets
- It's pretty easy to see how this definition can be interpreted in the absence of ε-productions
- Keep adding First(X1) to First(A) for each nonterminal A and production choice A → X1 …, until no further additions occur
- This process is called computing the transitive closure
76. First Sets
- If the grammar has ε-productions, then the situation is more complicated, because some of the nonterminals may disappear
- Such a nonterminal is called nullable
- One can find the nullable nonterminals using transitive closure and then remove them from First(A)
77. Nullable Nonterminals
- Definition: A nonterminal A is nullable if there is a derivation A ⇒* ε
- Theorem: A nonterminal A is nullable iff First(A) contains ε
78. Example of First(A)
- Given our simple grammar
- exp → exp addop term | term
- addop → + | -
- term → term mulop factor | factor
- mulop → *
- factor → ( exp ) | number
- First(exp) = { (, number }
- First(term) = { (, number }
- First(factor) = { (, number }
- First(addop) = { +, - }
- First(mulop) = { * }
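The transitive-closure computation can be sketched directly from the definition. In this Python sketch a grammar is a dict mapping each nonterminal to a list of right-hand sides (lists of symbols, with "ε" for an empty one); that representation is an assumption of the example.

```python
EPSILON = "ε"

def first_sets(grammar):
    """Compute First for every nonterminal by transitive closure."""
    first = {nt: set() for nt in grammar}

    def first_of(x):
        return first[x] if x in grammar else {x}   # terminal: First(x) = {x}

    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                before = len(first[nt])
                for x in rhs:
                    if x == EPSILON:
                        first[nt].add(EPSILON)
                        break
                    first[nt] |= first_of(x) - {EPSILON}
                    if EPSILON not in first_of(x):
                        break                      # prefix no longer nullable
                else:
                    first[nt].add(EPSILON)         # every symbol was nullable
                if len(first[nt]) != before:
                    changed = True
    return first

GRAMMAR = {
    "exp":    [["exp", "addop", "term"], ["term"]],
    "addop":  [["+"], ["-"]],
    "term":   [["term", "mulop", "factor"], ["factor"]],
    "mulop":  [["*"]],
    "factor": [["(", "exp", ")"], ["number"]],
}
# first_sets(GRAMMAR)["exp"] == {"(", "number"}
```

Running it on the expression grammar reproduces the sets listed above.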
79. Follow Sets
- Given a nonterminal A, the set Follow(A), consisting of terminals (and possibly $), is defined as follows
- If A is the start symbol, then $ is in Follow(A)
- If there is a production B → α A γ, then First(γ) − { ε } is in Follow(A)
- If there is a production B → α A γ such that ε is in First(γ), then Follow(A) contains Follow(B)
80. Follow Sets
- Note that $ functions as a token in the calculation of Follow sets
- Note that ε never is an element of a Follow set
- Follow sets only are defined for nonterminals
- Follow sets only contain terminals (just like First sets)
81. Follow Sets
- Let's again examine the grammar
- exp → exp addop term
- exp → term
- addop → +
- addop → -
- term → term mulop factor
- term → factor
- mulop → *
- factor → ( exp )
- factor → number
82. Follow Sets
- First(exp) = { (, number }
- First(term) = { (, number }
- First(factor) = { (, number }
- First(addop) = { +, - }
- First(mulop) = { * }
- Follow(exp) = { $, +, -, ) }
- Follow(addop) = { (, number }
- Follow(term) = { $, +, -, *, ) }
- Follow(mulop) = { (, number }
- Follow(factor) = { $, +, -, *, ) }
83. Follow Sets for ifStmt
- Consider again the grammar
- stmt → ifStmt
- stmt → other
- ifStmt → if ( exp ) stmt elsePart
- elsePart → else stmt
- elsePart → ε
- exp → 0
- exp → 1
84. Follow Sets for ifStmt
- First(stmt) = { if, other }
- First(ifStmt) = { if }
- First(elsePart) = { else, ε }
- First(exp) = { 0, 1 }
- Follow(stmt) = { $, else }
- Follow(ifStmt) = { $, else }
- Follow(elsePart) = { $, else }
- Follow(exp) = { ) }
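The three Follow rules translate into another closure computation. This Python sketch uses the ifStmt grammar, with the First sets hard-coded from the list above so the example stays self-contained; the dict-based grammar representation is an assumption.

```python
EPSILON = "ε"

GRAMMAR = {
    "stmt":     [["ifStmt"], ["other"]],
    "ifStmt":   [["if", "(", "exp", ")", "stmt", "elsePart"]],
    "elsePart": [["else", "stmt"], [EPSILON]],
    "exp":      [["0"], ["1"]],
}

FIRST = {  # hard-coded from the First sets computed earlier
    "stmt":     {"if", "other"},
    "ifStmt":   {"if"},
    "elsePart": {"else", EPSILON},
    "exp":      {"0", "1"},
}

def first_of(x):
    return FIRST[x] if x in GRAMMAR else {x}

def follow_sets(grammar, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")                    # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for b, productions in grammar.items():
            for rhs in productions:
                for i, a in enumerate(rhs):
                    if a not in grammar:
                        continue              # Follow is only defined for nonterminals
                    before = len(follow[a])
                    nullable_tail = True
                    for x in rhs[i + 1:]:
                        follow[a] |= first_of(x) - {EPSILON}   # rule 2
                        if EPSILON not in first_of(x):
                            nullable_tail = False
                            break
                    if nullable_tail:
                        follow[a] |= follow[b]                 # rule 3
                    if len(follow[a]) != before:
                        changed = True
    return follow
```

Running `follow_sets(GRAMMAR, "stmt")` reproduces the Follow sets listed above.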
85. Constructing the LL(1) Parsing Table M[N, T]
- We now have a better way to go about giving the rules for constructing M[N, T]
- For each production choice A → α and each token a in First(α), add A → α to the entry M[A, a]
- If ε is in First(α), then for each element a of Follow(A) (a token or $), add A → α to M[A, a]
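These two rules can be sketched as a table builder. This Python sketch uses the ifStmt grammar with its First and Follow sets hard-coded from the earlier slides; the dict-of-lists table representation is an assumption. Note how the dangling-else problem surfaces as a multiply-defined entry, showing the grammar is not LL(1).

```python
EPSILON = "ε"

GRAMMAR = {
    "stmt":     [["ifStmt"], ["other"]],
    "ifStmt":   [["if", "(", "exp", ")", "stmt", "elsePart"]],
    "elsePart": [["else", "stmt"], [EPSILON]],
    "exp":      [["0"], ["1"]],
}
FIRST = {"stmt": {"if", "other"}, "ifStmt": {"if"},
         "elsePart": {"else", EPSILON}, "exp": {"0", "1"}}
FOLLOW = {"stmt": {"$", "else"}, "ifStmt": {"$", "else"},
          "elsePart": {"$", "else"}, "exp": {")"}}

def first_of_string(rhs):
    """First(α) for a string α of grammar symbols."""
    result = set()
    for x in rhs:
        fx = FIRST[x] if x in GRAMMAR else {x}
        result |= fx - {EPSILON}
        if EPSILON not in fx:
            return result
    result.add(EPSILON)        # every symbol was nullable
    return result

def build_table(grammar):
    table = {}
    for a, productions in grammar.items():
        for rhs in productions:
            f = first_of_string(rhs)
            targets = f - {EPSILON}
            if EPSILON in f:
                targets |= FOLLOW[a]    # second rule: use Follow(A)
            for tok in targets:
                table.setdefault((a, tok), []).append(rhs)
    return table

M = build_table(GRAMMAR)
# M["elsePart", "else"] holds two choices, so the grammar is not LL(1)
```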
86. LL(k)
- These ideas can be extended to k-token lookahead
- But, the tables get exponentially large
- And, remember, recursive-descent parsers are able to use lookahead selectively, changing the value of k dynamically
- And, recursive-descent can handle grammars that are not LL(k) for any k!
87. Error Recovery in Top-Down Parsers
- How well the parser responds to syntax errors frequently determines the usefulness of a compiler
- A parser must, at least, detect whether a program is syntactically correct
- Such a parser is called a recognizer
88. Error Recovery
- Hopefully a parser will do some amount of error correction (more properly, error repair)
- Most of the time, error repair is limited to cases that are relatively safe to perform
- For example, inserting missing punctuation or deleting extraneous punctuation
89. Error Recovery
- It should be obvious that significant error repair (repair of semantic errors) not only is far beyond the scope of today's compilers; in fact, it is theoretically impossible to accomplish in the general case
- A compiler cannot know a programmer's intent; it only can read what a programmer has written
90. Minimal Distance Correction
- There is a collection of algorithms that can be applied to attempt to repair programs, where the correction is performed within some minimal distance of the detected error
- This distance usually is given in terms of some number of tokens on either side of the error point
91. Minimal Distance Correction
- In practice, even this minimal attempt at error repair usually is not performed by production compilers
- Compiler writers find themselves challenged far more than enough just attempting to generate meaningful error messages
92. General Principles
- Here are some general principles that should be considered
- Parsers should determine that an error has occurred as soon as possible and should indicate the point of error
- After detecting an error, a parser should pick a likely place to continue parsing
- A parser should parse as much code as possible
93. General Principles
- Parsers should attempt to avoid the cascading error problem: one error causing subsequent, usually spurious, errors
- Parsers should not get stuck in an infinite loop, especially while issuing warnings and/or error messages
- Parsers should issue messages with as much accuracy and help as possible
94. Error Recovery in Recursive-Descent Parsers
- One standard form of error recovery in recursive-descent parsers is called panic mode
- In this mode, parsing is suspended and input tokens are consumed until a recovery point is identified
- Parsing resumes at that point
- In the worst case, the entire rest of the program is not parsed
95. Recovery Points
- Identifying likely recovery points is extremely difficult, but some general rules usually work
- Recover after the end of a statement (after a semi-colon)
- Recover after the conclusion of a control structure (like a block)
- Recover after the end of a method, procedure, structure, class, …
96. Recovery Points
- Recovery points are identified using pattern matching; you cannot really use the parser itself, since it's what's causing the problem in the first place
97. Recovery Points
- One way to implement this is to provide each recursive-descent procedure with an additional argument: a collection of synchronizing tokens
- These are used to re-synchronize the parsing process in the event that a syntax error is detected
- Generally, Follow sets are good candidates for synchronizing tokens
98. Recovery Points
- First sets can provide early detection of a syntax error