Title: Top-Down Parsing
1. Top-Down Parsing
- Top-Down Parsing by Recursive-Descent
- LL(1) Parsing
- First and Follow Sets
- Error Recovery in Top-Down Parsers
2. Top-Down Parsing
- A top-down parsing algorithm parses an input string of tokens by tracing out the steps in a leftmost derivation
- This is called top-down because the implied traversal of the parse tree is in preorder, and thus occurs from the root to the leaves
3. Top-Down Parsing
- Top-down parsers come in two forms
- Backtracking parsers
- Predictive parsers
- A predictive parser attempts to predict the next construction in the input string using one or more lookahead tokens
- A backtracking parser will try different possibilities for a parse, backing up an arbitrary amount when it finds that it is mistaken
4. Top-Down Parsing
- Backtracking parsers generally are more powerful than predictive ones
- But, they're also considerably slower: they can require exponential time to complete a parse
- This means that backtracking parsers are unsuitable for production-grade compilers
5. Top-Down Parsing
- We'll study the two most common forms of top-down, predictive parsing
- Recursive-descent parsing
- LL(1) parsing
- Recursive-descent parsing is very versatile, easy to implement, and is suitable for generating a parser by hand
6. Top-Down Parsing
- LL(1) parsing is no longer used in practice, but it serves as a good introduction to the notions we'll need later: those involving bottom-up rather than top-down parsing
7. LL(1) Parsing
- The LL(1) parsing method gets its name from
- First L: process the input from left to right (some early parsing techniques processed from right to left; this is no longer done today)
- Second L: use a leftmost derivation for the input string
- The 1 indicates that only one token of input is used to predict the direction of the parse
8. Lookahead Sets
- Both recursive-descent and LL(1) parsing generally require the computation of sets called First and Follow
- But, a simple top-down parser can be constructed without calculating these sets, so we'll examine this case first
9. Top-Down Parsing by Recursive-Descent
- The basic idea of recursive-descent is simplicity itself!
- We view a grammar rule for a nonterminal A as a definition for a (recursive) procedure that will recognize A
- The RHS of the rule specifies precisely what must be done in order to recognize A
10. Top-Down Parsing by Recursive-Descent
- In other words, the rules of the grammar are the specification of a program for recognizing the sentences of the language!
11. Top-Down Parsing by Recursive-Descent
- For example
- Start → able Baker charlie
- Baker → delta
- These two productions define
- Start()
-   Match(able)
-   Baker()
-   Match(charlie)
- Baker()
-   Match(delta)
12. Top-Down Parsing by Recursive-Descent
- Assume for a moment that Match() is a primitive function that calls the scanner
- It returns normally if it is successful
- It throws an exception if it is unsuccessful
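The Start/Baker example can be sketched as runnable code. This is a minimal sketch in Python rather than the slides' pseudocode; the list-of-strings token stream and the ParseError exception are assumptions made for the example.

```python
class ParseError(Exception):
    pass

def parse(tokens):
    pos = 0  # index of the current lookahead token

    def match(expected):
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == expected:
            pos += 1          # consume the token and advance
        else:
            raise ParseError(f"expected {expected!r} at position {pos}")

    def start():              # Start -> able Baker charlie
        match("able")
        baker()
        match("charlie")

    def baker():              # Baker -> delta
        match("delta")

    start()
    if pos != len(tokens):    # all input must be consumed
        raise ParseError("trailing input")

parse(["able", "delta", "charlie"])   # succeeds silently
```

Each nonterminal becomes a procedure, and Match plays the role of the primitive scanner call described above.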
13. Top-Down Parsing by Recursive-Descent
- Clearly, to make this approach work we need to be able to handle
- Concatenation (done!)
- Alternation (either/or)
- Repetition (Kleene * and +)
- Multiple rules with the same LHS
- I.e., we need to be able to handle BNF and EBNF
- Some kind of error recovery would be nice
14. Top-Down Parsing by Recursive-Descent
- Consider the expression grammar from the previous chapter
- exp → exp addop term | term
- addop → + | -
- term → term mulop factor | factor
- mulop → *
- factor → ( exp ) | number
- Consider the rule for factor
15. Top-Down Parsing by Recursive-Descent
- Here's some pseudocode for factor
- procedure factor
- begin
-   case token of
-     ( : match( ( )
-         exp
-         match( ) )
-     number :
-         match( number )
-     else error
-   end case
- end factor
16. Top-Down Parsing by Recursive-Descent
- It is assumed that there is a variable token that holds the current next token in the input (so this example uses one symbol of lookahead)
- We also assume a match procedure that matches the current next token with its parameter. It advances the input if it succeeds, and declares an error if it fails
17. Top-Down Parsing by Recursive-Descent
- Pseudocode for match
- procedure match( expectedToken )
- begin
-   if token = expectedToken then
-     getToken
-   else
-     error
- end match
18. Top-Down Parsing by Recursive-Descent
- Each reference to a nonterminal on the RHS becomes a call to a procedure by that name
- Each reference to a terminal on the RHS becomes a call to match with the terminal as argument
- So far things are relatively simple and straightforward
- Things are about to change
19. Repetition and Choice: EBNF
- Consider the simplified BNF syntax for an if-statement
- ifStmt → if ( exp ) statement
-        | if ( exp ) statement else statement
20. Repetition and Choice: EBNF
- This can be translated into
- proc ifStmt ()
- begin
-   match( if )
-   match( ( )
-   exp()
-   match( ) )
-   statement()
-   if token = else then
-     match( else )
-     statement()
-   end if
- end ifStmt
21. EBNF vs. BNF
- This procedure demonstrates the fact that we cannot distinguish which of the two forms of if-statement we have until we encounter (or don't) the else
- It corresponds far more precisely to the EBNF
- ifStmt → if ( exp ) stmt [ else stmt ]
22. EBNF vs. BNF
- EBNF notation is designed to mirror the actual code that one would produce in a recursive-descent parser!
- So, it's excellent for our purposes
23. EBNF vs. BNF
- Consider the BNF syntax
- exp → exp addop term | term
- In recursive-descent pseudocode you can see that you'd wind up with infinite recursion
- But, if you rephrase this using EBNF
- exp → term { addop term }
- there is no difficulty
24. EBNF vs. BNF
- The resulting pseudocode looks like this
- proc exp ()
- begin
-   term()
-   while token = + or token = - do
-     match( token )
-     term()
-   end while
- end exp
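The EBNF-driven procedures can be collected into a complete recognizer for the expression grammar. This is a sketch in Python rather than pseudocode; the token representation (the literal strings "+", "-", "*", "(", ")", and "number", with "$" as an end-of-input sentinel) is an assumption of the example.

```python
class ParseError(Exception):
    pass

def recognize(tokens):
    tokens = tokens + ["$"]            # end-of-input sentinel
    pos = 0

    def token():
        return tokens[pos]             # the one lookahead token

    def match(expected):
        nonlocal pos
        if token() == expected:
            pos += 1                   # advance the input
        else:
            raise ParseError(f"expected {expected!r}, got {token()!r}")

    def exp():                         # exp -> term { addop term }
        term()
        while token() in ("+", "-"):
            match(token())
            term()

    def term():                        # term -> factor { mulop factor }
        factor()
        while token() == "*":
            match("*")
            factor()

    def factor():                      # factor -> ( exp ) | number
        if token() == "(":
            match("(")
            exp()
            match(")")
        else:
            match("number")            # anything else must be a number

    exp()
    match("$")                         # all input must be consumed

recognize(["number", "+", "number", "*", "number"])   # accepted
```

Note how each `while` loop in the code corresponds directly to a `{ ... }` repetition in the EBNF.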
25. Extending to Semantics
- We need to be able to extend the syntax to include semantics
- And, we want to be certain that arithmetic operations are left-associative, as expected
- We'll not handle the syntax portion just now
- But, we can extend the pseudocode as follows
26. Extending to Semantics
- function exp() : integer
- var tmp : integer
- begin
-   tmp := term()
-   while token = + or token = - do
-     case token of
-       + : match( + )
-           tmp := tmp + term()
-       - : match( - )
-           tmp := tmp - term()
-     end case
-   end while
-   return tmp
- end exp
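The value-returning pseudocode can be sketched as a tiny working interpreter. This Python sketch extends the same pattern down through term and factor; representing numbers as Python ints in the token list is an assumption made for the example.

```python
class ParseError(Exception):
    pass

def evaluate(tokens):
    tokens = tokens + ["$"]          # end-of-input sentinel
    pos = 0

    def token():
        return tokens[pos]

    def match(expected):
        nonlocal pos
        if token() == expected:
            pos += 1
        else:
            raise ParseError(f"expected {expected!r}, got {token()!r}")

    def exp():                       # exp -> term { addop term }
        tmp = term()
        while token() in ("+", "-"): # left-associative by construction:
            op = token()             # each iteration folds the next term
            match(op)                # into the running result
            if op == "+":
                tmp = tmp + term()
            else:
                tmp = tmp - term()
        return tmp

    def term():                      # term -> factor { mulop factor }
        tmp = factor()
        while token() == "*":
            match("*")
            tmp = tmp * factor()
        return tmp

    def factor():                    # factor -> ( exp ) | number
        if token() == "(":
            match("(")
            val = exp()
            match(")")
            return val
        if isinstance(token(), int):
            val = token()
            match(val)
            return val
        raise ParseError(f"unexpected token {token()!r}")

    result = exp()
    match("$")
    return result

evaluate([8, "-", 3, "-", 2])   # left-associative: (8 - 3) - 2 = 3
```

The running `tmp` variable is what makes subtraction group to the left, exactly as the slide's pseudocode intends.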
27. Extending to Semantics
- This method of turning an EBNF grammar into code is very powerful
- One can use it to create complete compilers or complete interpreters
28. Extending to Semantics
- One must be careful to set up a collection of conventions regarding keeping token current, what match() really does, how getToken() performs, etc.
- But, there are no significant challenges or obstacles to using this approach
- Moreover, one can use this approach to create a syntax tree
29. Building a Syntax Tree
- Consider the syntax tree for 3 + 4 + 5
-        +
-       / \
-      +   5
-     / \
-    3   4
- The node representing the sum of 3 and 4 must be created before the node representing its sum with 5
30. Building a Syntax Tree
- We could use the following pseudocode
- function exp () : syntaxTree
- var tmp, newTmp : syntaxTree
- begin
-   tmp := term()
-   while token = + or token = - do
-     newTmp := makeOpNode( token )
-     match( token )
-     leftChild( newTmp ) := tmp
-     rightChild( newTmp ) := term()
-     tmp := newTmp
-   end while
-   return tmp
- end exp
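The makeOpNode idea can be sketched in Python. The Node class and the simplification that a term is a single number are assumptions made so the example stays small; the loop body mirrors the pseudocode line for line.

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data          # operator string or numeric value
        self.left = left
        self.right = right

def exp_tree(tokens):
    """Build a syntax tree for number { (+|-) number }."""
    pos = 0

    def term():                       # simplified: a term is one number
        nonlocal pos
        node = Node(tokens[pos])
        pos += 1
        return node

    tmp = term()
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        new_tmp = Node(tokens[pos])   # newTmp := makeOpNode( token )
        pos += 1                      # match( token )
        new_tmp.left = tmp            # leftChild( newTmp ) := tmp
        new_tmp.right = term()        # rightChild( newTmp ) := term()
        tmp = new_tmp
    return tmp

tree = exp_tree([3, "+", 4, "+", 5])
# the root's left child is the earlier "+" node: ((3 + 4) + 5)
```

Because the old tree always becomes the left child of the new operator node, the tree comes out left-associative.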
31. Building a Syntax Tree
- We've introduced a new function makeOpNode that creates a new node (for an operator)
- Nodes are assumed to be binary tree nodes, with room for one piece of data, a left child, and a right child
- The data can be an operator or a value (so there'd likely be a tag to distinguish these cases)
32. Building a Syntax Tree
- Note that the pseudocode does, indeed, produce a syntax tree and not a parse tree
- The flexibility of the recursive-descent method that we've described makes it the method of choice for hand-generated parsers (compilers, interpreters)
33. Some Problems (1)
- First, it may be difficult to translate a BNF grammar into an equivalent EBNF grammar
- You must be certain that the original and final grammars do, indeed, describe identical languages
34. Some Problems (2)
- What if you have a production like
- A → α | β
- where α and β both begin with nonterminals?
- How can you tell which production is the right one to use?
- The answer to this question requires the computation of the First sets of α and β: the set of tokens that can legally begin each string
35. Some Problems (3)
- What happens if we have an ε-production?
- In this case it may be necessary to know what tokens can legally come after a nonterminal
- This requires the computation of the Follow set of the nonterminal
36. Some Problems (4)
- What about error detection?
- We want to detect incorrect syntax as early as possible
- We'd like to be able to recover from an error and continue to parse
- Further, we may want to attempt to correct an error if it's possible to do so
37. Basic LL(1) Parsing
- LL(1) parsing uses an explicit stack rather than recursive calls to perform a parse
- It's helpful to visualize this stack in a standard way so that the actions of the LL(1) parser can be seen and discussed
38. Basic LL(1) Parsing
- We'll use this very simple grammar to illustrate things
- S → ( S ) S | ε
- This grammar produces strings of balanced parentheses
- L(S) = { ε, (), ()(), (()), … }
39. Basic LL(1) Parsing
- Input: ( )
- [The slide's step-by-step table of stack contents and remaining input is not reproduced here; $ marks the bottom of the stack and the end of the input]
40. Basic LL(1) Parsing
- The general pattern is
- We start with
-   $ StartSymbol      InputString $
-   ...
-   $                  $    Accept!
- A top-down parser parses by replacing a nonterminal at the top of the stack by one of the choices provided by the grammar rules
41. Basic LL(1) Parsing
- It selects the correct rule by examining the next input symbol (the front of the remaining input, which can itself be viewed as a stack)
- There are two actions
- Replace a nonterminal A at the top of the stack by a string α, using a rule A → α
- Match a token on the top of the stack with the next input token, and remove them both
42. Basic LL(1) Parsing
- If we want to construct a parse tree as the parse proceeds, we can add node construction actions as each nonterminal or terminal is pushed onto the stack
- If we want, we can construct a syntax tree instead of a parse tree
43. LL(1) Parsing Table Algorithm
- Using this parsing method, when a nonterminal A is at the top of the parsing stack, a decision must be made, based on the current input token (the lookahead token), about which grammar rule choice for A to use when replacing A on the stack
44. LL(1) Parsing Table Algorithm
- If a (terminal) token is at the top of the stack, no decision is necessary: either it is identical to the input token and a match occurs, or it isn't identical and an error occurs (because the input is incorrect)
45. LL(1) Parsing Table Algorithm
- We can express these two choices in tabular form by constructing an LL(1) parsing table
- This table is a 2-D array indexed by nonterminals and terminals, and contains the production choices to use at the appropriate parsing step
- We'll call this table M[N, T]
46. LL(1) Parsing Table Algorithm
- N is the set of nonterminals
- T is the set of terminals (tokens)
- M is a table of moves or actions to take in order to perform a parse
- We'll construct the entries for M in a moment
- Any entries that remain empty constitute error conditions (i.e., indications of bad input)
47. Constructing M[N, T]
- We add entries to M as follows
- If A → α is a production choice, and there is a derivation α ⇒* a β, where a is a token, then add A → α to the table at location M[A, a]
- If A → α is a production choice, and there are derivations α ⇒* ε and S $ ⇒* β A a γ, where S is the start symbol and a is a token (or $), then add A → α to the table at location M[A, a]
48. Constructing M[N, T]
- The ideas behind these rules
- Given a token a in the input, we wish to select a rule A → α if α can produce an a for matching
- If A derives the empty string (via A → α and α ⇒* ε), and if a is a token that can legally come after A in a derivation, then we want to select A → α to make A disappear
49. Constructing M[N, T]
- These rules are a bit difficult to carry out by hand
- But, they're simplified by the construction of the First and Follow sets that we mentioned earlier (but have yet to really define)
50. Definition
- An LL(1) grammar is one for which the associated LL(1) parsing table has at most one production in each table entry
- Note that such a grammar is unambiguous
51. Example M[N, T] for ifStmt
52. Parsing an ifStmt using M[N, T]
- Let's watch the parsing process proceed using the string
- if ( 0 ) if ( 1 ) other else other
- We'll use some abbreviations
- statement = S
- ifStmt = I
- elsePart = L
- exp = E
- if = i
- else = e
- other = o
53. Parsing an ifStmt using M[N, T]
54. Left Recursion and Left Factoring
- Repetition and choice in LL(1) parsing suffer from problems similar to those occurring in recursive-descent parsing
- We solved these problems for recursive-descent parsing by moving to EBNF notation
- We can't use the same technique here; we must rewrite within BNF
55. Left Recursion and Left Factoring
- The two standard techniques for solving these problems are
- Left recursion removal
- Left factoring
- Note that there is no guarantee that using these techniques will result in an LL(1) grammar!
- (Similarly, there was no guarantee about using EBNF to solve the problems)
56. Left Recursion and Left Factoring
- But, in practice, these two techniques are very useful, because they're very often successful
- And, they can be automated
57. Left Recursion Removal
- Why is there a problem?
- Because left recursion often is used to make operations left associative
- For example
- exp → exp addop term | term
- and, expanded
- exp → exp + term
-     | exp - term
-     | term
58. Left Recursion Removal
- These are both examples of direct left recursion (or, immediate left recursion)
- A more difficult case occurs when one has indirect left recursion
- A → B c
- B → A d
59. Removing Immediate Left Recursion
- In the case of immediate left recursion we have
- A → A α | β
- where α and β are strings of terminals and nonterminals, and β does not begin with A
- We rewrite this as a pair of rules
- A → β A'
- A' → α A' | ε
60. Removing Immediate Left Recursion
- For example
- exp → exp addop term | term
- becomes
- exp → term exp'
- exp' → addop term exp' | ε
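The rewrite is mechanical enough to sketch in code. In this Python sketch, a grammar is a dict mapping a nonterminal to a list of right-hand sides (each a list of symbols); that representation, the function name, and the "ε" marker are assumptions made for the example.

```python
EPSILON = "ε"

def remove_immediate_left_recursion(nt, productions):
    """Rewrite A -> A α | β as A -> β A' ; A' -> α A' | ε."""
    recursive = [rhs[1:] for rhs in productions if rhs[:1] == [nt]]   # the α's
    others    = [rhs for rhs in productions if rhs[:1] != [nt]]       # the β's
    if not recursive:
        return {nt: productions}       # no immediate left recursion
    new_nt = nt + "'"
    return {
        nt:     [beta + [new_nt] for beta in others],       # A  -> β A'
        new_nt: [alpha + [new_nt] for alpha in recursive]   # A' -> α A'
                + [[EPSILON]],                              # A' -> ε
    }

g = remove_immediate_left_recursion("exp",
                                    [["exp", "addop", "term"], ["term"]])
# g["exp"]  == [["term", "exp'"]]
# g["exp'"] == [["addop", "term", "exp'"], ["ε"]]
```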
61. Left Recursion Removal
- The text describes a more general algorithm which will handle grammars having no ε-productions and no cycles
- In practice, no grammars for programming languages have cycles
- But, they may well have ε-productions
- (Usually, the ε-productions occur in restricted cases which can be dealt with)
62. Left Recursion Removal
- Left recursion removal does not change the language being recognized, but since the grammar is changed, the resulting parse trees also are changed
- This may cause complications for the parser designer and for the resulting compiler
63. Left Recursion Removal
- In particular, since the new grammar is no longer left recursive, creating a corresponding left-associative parse tree becomes somewhat of a challenge
64. Left Recursion Removal
- The challenge is met by passing information from one portion of the parser to another using parameters
65. Left Factoring
- Left factoring is required when two or more productions share a common prefix string
- For example: A → α β | α γ
- Here's a concrete example
- stmtSeq → stmt ; stmtSeq | stmt
- stmt → s
66. Left Factoring
- Another concrete example
- ifStmt → if ( exp ) stmt
-        | if ( exp ) stmt else stmt
- An LL(1) parser cannot distinguish between the alternatives
- So, a simple alternative is to factor the α out on the left and to rewrite the rule as two rules
67. Left Factoring
- So, A → α β | α γ becomes
- A → α A'
- A' → β | γ
- If we allowed parentheses in BNF, we could rewrite this as
- A → α ( β | γ )
- That's exactly how algebraic factoring appears in arithmetic
68. Left Factoring
- Consider our ifStmt example
- ifStmt → if ( exp ) stmt
-        | if ( exp ) stmt else stmt
- The left-factored form is
- ifStmt → if ( exp ) stmt elsePart
- elsePart → else stmt | ε
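One round of left factoring can also be sketched in code. This Python sketch groups productions by their first symbol and factors out the longest common prefix of each group; the dict-of-lists grammar representation is an assumption, and the sketch assumes at most one group needs factoring (enough for the ifStmt example).

```python
from collections import defaultdict

EPSILON = "ε"

def common_prefix(seqs):
    """Longest common prefix of a list of symbol lists."""
    prefix = []
    for symbols in zip(*seqs):
        if len(set(symbols)) == 1:
            prefix.append(symbols[0])
        else:
            break
    return prefix

def left_factor(nt, productions):
    """One round of left factoring; assumes nonempty right-hand sides."""
    groups = defaultdict(list)
    for rhs in productions:
        groups[rhs[0]].append(rhs)      # group by shared first symbol
    result = {nt: []}
    for group in groups.values():
        if len(group) == 1:
            result[nt].append(group[0])
            continue
        p = common_prefix(group)
        new_nt = nt + "'"
        result[nt].append(p + [new_nt])                                # A  -> α A'
        result[new_nt] = [rhs[len(p):] or [EPSILON] for rhs in group]  # A' -> β | γ
    return result

g = left_factor("ifStmt", [
    ["if", "(", "exp", ")", "stmt"],
    ["if", "(", "exp", ")", "stmt", "else", "stmt"],
])
# g["ifStmt"]  == [["if", "(", "exp", ")", "stmt", "ifStmt'"]]
# g["ifStmt'"] == [["ε"], ["else", "stmt"]]
```

The factored nonterminal ifStmt' plays exactly the role of elsePart above.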
69. LL(1) Problem
- Here's a typical case where a grammar for a programming language fails to be LL(1), because both assignments and procedure calls begin with an identifier
- stmt → assignStmt | callStmt | other
- assignStmt → identifier := exp
- callStmt → identifier ( expList )
70. LL(1) Problem
- The grammar is not LL(1) because identifier is shared as the first token of both assignStmt and callStmt, and thus could be the lookahead token for either
- Worse: the grammar is not in a form that can be left factored
- The text shows a solution, but it's ugly
71. First and Follow Sets
- In order to complete the discussion of LL(1) parsing, we must develop an algorithm that constructs the LL(1) parsing table
- This involves (finally) computing the First and Follow sets
72. First Sets
- Let X be a grammar symbol (terminal or nonterminal) or ε. Then the set First(X) consists of terminals (and possibly ε) as follows
- If X is a terminal or ε, then First(X) = { X }
- If X is a nonterminal, then First(X) is computed as follows
73. First Sets
- If X is a nonterminal, then for each production choice X → X1 X2 … Xn, First(X) contains First(X1) − { ε }. Also, if for some i < n all the sets First(X1), …, First(Xi) contain ε, then First(X) contains First(Xi+1) − { ε }. If all the sets First(X1), …, First(Xn) contain ε, then First(X) also contains ε
74. First Sets
- We can extend the definition to First(α), where α is any string of terminals and nonterminals
75. First Sets
- It's pretty easy to see how this definition can be interpreted in the absence of ε-productions
- Keep adding First(X1) to First(A) for each nonterminal A and production choice A → X1 …, until no further additions occur
- This process is called computing the transitive closure
76. First Sets
- If the grammar has ε-productions, then the situation is more complicated, because some of the nonterminals may disappear
- Such a nonterminal is called nullable
- One can find the nullable nonterminals using transitive closure and then remove them from First(A)
77. Nullable Nonterminals
- Definition: A nonterminal A is nullable if there is a derivation A ⇒* ε
- Theorem: A nonterminal A is nullable iff First(A) contains ε
78. Example of First(A)
- Given our simple grammar
- exp → exp addop term | term
- addop → + | -
- term → term mulop factor | factor
- mulop → *
- factor → ( exp ) | number
- First(exp) = { (, number }
- First(term) = { (, number }
- First(factor) = { (, number }
- First(addop) = { +, - }
- First(mulop) = { * }
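The transitive-closure computation can be sketched directly from the definition. In this Python sketch a grammar is a dict mapping each nonterminal to a list of right-hand sides (lists of symbols, with "ε" for an empty one); that representation is an assumption of the example.

```python
EPSILON = "ε"

def first_sets(grammar):
    """Compute First for every nonterminal by transitive closure."""
    first = {nt: set() for nt in grammar}

    def first_of(x):
        return first[x] if x in grammar else {x}   # terminal: First(x) = {x}

    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                before = len(first[nt])
                for x in rhs:
                    if x == EPSILON:
                        first[nt].add(EPSILON)
                        break
                    first[nt] |= first_of(x) - {EPSILON}
                    if EPSILON not in first_of(x):
                        break                      # prefix no longer nullable
                else:
                    first[nt].add(EPSILON)         # every symbol was nullable
                if len(first[nt]) != before:
                    changed = True
    return first

GRAMMAR = {
    "exp":    [["exp", "addop", "term"], ["term"]],
    "addop":  [["+"], ["-"]],
    "term":   [["term", "mulop", "factor"], ["factor"]],
    "mulop":  [["*"]],
    "factor": [["(", "exp", ")"], ["number"]],
}
# first_sets(GRAMMAR)["exp"] == {"(", "number"}
```

Running it on the expression grammar reproduces the sets listed above.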
79. Follow Sets
- Given a nonterminal A, the set Follow(A), consisting of terminals (and possibly $), is defined as follows
- If A is the start symbol, then $ is in Follow(A)
- If there is a production B → α A γ, then First(γ) − { ε } is in Follow(A)
- If there is a production B → α A γ such that ε is in First(γ), then Follow(A) contains Follow(B)
80. Follow Sets
- Note that $ functions as a token in the calculation of Follow sets
- Note that ε never is an element of a Follow set
- Follow sets only are defined for nonterminals
- Follow sets only contain terminals (just like First sets)
81. Follow Sets
- Let's again examine the grammar
- exp → exp addop term
- exp → term
- addop → +
- addop → -
- term → term mulop factor
- term → factor
- mulop → *
- factor → ( exp )
- factor → number
82. Follow Sets
- First(exp) = { (, number }
- First(term) = { (, number }
- First(factor) = { (, number }
- First(addop) = { +, - }
- First(mulop) = { * }
- Follow(exp) = { $, +, -, ) }
- Follow(addop) = { (, number }
- Follow(term) = { $, +, -, *, ) }
- Follow(mulop) = { (, number }
- Follow(factor) = { $, +, -, *, ) }
83. Follow Sets for ifStmt
- Consider again the grammar
- stmt → ifStmt
- stmt → other
- ifStmt → if ( exp ) stmt elsePart
- elsePart → else stmt
- elsePart → ε
- exp → 0
- exp → 1
84. Follow Sets for ifStmt
- First(stmt) = { if, other }
- First(ifStmt) = { if }
- First(elsePart) = { else, ε }
- First(exp) = { 0, 1 }
- Follow(stmt) = { $, else }
- Follow(ifStmt) = { $, else }
- Follow(elsePart) = { $, else }
- Follow(exp) = { ) }
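The three Follow rules translate into another closure computation. This Python sketch uses the ifStmt grammar, with the First sets hard-coded from the list above so the example stays self-contained; the dict-based grammar representation is an assumption.

```python
EPSILON = "ε"

GRAMMAR = {
    "stmt":     [["ifStmt"], ["other"]],
    "ifStmt":   [["if", "(", "exp", ")", "stmt", "elsePart"]],
    "elsePart": [["else", "stmt"], [EPSILON]],
    "exp":      [["0"], ["1"]],
}

FIRST = {  # hard-coded from the First sets computed earlier
    "stmt":     {"if", "other"},
    "ifStmt":   {"if"},
    "elsePart": {"else", EPSILON},
    "exp":      {"0", "1"},
}

def first_of(x):
    return FIRST[x] if x in GRAMMAR else {x}

def follow_sets(grammar, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")                    # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for b, productions in grammar.items():
            for rhs in productions:
                for i, a in enumerate(rhs):
                    if a not in grammar:
                        continue              # Follow is only defined for nonterminals
                    before = len(follow[a])
                    nullable_tail = True
                    for x in rhs[i + 1:]:
                        follow[a] |= first_of(x) - {EPSILON}   # rule 2
                        if EPSILON not in first_of(x):
                            nullable_tail = False
                            break
                    if nullable_tail:
                        follow[a] |= follow[b]                 # rule 3
                    if len(follow[a]) != before:
                        changed = True
    return follow
```

Running `follow_sets(GRAMMAR, "stmt")` reproduces the Follow sets listed above.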
85. Constructing the LL(1) Parsing Table M[N, T]
- We now have a better way to go about giving the rules for constructing M[N, T]
- For each production choice A → α and each token a in First(α), add A → α to the entry M[A, a]
- If ε is in First(α), then for each element a of Follow(A) (a token or $), add A → α to M[A, a]
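These two rules can be sketched as a table builder. This Python sketch uses the ifStmt grammar with its First and Follow sets hard-coded from the earlier slides; the dict-of-lists table representation is an assumption. Note how the dangling-else problem surfaces as a multiply-defined entry, showing the grammar is not LL(1).

```python
EPSILON = "ε"

GRAMMAR = {
    "stmt":     [["ifStmt"], ["other"]],
    "ifStmt":   [["if", "(", "exp", ")", "stmt", "elsePart"]],
    "elsePart": [["else", "stmt"], [EPSILON]],
    "exp":      [["0"], ["1"]],
}
FIRST = {"stmt": {"if", "other"}, "ifStmt": {"if"},
         "elsePart": {"else", EPSILON}, "exp": {"0", "1"}}
FOLLOW = {"stmt": {"$", "else"}, "ifStmt": {"$", "else"},
          "elsePart": {"$", "else"}, "exp": {")"}}

def first_of_string(rhs):
    """First(α) for a string α of grammar symbols."""
    result = set()
    for x in rhs:
        fx = FIRST[x] if x in GRAMMAR else {x}
        result |= fx - {EPSILON}
        if EPSILON not in fx:
            return result
    result.add(EPSILON)        # every symbol was nullable
    return result

def build_table(grammar):
    table = {}
    for a, productions in grammar.items():
        for rhs in productions:
            f = first_of_string(rhs)
            targets = f - {EPSILON}
            if EPSILON in f:
                targets |= FOLLOW[a]    # second rule: use Follow(A)
            for tok in targets:
                table.setdefault((a, tok), []).append(rhs)
    return table

M = build_table(GRAMMAR)
# M["elsePart", "else"] holds two choices, so the grammar is not LL(1)
```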
86. LL(k)
- These ideas can be extended to k-token lookahead
- But, the tables get exponentially large
- And, remember, recursive-descent parsers are able to use lookahead selectively, changing the value of k dynamically
- And, recursive-descent can handle grammars that are not LL(k) for any k!
87. Error Recovery in Top-Down Parsers
- How well the parser responds to syntax errors frequently determines the usefulness of a compiler
- A parser must, at least, detect whether a program is syntactically correct
- Such a parser is called a recognizer
88. Error Recovery
- Hopefully a parser will do some amount of error correction (more properly, error repair)
- Most of the time, error repair is limited to cases that are relatively safe to perform
- For example, inserting missing punctuation or deleting extraneous punctuation
89. Error Recovery
- It should be obvious that significant error repair (repair of semantic errors) not only is far beyond the scope of today's compilers; in fact, it is theoretically impossible to accomplish in the general case
- A compiler cannot know a programmer's intent; it only can read what a programmer has written
90. Minimal Distance Correction
- There is a collection of algorithms that can be applied to attempt to repair programs, where the correction is performed within some minimal distance of the detected error
- This distance usually is given in terms of some number of tokens on either side of the error point
91. Minimal Distance Correction
- In practice, even this minimal attempt at error repair usually is not performed by production compilers
- Compiler writers find themselves challenged far more than enough just attempting to generate meaningful error messages
92. General Principles
- Here are some general principles that should be considered
- Parsers should determine that an error has occurred as soon as possible and should indicate the point of error
- After detecting an error, a parser should pick a likely place to continue parsing
- A parser should parse as much code as possible
93. General Principles
- Parsers should attempt to avoid the cascading error problem: one error causing subsequent, usually spurious, errors
- Parsers should not get stuck in an infinite loop, especially while issuing warnings and/or error messages
- Parsers should issue messages with as much accuracy and help as possible
94. Error Recovery in Recursive-Descent Parsers
- One standard form of error recovery in recursive-descent parsers is called panic mode
- In this mode, parsing is suspended and input tokens are consumed until a recovery point is identified
- Parsing resumes at that point
- In the worst case, the entire rest of the program is not parsed
95. Recovery Points
- Identifying likely recovery points is extremely difficult, but some general rules usually work
- Recover after the end of a statement (after a semi-colon)
- Recover after the conclusion of a control structure (like a block)
- Recover after the end of a method, procedure, structure, class, …
96. Recovery Points
- Recovery points are identified using pattern matching; you cannot really use the parser itself, since it's what's causing the problem in the first place
97. Recovery Points
- One way to implement this is to provide each recursive-descent procedure with an additional argument: a collection of synchronizing tokens
- These are used to re-synchronize the parsing process in the event that a syntax error is detected
- Generally, Follow sets are good candidates for synchronizing tokens
98. Recovery Points
- First sets can provide early detection of a syntax error