Top-Down Parsing - PowerPoint PPT Presentation

1
Top-Down Parsing
  • Top-Down Parsing by Recursive-Descent
  • LL(1) Parsing
  • First and Follow Sets
  • Error Recovery in Top-Down Parsers

2
Top-Down Parsing
  • A top-down parsing algorithm parses an input
    string of tokens by tracing out the steps in a
    leftmost derivation
  • This is called top-down because the implied
    traversal of the parse tree is in preorder, and
    thus occurs from the root to the leaves

3
Top-Down Parsing
  • Top-down parsers come in two forms
  • Backtracking parsers
  • Predictive parsers
  • A predictive parser attempts to predict the next
    construction in the input string using one or
    more lookahead tokens
  • A backtracking parser will try different
    possibilities for a parse, backing up an
    arbitrary amount when it finds that it is mistaken

4
Top-Down Parsing
  • Backtracking parsers generally are more powerful
    than predictive ones
  • But, they're also considerably slower; they may
    require exponential time to complete a parse
  • This means that backtracking parsers are
    unsuitable for production-grade compilers

5
Top-Down Parsing
  • We'll study the two most common forms of
    top-down, predictive parsing
  • Recursive-descent parsing
  • LL(1) parsing
  • Recursive-descent parsing is very versatile, easy
    to implement, and is suitable for generating
    a parser by hand

6
Top-Down Parsing
  • LL(1) parsing is no longer used in practice, but
    it serves as a good introduction to the notions
    we'll need later, those involving bottom-up
    rather than top-down parsing

7
LL(1) Parsing
  • The LL(1) parsing method gets its name from
  • First L: process the input from left to right
    (some early parsing techniques processed from
    right to left; this is no longer done today)
  • Second L: use a leftmost derivation for the
    input string
  • The "1" indicates that only one token of
    lookahead is used to predict the direction of the
    parse

8
Lookahead Sets
  • Both recursive-descent and LL(1) parsing
    generally require the computation of sets called
    First and Follow
  • But, a simple top-down parser can be constructed
    without calculating these sets, so we'll
    examine this case first

9
Top-Down Parsing by Recursive-Descent
  • The basic idea of recursive-descent is simplicity
    itself!
  • We view a grammar rule for a nonterminal A as a
    definition for a (recursive) procedure that will
    recognize A
  • The RHS of the rule specifies precisely what must
    be done in order to recognize A

10
Top-Down Parsing by Recursive-Descent
  • In other words, the rules of the grammar are the
    specification of a program for recognizing the
    sentences of the language!

11
Top-Down Parsing by Recursive-Descent
  • For example:
  • Start → able Baker charlie
  • Baker → delta
  • These two productions define:
  • Start() { Match(able); Baker(); Match(charlie) }
  • Baker() { Match(delta) }

12
Top-Down Parsing by Recursive-Descent
  • Assume for a moment that Match() is a primitive
    function that calls the scanner
  • It returns normally if it is successful
  • It throws an exception if it is unsuccessful
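The Start/Baker grammar and the Match primitive just described can be sketched as runnable code. Python is assumed here (the slides use pseudocode); Match becomes a function that raises an exception on failure, exactly as described above:

```python
# Recognizer for the two-rule grammar:
#   Start -> able Baker charlie
#   Baker -> delta
tokens = []   # token stream, set up by parse()
pos = 0       # index of the current token

class ParseError(Exception):
    pass

def match(expected):
    """Consume the current token if it is the expected one, else fail."""
    global pos
    if pos < len(tokens) and tokens[pos] == expected:
        pos += 1
    else:
        raise ParseError(f"expected {expected!r}")

def baker():
    match("delta")

def start():
    match("able")
    baker()
    match("charlie")

def parse(input_tokens):
    global tokens, pos
    tokens, pos = input_tokens, 0
    start()
    return pos == len(tokens)   # accepted only if all input was consumed
```

parse returns True only when the whole input is consumed; any mismatch raises ParseError.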

13
Top-Down Parsing by Recursive-Descent
  • Clearly, to make this approach work, we need to be
    able to handle:
  • Concatenation (done!)
  • Alternation (either-or)
  • Repetition (Kleene * and +)
  • Multiple rules with the same LHS
  • I.e., we need to be able to handle BNF and EBNF
  • Some kind of error recovery would be nice

14
Top-Down Parsing by Recursive-Descent
  • Consider the expression grammar from the previous
    chapter
  • exp → exp addop term | term
  • addop → + | -
  • term → term mulop factor | factor
  • mulop → *
  • factor → ( exp ) | number
  • Consider the rule for factor

15
Top-Down Parsing by Recursive-Descent
  • Here's some pseudocode for factor:
  • procedure factor
  • begin
  •   case token of
  •     ( : match( ( ) ;
  •         exp ;
  •         match( ) ) ;
  •     number : match( number ) ;
  •     else error
  •   end case
  • end factor

16
Top-Down Parsing by Recursive-Descent
  • It is assumed that there is a variable token that
    holds the current (next) token in the input (so
    this example uses one symbol of lookahead)
  • We also assume a match procedure that matches the
    current token against its parameter. It
    advances the input if it succeeds, and declares
    an error if it fails

17
Top-Down Parsing by Recursive-Descent
  • Pseudocode for match:
  • procedure match( expectedToken )
  • begin
  •   if token = expectedToken then
  •     getToken
  •   else
  •     error
  •   end if
  • end match

18
Top-Down Parsing by Recursive-Descent
  • Each reference to a nonterminal on the RHS
    becomes a call to a procedure by that name
  • Each reference to a terminal on the RHS becomes a
    call to match with the terminal as argument
  • So far things are relatively simple and
    straightforward
  • Things are about to change

19
Repetition and Choice EBNF
  • Consider the simplified BNF syntax for an
    if-statement
  • ifStmt → if ( exp ) statement
  •        | if ( exp ) statement else statement

20
Repetition and Choice EBNF
  • This can be translated into:
  • proc ifStmt ()
  • begin
  •   match( if )
  •   match( ( )
  •   exp()
  •   match( ) )
  •   statement()
  •   if token = else then
  •     match( else )
  •     statement()
  •   end if
  • end ifStmt
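The ifStmt procedure above can be sketched as runnable Python. Two simplifying assumptions here: exp is reduced to the single token 0, and the non-if statement to the token other, so that the optional-else test on the lookahead stands alone:

```python
def parse_if(tokens):
    """Recognize: ifStmt -> if ( 0 ) statement [ else statement ]."""
    pos = 0

    def match(expected):
        nonlocal pos
        # errors simply raise AssertionError in this sketch
        assert tokens[pos] == expected, f"expected {expected!r}"
        pos += 1

    def statement():
        if tokens[pos] == "if":
            if_stmt()
        else:
            match("other")

    def if_stmt():
        match("if"); match("("); match("0"); match(")")
        statement()
        # the one-token lookahead decides whether an else part is present
        if pos < len(tokens) and tokens[pos] == "else":
            match("else")
            statement()

    statement()
    return pos == len(tokens)
```

Note that in a nested if, the else is consumed by the innermost call still running, which is exactly the usual "else binds to the nearest if" resolution of the dangling-else problem.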

21
EBNF vs. BNF
  • This procedure demonstrates the fact that we
    cannot distinguish which of the two forms of
    if-statement we have until we encounter (or
    don't) the else
  • It corresponds far more precisely to the EBNF
  • ifStmt → if ( exp ) stmt [ else stmt ]

22
EBNF vs. BNF
  • EBNF notation is designed to mirror the actual
    code that one would produce in a
    recursive-descent parser!
  • So, it's excellent for our purposes

23
EBNF vs. BNF
  • Consider the BNF syntax
  • exp → exp addop term | term
  • In recursive-descent pseudocode you can see that
    you'd wind up with infinite recursion
  • But, if you rephrase this using EBNF
  • exp → term { addop term }
  • there is no difficulty

24
EBNF vs. BNF
  • The resulting pseudocode looks like this:
  • proc exp ()
  • begin
  •   term()
  •   while token = + or token = -
  •     match( token )
  •     term()
  •   end while
  • end exp

25
Extending to Semantics
  • We need to be able to extend the syntax to
    include semantics
  • And, we want to be certain that arithmetic
    operations are left-associative, as expected
  • We'll not handle the syntax portion just now
  • But, we can extend the pseudocode as follows

26
Extending to Semantics
  • function exp() : integer
  • var tmp : integer
  • begin
  •   tmp := term()
  •   while token = + or token = -
  •     case token of
  •       + : match( + ) ; tmp := tmp + term()
  •       - : match( - ) ; tmp := tmp - term()
  •     end case
  •   end while
  •   return tmp
  • end exp
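A runnable version of this evaluator, with term and factor filled in from the expression grammar (Python assumed; tokens are ints and one-character operator strings, and mulop is just *, as in the grammar):

```python
class ParseError(Exception):
    pass

def evaluate(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else "$"

    def match(expected):
        nonlocal pos
        if peek() == expected:
            pos += 1
        else:
            raise ParseError(f"expected {expected!r}")

    def exp():
        tmp = term()
        while peek() in ("+", "-"):        # EBNF repetition { addop term }
            if peek() == "+":
                match("+"); tmp = tmp + term()
            else:
                match("-"); tmp = tmp - term()
        return tmp

    def term():
        tmp = factor()
        while peek() == "*":               # { mulop factor }
            match("*"); tmp = tmp * factor()
        return tmp

    def factor():
        nonlocal pos
        if peek() == "(":
            match("(")
            tmp = exp()
            match(")")
            return tmp
        elif isinstance(peek(), int):      # a number token
            tmp = peek()
            pos += 1
            return tmp
        raise ParseError(f"unexpected {peek()!r}")

    result = exp()
    if peek() != "$":
        raise ParseError("trailing input")
    return result
```

Because tmp is combined with each new term as soon as it is parsed, the operators come out left-associative: 3-4-5 evaluates as (3-4)-5.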

27
Extending to Semantics
  • This method of turning an EBNF grammar into code
    is very powerful
  • One can use it to create complete compilers or
    complete interpreters

28
Extending to Semantics
  • One must be careful to set up a collection of
    conventions regarding keeping token current, what
    match() really does, how getToken() performs,
    etc
  • But, there are no significant challenges or
    obstacles to using this approach
  • Moreover, one can use this approach to create a
    syntax tree

29
Building a Syntax Tree
  • Consider the syntax tree for 3+4+5:
  •       +
  •      / \
  •     +   5
  •    / \
  •   3   4
  • The node representing the sum of 3 and 4 must
    be created before the node representing its sum
    with 5

30
Building a Syntax Tree
  • We could use the following pseudocode
  • function exp () : syntaxTree
  • var tmp, newTmp : syntaxTree
  • begin
  •   tmp := term()
  •   while token = + or token = -
  •     newTmp := makeOpNode( token )
  •     match( token )
  •     leftChild( newTmp ) := tmp
  •     rightChild( newTmp ) := term()
  •     tmp := newTmp
  •   end while
  •   return tmp
  • end exp
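A runnable sketch of this tree-building loop (Python assumed; term is simplified to a single number leaf so that the makeOpNode / leftChild / rightChild pattern stands out):

```python
class Node:
    """Binary syntax-tree node: one piece of data, left and right children."""
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def exp_tree(tokens):
    """Build a left-leaning syntax tree for: number { (+|-) number }."""
    pos = 0

    def term():
        nonlocal pos
        node = Node(tokens[pos])     # a number leaf (simplification)
        pos += 1
        return node

    tmp = term()
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        new_tmp = Node(tokens[pos])          # newTmp := makeOpNode(token)
        pos += 1                             # match(token)
        new_tmp.left = tmp                   # leftChild(newTmp) := tmp
        new_tmp.right = term()               # rightChild(newTmp) := term()
        tmp = new_tmp
    return tmp

tree = exp_tree([3, "+", 4, "+", 5])
# tree.data == "+", tree.left holds the 3+4 subtree, tree.right.data == 5
```

The earlier-created 3+4 node ends up as the left child of the root, matching the picture above.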

31
Building a Syntax Tree
  • We've introduced a new function makeOpNode
    that creates a new node (for an operator)
  • Nodes are assumed to be binary tree nodes, with
    room for one piece of data, a left child, and a
    right child
  • The data can be an operator or a value (so
    there'd likely be a tag to distinguish these cases)

32
Building a Syntax Tree
  • Note that the pseudocode does, indeed, produce a
    syntax tree and not a parse tree
  • The flexibility of the recursive-descent method
    that we've described makes it the method of
    choice for hand-generated parsers (compilers,
    interpreters)

33
Some Problems (1)
  • First, it may be difficult to translate a BNF
    grammar into an equivalent EBNF grammar
  • You must be certain that the original and
    final grammars do, indeed, describe the identical
    languages

34
Some Problems (2)
  • What if you have a production like
  • A → α | β
  • where α and β both begin with non-terminals?
  • How can you tell which production is the right
    one to use?
  • The answer to this question requires the
    computation of the First sets of α and β: the
    sets of tokens that can legally begin each string

35
Some Problems (3)
  • What happens if we have an ε-production?
  • In this case it may be necessary to know what
    tokens can legally come after a nonterminal
  • This requires the computation of the Follow
    set of the nonterminal

36
Some Problems (4)
  • What about error detection?
  • We want to detect incorrect syntax as early as
    possible
  • We'd like to be able to recover from an error and
    continue to parse
  • Further, we may want to attempt to correct an
    error if it's possible to do so

37
Basic LL(1) Parsing
  • LL(1) parsing uses an explicit stack rather than
    recursive calls to perform a parse
  • It's helpful to visualize this stack in a
    standard way so that the actions of the LL(1)
    parser can be seen and discussed

38
Basic LL(1) Parsing
  • We'll use this very simple grammar to illustrate
    things:
  • S → ( S ) S | ε
  • This grammar produces strings of balanced
    parentheses
  • L(S) = { ε, (), ()(), (()), ... }

39
Basic LL(1) Parsing
  • Input: ( )
  • $ marks the bottom of the stack and the end (EOF)
    of the input

40
Basic LL(1) Parsing
  • The general pattern is:
  • We start with
  • $ StartSymbol        InputString $
  • ...
  • $                    $
  • Accept!
  • A top-down parser parses by replacing a
    nonterminal at the top of the stack by one of
    the choices provided by the grammar rules

41
Basic LL(1) Parsing
  • It selects the correct rule by examining the next
    input symbol (the top of the input string stack)
  • There are two actions
  • Replace a nonterminal A at the top of the stack
    by a string, using a grammar rule
  • Match a token on the top of the stack with the
    next input token, and remove them both
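These two actions (replace and match) can be sketched as a small stack machine for the balanced-parentheses grammar S → ( S ) S | ε from above (Python assumed; the parsing table here is written out by hand):

```python
def ll1_parse(input_str):
    """Table-driven LL(1) parse for: S -> ( S ) S | ε."""
    table = {("S", "("): ["(", "S", ")", "S"],   # S -> ( S ) S
             ("S", ")"): [],                     # S -> ε
             ("S", "$"): []}                     # S -> ε
    stack = ["$", "S"]                 # "$" marks the bottom of the stack
    tokens = list(input_str) + ["$"]   # "$" also marks end of input
    pos = 0
    while stack:
        top, look = stack.pop(), tokens[pos]
        if top == look:                          # match action
            pos += 1
        elif (top, look) in table:               # replace action
            stack.extend(reversed(table[(top, look)]))
        else:
            return False                         # empty table entry: error
    return pos == len(tokens)
```

Pushing the right-hand side in reverse keeps its leftmost symbol on top of the stack, which is what makes the parse trace out a leftmost derivation.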

42
Basic LL(1) Parsing
  • If we want to construct a parse tree as the parse
    proceeds, we can add node construction actions as
    each nonterminal or terminal is pushed onto the
    stack
  • If we want, we can construct a syntax tree
    instead of a parse tree

43
LL(1) Parsing Table Algorithm
  • Using this parsing method, when a nonterminal A
    is at the top of the parsing stack, a decision
    must be made, based on the current input token
    (the lookahead token), about which grammar rule
    choice for A to use when replacing A on the stack

44
LL(1) Parsing Table Algorithm
  • If a (terminal) token is at the top of the stack,
    no decision is necessary: either it is
    identical to the input token and a match occurs,
    or it isn't identical and an error occurs
    (because the input is incorrect)

45
LL(1) Parsing Table Algorithm
  • We can express these two choices in a tabular
    form by constructing an LL(1) parsing table
  • This table is a 2-D array indexed by nonterminals
    and terminals, and contains production choices to
    use at the appropriate parsing step
  • We'll call this table M[N, T]

46
LL(1) Parsing Table Algorithm
  • N is the set of nonterminals
  • T is the set of terminals (tokens)
  • M is a table of moves or actions to take in
    order to perform a parse
  • We'll construct the entries for M in a moment
  • Any entries that remain empty constitute error
    conditions (i.e., indications of bad input)

47
Constructing M[N, T]
  • We add entries to M as follows:
  • If A → α is a production choice, and there is a
    derivation α ⇒* a β, where a is a token, then
    add A → α to the table at location M[A, a]
  • If A → α is a production choice, and there are
    derivations α ⇒* ε and S $ ⇒* β A a γ,
    where S is the start symbol and a is a token
    (or $), then add A → α to the table at location
    M[A, a]

48
Constructing M[N, T]
  • The ideas behind these rules:
  • Given a token a in the input, we wish to select a
    rule A → α if α can produce an a
  • If A derives the empty string (via A → α ⇒* ε),
    and if a is a token that can legally come after A in
    a derivation, then we want to select A → α to
    make A disappear

49
Constructing M[N, T]
  • These rules are a bit difficult to carry out by
    hand
  • But, they're simplified by the construction of
    the First and Follow sets that we mentioned
    earlier (but have yet to really define)

50
Definition
  • An LL(1) grammar is one for which the associated
    LL(1) parsing table has at most one production in
    each table entry
  • Note that such a grammar is unambiguous

51
Example M[N, T] for ifStmt
52
Parsing an ifStmt using M[N, T]
  • Let's watch the parsing process proceed using the
    string
  • if ( 0 ) if ( 1 ) other else other
  • We'll use some abbreviations:
  • statement → S
  • ifStmt → I
  • elsePart → L
  • exp → E
  • if → i
  • else → e
  • other → o

53
Parsing an ifStmt using M[N, T]
54
Left Recursion and Left Factoring
  • Repetition and choice in LL(1) parsing suffer
    from similar problems to those occurring in
    recursive-descent parsing
  • We solved these problems for recursive-descent
    parsing by moving to EBNF notation
  • We can't use the same technique here; we must
    rewrite the grammar within BNF

55
Left Recursion and Left Factoring
  • The two standard techniques for solving these
    problems are
  • Left recursion removal
  • Left factoring
  • Note that there is no guarantee that using these
    techniques will result in an LL(1) grammar!
  • (Similarly, there was no guarantee about using
    EBNF to solve the problems)

56
Left Recursion and Left Factoring
  • But, in practice, these two techniques are very
    useful, because they're very often successful
  • And, they can be automated

57
Left Recursion Removal
  • Why is there a problem?
  • Because left recursion often is used to make
    operations left associative
  • For example
  • exp → exp addop term | term
  • and
  • exp → exp + term
  •     | exp - term
  •     | term

58
Left Recursion Removal
  • These are both examples of direct left recursion
    (or, immediate left recursion)
  • A more difficult case occurs when one has
    indirect left recursion
  • A → B c
  • B → A d

59
Removing Immediate Left Recursion
  • In the case of immediate left recursion we have
  • A → A α | β
  • where α and β are strings of terminals and
    nonterminals, and β does not begin with A
  • We rewrite this as a pair of rules:
  • A → β A'
  • A' → α A' | ε

60
Removing Immediate Left Recursion
  • For example:
  • exp → exp addop term | term
  • becomes
  • exp → term exp'
  • exp' → addop term exp' | ε

61
Left Recursion Removal
  • The text describes a more general algorithm which
    will handle grammars having no ε-productions and
    no cycles
  • In practice, no grammars for programming
    languages have cycles
  • But, they may well have ε-productions
  • (Usually, the ε-productions occur in restricted
    cases which can be dealt with)

62
Left Recursion Removal
  • Left recursion removal does not change the
    language being recognized, but since the grammar
    is changed, the resulting parse trees also are
    changed
  • This may cause complications for the parser
    designer and for the resulting compiler

63
Left Recursion Removal
  • In particular, since the new grammar is right
    recursive and no longer expresses left
    associativity directly, creating a corresponding
    left-associative parse tree becomes somewhat of a
    challenge

64
Left Recursion Removal
  • The challenge is met by passing information from
    one portion of the parser to another using
    parameters
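A sketch of that parameter-passing idea (Python assumed; term is reduced to a single number so the exp′ structure stands out). The value computed so far is handed to the procedure for exp′, which combines it with the next term before recursing, so subtraction stays left-associative:

```python
def parse_exp(tokens):
    """Evaluate: exp -> term exp' ; exp' -> addop term exp' | ε."""
    pos = 0

    def number():                        # stand-in for term (simplification)
        nonlocal pos
        value = tokens[pos]
        pos += 1
        return value

    def exp():
        return exp_prime(number())       # exp -> term exp'

    def exp_prime(left):                 # exp' -> addop term exp' | ε
        nonlocal pos
        if pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            pos += 1
            right = number()
            # combine *before* recursing, so 8-3-2 means (8-3)-2
            return exp_prime(left - right if op == "-" else left + right)
        return left                      # the ε case: pass the value back up

    return exp()
```

Without the left parameter, a naive right-recursive evaluator would compute 8-(3-2) instead.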

65
Left Factoring
  • Left factoring is required when two or more
    productions share a common prefix string
  • For example: A → α β | α γ
  • Here's a concrete example:
  • stmtSeq → stmt ; stmtSeq | stmt
  • stmt → s

66
Left Factoring
  • Another concrete example
  • ifStmt → if ( exp ) stmt
  •        | if ( exp ) stmt else stmt
  • An LL(1) parser cannot distinguish between the
    alternatives
  • So, a simple alternative is to factor the α
    out on the left and to rewrite the rule as two
    rules

67
Left Factoring
  • So, A → α β | α γ becomes
  • A → α A'
  • A' → β | γ
  • If we allowed parentheses in BNF, we could
    rewrite this as
  • A → α ( β | γ )
  • That's exactly how algebraic factoring appears in
    arithmetic

68
Left Factoring
  • Consider our ifStmt example:
  • ifStmt → if ( exp ) stmt
  •        | if ( exp ) stmt else stmt
  • The left-factored form is:
  • ifStmt → if ( exp ) stmt elsePart
  • elsePart → else stmt | ε

69
LL(1) Problem
  • Here's a typical case where a grammar for a
    programming language fails to be LL(1), because
    both assignments and procedure calls begin with
    an identifier:
  • stmt → assignStmt | callStmt | other
  • assignStmt → identifier := exp
  • callStmt → identifier ( expList )

70
LL(1) Problem
  • The grammar is not LL(1) because identifier is
    shared as the first token of both assignStmt and
    callStmt, and thus could be the lookahead token
    for either
  • Worse: the grammar is not in a form that can be
    left factored
  • The text shows a solution, but it's an ugly one

71
First and Follow Sets
  • In order to complete the discussion of LL(1)
    parsing we must develop an algorithm that
    constructs the LL(1) parsing table
  • This involves (finally) computing the First
    and Follow sets

72
First Sets
  • Let X be a grammar symbol (a terminal or
    nonterminal) or ε. Then the set First(X)
    consists of terminals (and possibly ε), defined as
    follows:
  • If X is a terminal or ε, then First(X) = { X }
  • If X is a nonterminal:

73
First Sets
  • If X is a nonterminal, then for each production
    choice X → X1 X2 ... Xn, First(X) contains
    First(X1) − {ε}. Also, if for some i < n
    all the sets First(X1), ..., First(Xi) contain ε,
    then First(X) contains First(Xi+1) − {ε}. If
    all the sets First(X1), ..., First(Xn) contain ε,
    then First(X) also contains ε

74
First Sets
  • We can extend the definition to First(α), where α
    is any string of terminals and nonterminals

75
First Sets
  • It's pretty easy to see how this definition can
    be interpreted in the absence of ε-productions:
  • Keep adding First(X1) to First(A) for each
    nonterminal A and production A → X1 ... until no
    further additions occur
  • This process is called computing the transitive
    closure

76
First Sets
  • If the grammar has ε-productions, then the
    situation is more complicated, because some of
    the nonterminals may disappear
  • Such a nonterminal is called nullable
  • One can find the nullable nonterminals using
    transitive closure and then account for them when
    computing First(A)

77
Nullable Nonterminals
  • Definition: A nonterminal A is nullable if there
    is a derivation A ⇒* ε
  • Theorem: A nonterminal A is nullable iff
    First(A) contains ε

78
Example of First(A)
  • Given our simple grammar:
  • exp → exp addop term | term
  • addop → + | -
  • term → term mulop factor | factor
  • mulop → *
  • factor → ( exp ) | number
  • First(exp) = { (, number }
  • First(term) = { (, number }
  • First(factor) = { (, number }
  • First(addop) = { +, - }
  • First(mulop) = { * }
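The transitive-closure computation of First sets can be sketched as a fixpoint loop (Python assumed; the grammar encoding and the use of the string "ε" for the empty string are conventions chosen here):

```python
# Grammar encoded as: nonterminal -> list of right-hand sides
GRAMMAR = {
    "exp":    [["exp", "addop", "term"], ["term"]],
    "addop":  [["+"], ["-"]],
    "term":   [["term", "mulop", "factor"], ["factor"]],
    "mulop":  [["*"]],
    "factor": [["(", "exp", ")"], ["number"]],
}

def first_sets(grammar):
    first = {A: set() for A in grammar}

    def first_of(symbol):
        if symbol not in grammar:            # a terminal (or ε)
            return {symbol}
        return first[symbol]

    changed = True
    while changed:                           # iterate to a fixpoint
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                for X in rhs:
                    before = len(first[A])
                    first[A] |= first_of(X) - {"ε"}
                    changed |= len(first[A]) != before
                    if "ε" not in first_of(X):
                        break                # X is not nullable: stop here
                else:                        # every symbol in rhs was nullable
                    if "ε" not in first[A]:
                        first[A].add("ε")
                        changed = True
    return first
```

On this grammar the loop reproduces the First sets listed above.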

79
Follow Sets
  • Given a nonterminal A, the set Follow(A),
    consisting of terminals (and possibly $), is
    defined as follows:
  • If A is the start symbol, then $ is in Follow(A)
  • If there is a production B → α A γ, then First(γ)
    − {ε} is in Follow(A)
  • If there is a production B → α A γ such that ε is
    in First(γ), then Follow(A) contains Follow(B)

80
Follow Sets
  • Note that $ functions as a token in the
    calculation of Follow sets
  • Note that ε never is an element of a Follow
    set
  • Follow sets only are defined for nonterminals
  • Follow sets only contain terminals and $ (just as
    First sets only contain terminals and ε)

81
Follow Sets
  • Let's again examine the grammar:
  • exp → exp addop term
  • exp → term
  • addop → +
  • addop → -
  • term → term mulop factor
  • term → factor
  • mulop → *
  • factor → ( exp )
  • factor → number

82
Follow Sets
  • First(exp) = { (, number }
  • First(term) = { (, number }
  • First(factor) = { (, number }
  • First(addop) = { +, - }
  • First(mulop) = { * }
  • Follow(exp) = { $, +, -, ) }
  • Follow(addop) = { (, number }
  • Follow(term) = { $, +, -, *, ) }
  • Follow(mulop) = { (, number }
  • Follow(factor) = { $, +, -, *, ) }
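The three Follow rules can be sketched the same way (Python assumed; First is taken as given from the slide rather than recomputed, and since no symbol in this grammar is nullable, only the symbol immediately after each occurrence of A matters):

```python
FIRST = {
    "exp": {"(", "number"}, "term": {"(", "number"},
    "factor": {"(", "number"}, "addop": {"+", "-"}, "mulop": {"*"},
}
PRODUCTIONS = [
    ("exp", ["exp", "addop", "term"]), ("exp", ["term"]),
    ("addop", ["+"]), ("addop", ["-"]),
    ("term", ["term", "mulop", "factor"]), ("term", ["factor"]),
    ("mulop", ["*"]),
    ("factor", ["(", "exp", ")"]), ("factor", ["number"]),
]

def follow_sets(productions, first, start):
    nonterminals = {A for A, _ in productions}
    follow = {A: set() for A in nonterminals}
    follow[start].add("$")               # rule 1: $ follows the start symbol
    changed = True
    while changed:                       # iterate to a fixpoint
        changed = False
        for B, rhs in productions:
            for i, A in enumerate(rhs):
                if A not in nonterminals:
                    continue
                rest = rhs[i + 1:]
                if rest:                 # rule 2: First of what follows A
                    nxt = rest[0]        # (no nullables in this grammar)
                    add = first.get(nxt, {nxt}) - {"ε"}
                else:                    # rule 3: A is rightmost in B's rule
                    add = follow[B]
                before = len(follow[A])
                follow[A] |= add
                changed |= len(follow[A]) != before
    return follow
```

The fixpoint iteration is what propagates Follow(exp) into Follow(term) and then into Follow(factor).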

83
Follow Sets for ifStmt
  • Consider again the grammar:
  • stmt → ifStmt
  • stmt → other
  • ifStmt → if ( exp ) stmt elsePart
  • elsePart → else stmt
  • elsePart → ε
  • exp → 0
  • exp → 1

84
Follow Sets for ifStmt
  • First(stmt) = { if, other }
  • First(ifStmt) = { if }
  • First(elsePart) = { else, ε }
  • First(exp) = { 0, 1 }
  • Follow(stmt) = { $, else }
  • Follow(ifStmt) = { $, else }
  • Follow(elsePart) = { $, else }
  • Follow(exp) = { ) }

85
Constructing the LL(1) Parsing Table M[N, T]
  • We now have a better way to go about giving the
    rules for constructing M[N, T]. For each
    production choice A → α:
  • For each token a in First(α), add A → α to
    the entry M[A, a]
  • If ε is in First(α), then for each element a of
    Follow(A) (a token or $), add A → α to M[A, a]
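These two table-construction rules can be sketched for the balanced-parentheses grammar S → ( S ) S | ε from earlier (Python assumed; First of each right-hand side and Follow(S) are written out by hand):

```python
PAREN_PRODUCTIONS = [("S", ["(", "S", ")", "S"]), ("S", ["ε"])]
PAREN_FIRST_ALPHA = {0: {"("}, 1: {"ε"}}   # First of each RHS, precomputed
PAREN_FOLLOW = {"S": {")", "$"}}

def build_table(productions, first_alpha, follow):
    table = {}
    for i, (A, alpha) in enumerate(productions):
        for a in first_alpha[i] - {"ε"}:   # rule 1: a is in First(α)
            table.setdefault((A, a), []).append(alpha)
        if "ε" in first_alpha[i]:          # rule 2: ε is in First(α)
            for a in follow[A]:            # use Follow(A) instead
                table.setdefault((A, a), []).append(alpha)
    return table

table = build_table(PAREN_PRODUCTIONS, PAREN_FIRST_ALPHA, PAREN_FOLLOW)
# the grammar is LL(1) iff every entry holds exactly one production
assert all(len(v) == 1 for v in table.values())
```

An entry that ends up holding more than one production is exactly the "at most one production per entry" LL(1) condition failing.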

86
LL(k)
  • These ideas can be extended to
    k-token lookahead
  • But, the tables get exponentially large
  • And, remember, recursive-descent parsers are able
    to use lookahead selectively, changing the value
    of k dynamically
  • And, recursive-descent can handle grammars that
    are not LL(k) for any k!

87
Error Recovery in Top-Down Parsers
  • How well the parser responds to syntax errors
    frequently determines the usefulness of a
    compiler
  • A parser must, at least, detect whether a program
    is syntactically correct
  • Such a parser is called a recognizer

88
Error Recovery
  • Hopefully a parser will do some amount of error
    correction (more properly, error repair)
  • Most of the time, error repair is limited to
    cases that are relatively safe to perform
  • For example, inserting missing punctuation or
    deleting extraneous punctuation

89
Error Recovery
  • It should be obvious that significant error
    repair (repair of semantic errors) is not only
    far beyond the scope of today's compilers; in
    fact, it is theoretically impossible to
    accomplish in the general case
  • A compiler cannot know a programmer's intent; it
    can only read what a programmer has written

90
Minimal Distance Correction
  • There is a collection of algorithms that can be
    applied to attempt to repair programs, where the
    correction is performed within some minimal
    distance of the detected error
  • This distance usually is given in terms of some
    number of tokens on either side of the error
    point

91
Minimal Distance Correction
  • In practice, even this minimal attempt at error
    repair usually is not performed by production
    compilers
  • Compiler writers find themselves challenged
    quite enough just attempting to generate
    meaningful error messages

92
General Principles
  • Here are some general principles that should be
    considered
  • Parsers should determine that an error has
    occurred as soon as possible and should indicate
    the point of error
  • After detecting an error, a parser should pick a
    likely place to continue parsing
  • A parser should parse as much code as possible

93
General Principles
  • Parsers should attempt to avoid the cascading
    error problem: one error causing subsequent,
    usually spurious, errors
  • Parsers should not loop forever, especially
    while issuing warnings and/or error messages
  • Parsers should issue messages with as much
    accuracy and help as possible

94
Error Recovery in Recursive-Descent Parsers
  • One standard form of error recovery in
    recursive-descent parsers is called panic mode
  • In this mode, parsing is suspended and input
    (tokens) are consumed until a recovery point is
    identified
  • Parsing resumes at that point
  • In the worst case, the entire rest of the program
    is not parsed

95
Recovery Points
  • Identifying likely recovery points is
    extremely difficult, but some general rules
    usually work:
  • Recover after the end of a statement (after a
    semi-colon)
  • Recover after the conclusion of a control
    structure (like a block)
  • Recover after the end of a method, procedure,
    structure, class, ...

96
Recovery Points
  • Recovery points are identified using pattern
    matching; you cannot really use the parser
    itself, since it's what's causing the problem in
    the first place

97
Recovery Points
  • One way to implement this is to provide each
    recursive-descent procedure with another
    argument: a collection of synchronizing tokens
  • They are used to re-synchronize the parsing
    process in the event that a syntax error is
    detected
  • Generally, Follow sets are good candidates for
    synchronizing tokens
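A panic-mode sketch of this idea (Python assumed; a toy statement-list grammar in which id is the only kind of statement, and ";" — which is in Follow(stmt) — serves as the synchronizing token):

```python
def parse_stmt_list(tokens):
    """Recognize { stmt ; } where stmt is just 'id'; collect errors."""
    pos = 0
    errors = []

    def skip_to(sync):
        nonlocal pos
        while pos < len(tokens) and tokens[pos] not in sync:
            pos += 1               # consume input until a recovery point

    def stmt(sync):
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == "id":
            pos += 1
        else:
            errors.append(f"unexpected {tokens[pos]!r} at {pos}")
            skip_to(sync)          # panic mode: re-synchronize, don't abort

    while pos < len(tokens):
        stmt({";", "$"})           # synchronizing tokens passed as an argument
        if pos < len(tokens) and tokens[pos] == ";":
            pos += 1
        else:
            break
    return errors
```

One bad statement produces one error message, and parsing resumes at the next semicolon instead of abandoning the rest of the input, which also avoids cascading errors.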

98
Recovery Points
  • First sets can provide early detection of a
    syntax error