Parsing - PowerPoint PPT Presentation

Provided by: samk153

Transcript and Presenter's Notes

Title: Parsing


1
Parsing
Giuseppe Attardi Università di Pisa
2
Parsing
  • Calculate grammatical structure of program, like
    diagramming sentences, where
  • Tokens = words
  • Programs = sentences

For further information: Aho, Sethi, Ullman,
Compilers: Principles, Techniques, and Tools
(a.k.a. the Dragon Book)
3
Outline of coverage
  • Context-free grammars
  • Parsing
  • Tabular Parsing Methods
  • One pass
  • Top-down
  • Bottom-up
  • Yacc

4
Parser extracts grammatical structure of program
[Figure: parse tree of a "hello, world" program. Node labels: function-def, name (main), arguments, stmt-list, stmt, expression, operator (<<), variable (cout), string ("hello, world\n").]
5
Context-free languages
  • Grammatical structure defined by context-free
    grammar
  • statement → labeled-statement | expression-statement | compound-statement
  • labeled-statement → ident : statement | case constant-expression : statement
  • compound-statement → { declaration-list statement-list }

"Context-free" = only one non-terminal in
left-part
[Legend in the original slide: terminal, non-terminal]
6
Parse trees
  • Parse tree = tree labeled with grammar symbols,
    such that
  • If node is labeled A, and its children are
    labeled x1...xn, then there is a production
    A → x1...xn
  • "Parse tree from A" = root labeled with A
  • "Complete parse tree" = all leaves labeled with
    tokens

7
Parse trees and sentences
  • Frontier of tree = labels on leaves (in
    left-to-right order)
  • Frontier of tree from S is a sentential form
  • Frontier of a complete tree from S is a sentence

8
Example
  • G: L → L E | E,  E → a | b
  • Syntax trees from start symbol (L)

[Figure: syntax trees from L; their frontiers are sentential forms.]
9
Derivations
  • Alternate definition of sentence
  • Given α, β in V*, say α ⇒ β is a derivation step if
    α = α′Aα″ and β = α′γα″, where A → γ is a
    production
  • α is a sentential form iff there exists a
    derivation (sequence of derivation steps)
    S ⇒ … ⇒ α (alternatively, we say that S ⇒* α)

The two definitions are equivalent, but note that
there are many derivations corresponding to each
parse tree
10
Another example
  • H: L → E L | E,  E → a | b

[Figure: parse trees from L under grammar H, built from L and E nodes with leaves a and b.]
11
Ambiguity
  • For some purposes, it is important to know
    whether a sentence can have more than one parse
    tree
  • A grammar is ambiguous if there is a sentence
    with more than one parse tree
  • Example: E → E + E | E * E | id

12
Notes
  • If e then if b then d else f
  • int x y 0
  • A.b.c d
  • Id → s | s.id
  • E ⇒ E + T ⇒ E + T + T ⇒ T + T + T ⇒ id + T + T
    ⇒ id + T * id + T ⇒ id + id * id + T ⇒
    id + id * id + id

13
Ambiguity
  • Ambiguity is a function of the grammar rather
    than the language
  • Certain ambiguous grammars may have equivalent
    unambiguous ones

14
Grammar Transformations
  • Grammars can be transformed without affecting the
    language generated
  • Three transformations are discussed next
  • Eliminating Ambiguity
  • Eliminating Left Recursion (i.e. productions of
    the form A → A α)
  • Left Factoring

15
Eliminating Ambiguity
  • Sometimes an ambiguous grammar can be rewritten
    to eliminate ambiguity
  • For example, expressions involving additions and
    products can be written as follows
  • E → E + T | T
  • T → T * id | id
  • The language generated by this grammar is the
    same as that generated by the grammar in slide
    "Ambiguity". Both generate id(+id | *id)*
  • However, this grammar is not ambiguous

16
Eliminating Ambiguity (Cont.)
  • One advantage of this grammar is that it
    represents the precedence between operators. In
    the parsing tree, products appear nested within
    additions

17
Eliminating Ambiguity (Cont.)
  • An example of ambiguity in a programming language
    is the dangling else
  • Consider
  • S → if b then S else S | if b then S | a

18
Eliminating Ambiguity (Cont.)
  • When there are two nested ifs and only one else,
    the else can be attached to either if

19
Eliminating Ambiguity (Cont.)
  • In most languages (including C and Java), each
    else is assumed to belong to the nearest if that
    is not already matched by an else. This
    association is expressed in the following
    (unambiguous) grammar
  • S → Matched | Unmatched
  • Matched → if b then Matched else Matched
              | a
  • Unmatched → if b then S
                | if b then Matched else Unmatched

20
Eliminating Ambiguity (Cont.)
  • Ambiguity is a property of the grammar
  • It is undecidable whether a context free grammar
    is ambiguous
  • The proof is by reduction from Post's
    correspondence problem
  • Although there is no general algorithm, it is
    possible to isolate certain constructs in
    productions which lead to ambiguous grammars

21
Eliminating Ambiguity (Cont.)
  • For example, a grammar containing the production
    A → AA | α would be ambiguous, because the
    substring ααα has two parses

[Figure: the two parse trees of ααα, one grouping the first two α's under a single A and one grouping the last two.]
  • This ambiguity disappears if we use the
    productions
  • A → AB | B and B → α
  • or the productions
  • A → BA | B and B → α.
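
The growth of this ambiguity can be checked by counting parse trees. The following is a minimal sketch, assuming α is the single terminal a, so the grammar is A → A A | a; the class name ParseTreeCount and the method trees are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Counts parse trees of the string a^n under the ambiguous grammar A -> A A | a.
class ParseTreeCount {
    private static final Map<Integer, Long> memo = new HashMap<>();

    // trees(n) = number of distinct parse trees whose frontier is n copies of 'a'.
    static long trees(int n) {
        if (n == 1) return 1;              // only A -> a applies
        Long cached = memo.get(n);
        if (cached != null) return cached;
        long total = 0;
        // A -> A A: the left A derives the first k symbols, the right A the rest.
        for (int k = 1; k < n; k++) {
            total += trees(k) * trees(n - k);
        }
        memo.put(n, total);
        return total;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.println("a^" + n + ": " + trees(n) + " parse tree(s)");
        }
    }
}
```

For n = 3 the count is 2, matching the two parses mentioned on the slide; with either of the replacement grammars every string has exactly one tree.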

22
Eliminating Ambiguity (Cont.)
  • Examples of ambiguous productions
  • A → AαA
  • A → αA | Aβ
  • A → αA | αAβA
  • A CF language is inherently ambiguous if it has
    no unambiguous CFG
  • An example of such a language is
  • L = { a^i b^j c^m | i = j or j = m }, which can be generated
    by the grammar
  • S → AB | DC
  • A → aA | ε      C → cC | ε
  • B → bBc | ε     D → aDb | ε

23
Elimination of Left Recursion
  • A grammar is left recursive if it has a
    nonterminal A and a derivation A ⇒+ Aα for some
    string α.
  • Top-down parsing methods cannot handle
    left-recursive grammars, so a transformation to
    eliminate left recursion is needed
  • Immediate left recursion (productions of the form
    A → Aα) can be easily eliminated
  • 1. Group the A-productions as
  • A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
  • where no βi begins with A
  • 2. Replace the A-productions by
  • A → β1A′ | β2A′ | … | βnA′
  • A′ → α1A′ | α2A′ | … | αmA′ | ε
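
The two steps above can be sketched in code. This is an illustrative sketch, not a definitive implementation: the class name LeftRecursion, the list-of-symbols representation, and the fresh nonterminal name formed by appending an apostrophe are all assumptions, and it assumes every α is non-empty.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LeftRecursion {
    // Rewrites the productions of nonterminal a. Each alternative is a list of
    // symbols; the empty list stands for epsilon.
    static Map<String, List<List<String>>> eliminate(String a, List<List<String>> alts) {
        String fresh = a + "'";                        // hypothetical name for A'
        List<List<String>> alphas = new ArrayList<>(); // tails of A -> A alpha
        List<List<String>> betas = new ArrayList<>();  // alternatives not starting with A
        for (List<String> alt : alts) {
            if (!alt.isEmpty() && alt.get(0).equals(a)) {
                alphas.add(alt.subList(1, alt.size()));
            } else {
                betas.add(alt);
            }
        }
        Map<String, List<List<String>>> result = new LinkedHashMap<>();
        if (alphas.isEmpty()) {                        // no immediate left recursion
            result.put(a, alts);
            return result;
        }
        List<List<String>> newA = new ArrayList<>();
        for (List<String> beta : betas) {              // A -> beta A'
            List<String> rhs = new ArrayList<>(beta);
            rhs.add(fresh);
            newA.add(rhs);
        }
        List<List<String>> newFresh = new ArrayList<>();
        for (List<String> alpha : alphas) {            // A' -> alpha A'
            List<String> rhs = new ArrayList<>(alpha);
            rhs.add(fresh);
            newFresh.add(rhs);
        }
        newFresh.add(new ArrayList<>());               // A' -> epsilon
        result.put(a, newA);
        result.put(fresh, newFresh);
        return result;
    }

    public static void main(String[] args) {
        // E -> E + T | T
        List<List<String>> e = List.of(List.of("E", "+", "T"), List.of("T"));
        System.out.println(eliminate("E", e));
    }
}
```

Applied to E → E + T | T, the sketch yields E → T E′ and E′ → + T E′ | ε, with the empty list printed for ε.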

24
Elimination of Left Recursion (Cont.)
  • The previous transformation, however, does not
    eliminate left recursion involving two or more
    steps
  • For example, consider the grammar
  • S → Aa | b
  • A → Ac | Sd | ε
  • S is left-recursive because S ⇒ Aa ⇒ Sda, but it
    is not immediately left recursive

25
Elimination of Left Recursion (Cont.)
  • Algorithm. Eliminate left recursion
  • Arrange nonterminals in some order A1, A2, …, An
  • for i = 1 to n {
  •   for j = 1 to i − 1 {
  •     replace each production of the form Ai → Aj γ
  •     by the productions Ai → δ1 γ | δ2 γ | … | δk γ
  •     where Aj → δ1 | δ2 | … | δk are all the
        current Aj-productions
  •   }
  •   eliminate the immediate left recursion among the
      Ai-productions
  • }

26
Elimination of Left Recursion (Cont.)
  • To show that the previous algorithm actually
    works, notice that iteration i only changes
    productions with Ai on the left-hand side, and
    that afterwards m > i in all productions of the
    form Ai → Am α
  • Induction proof
  • Clearly true for i = 1
  • If it is true for all i < k, then when the outer
    loop is executed for i = k, the inner loop will
    remove all productions Ai → Am α with m < i
  • Finally, with the elimination of self recursion,
    m in the Ai → Am α productions is forced to be > i
  • At the end of the algorithm, all derivations of
    the form Ai ⇒+ Am α will have m > i and therefore
    left recursion would not be possible

27
Left Factoring
  • Left factoring helps transform a grammar for
    predictive parsing
  • For example, if we have the two productions
  • S → if b then S else S
  •   | if b then S
  • on seeing the input token if, we cannot
    immediately tell which production to choose to
    expand S
  • In general, if we have A → αβ1 | αβ2 and the
    input begins with a string derived from α, we do
    not know (without looking further) which
    production to use to expand A

28
Left Factoring (Cont.)
  • However, we may defer the decision by expanding A
    to αA′
  • Then after seeing the input derived from α, we
    may expand A′ to β1 or to β2
  • Left-factored, the original productions become
  • A → α A′
  • A′ → β1 | β2
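
Left factoring can be sketched as a small transformation. The code below is a simplified sketch that only handles the case where all alternatives of A share one common prefix α; the class name LeftFactoring, the list representation, and the apostrophe-suffixed fresh name are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LeftFactoring {
    // Longest prefix of symbols shared by every alternative.
    static List<String> commonPrefix(List<List<String>> alts) {
        List<String> prefix = new ArrayList<>(alts.get(0));
        for (List<String> alt : alts) {
            int i = 0;
            while (i < prefix.size() && i < alt.size() && prefix.get(i).equals(alt.get(i))) {
                i++;
            }
            prefix = new ArrayList<>(prefix.subList(0, i));
        }
        return prefix;
    }

    // Factors A -> alpha beta1 | alpha beta2 | ... into
    // A -> alpha A' and A' -> beta1 | beta2 | ... (an empty list stands for epsilon).
    static Map<String, List<List<String>>> factor(String a, List<List<String>> alts) {
        Map<String, List<List<String>>> result = new LinkedHashMap<>();
        List<String> alpha = commonPrefix(alts);
        if (alpha.isEmpty() || alts.size() < 2) {      // nothing to factor
            result.put(a, alts);
            return result;
        }
        String fresh = a + "'";                        // hypothetical name for A'
        List<String> head = new ArrayList<>(alpha);
        head.add(fresh);
        List<List<String>> tails = new ArrayList<>();
        for (List<String> alt : alts) {
            tails.add(new ArrayList<>(alt.subList(alpha.size(), alt.size())));
        }
        result.put(a, List.of(head));
        result.put(fresh, tails);
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> s = List.of(
            List.of("if", "b", "then", "S", "else", "S"),
            List.of("if", "b", "then", "S"));
        System.out.println(factor("S", s));
    }
}
```

For the if-then-else productions this gives S → if b then S S′ with S′ → else S | ε, so the choice is deferred until after the common prefix has been read.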

29
Non-Context-Free Language Constructs
  • Examples of non-context-free languages are
  • L1 = { wcw | w is of the form (a|b)* }
  • L2 = { a^n b^m c^n d^m | n ≥ 1 and m ≥ 1 }
  • L3 = { a^n b^n c^n | n ≥ 0 }
  • Languages similar to these that are context free
  • L1′ = { wcw^R | w is of the form (a|b)* } (w^R stands
    for w reversed)
  • This language is generated by the grammar
  • S → aSa | bSb | c
  • L2′ = { a^n b^m c^m d^n | n ≥ 1 and m ≥ 1 }
  • This language is generated by the grammar
  • S → aSd | aAd
  • A → bAc | bc

30
Non-Context-Free Language Constructs (Cont.)
  • L2″ = { a^n b^n c^m d^m | n ≥ 1 and m ≥ 1 }
  • is generated by the grammar
  • S → AB
  • A → aAb | ab
  • B → cBd | cd
  • L3′ = { a^n b^n | n ≥ 1 }
  • is generated by the grammar
  • S → aSb | ab
  • This language is not definable by any regular
    expression

31
Non-Context-Free Language Constructs (Cont.)
  • Suppose we could construct a DFSM D accepting
    L3′ = { a^n b^n | n ≥ 1 }.
  • D must have a finite number of states, say k.
  • Consider the sequence of states s0, s1, s2, …, sk
    entered by D having read ε, a, aa, …, a^k.
  • Since D only has k states, two of the states in
    the sequence have to be equal. Say, si = sj with
    i ≠ j, and let i be the larger index, so i ≥ 1.
  • From si, a sequence of i b's leads to an accepting
    (final) state. Therefore, the same sequence of i
    b's will also lead to an accepting state from sj.
    Therefore D would accept a^j b^i, which means that
    the language accepted by D is not identical to
    L3′. A contradiction.

32
Parsing
  • The parsing problem is: given a string of tokens
    w, find a parse tree whose frontier is w.
    (Equivalently, find a derivation of w.)
  • A parser for a grammar G reads a list of tokens
    and finds a parse tree if they form a sentence
    (or reports an error otherwise)
  • Two classes of algorithms for parsing
  • Top-down
  • Bottom-up

33
Parser generators
  • A parser generator is a program that reads a
    grammar and produces a parser
  • The best known parser generator is yacc. It
    produces bottom-up parsers
  • Most parser generators, including yacc, do not
    work for every CFG; they accept a restricted
    class of CFGs that can be parsed efficiently
    using the method employed by that parser generator

34
Top-down parsing
  • Starting from parse tree containing just S, build
    tree down toward input. Expand left-most
    non-terminal.
  • Algorithm (next slide)

35
Top-down parsing (cont.)
  • Let input = a1a2...an
  • current sentential form (csf) = S
  • loop {
  •   suppose csf = a1…ak A α
  •   based on ak+1, choose production
  •   A → β
  •   csf becomes a1…ak β α
  • }

36
Top-down parsing example
  • Grammar H: L → E L | E,
    E → a | b
  • Input: ab
  • Parse tree / Sentential form / Input (parse-tree figures not reproduced)

Sentential form    Input
L                  ab
E L                ab
a L                ab
37
Top-down parsing example (cont.)
  • Parse tree / Sentential form / Input (parse-tree figures not reproduced)

Sentential form    Input
a E                ab
a b                ab
38
LL(1) parsing
  • Efficient form of top-down parsing
  • Use only first symbol of remaining input (ak+1)
    to choose next production. That is, employ a
    function M: Σ × N → P in the "choose production"
    step of the algorithm.
  • When this is possible, grammar is called LL(1)

39
LL(1) examples
  • Example 1
  • H: L → E L | E,  E → a | b
  • Given input ab, so next symbol is a.
  • Which production to use? Can't tell.
  • ⇒ H not LL(1)

40
LL(1) examples
  • Example 2
  • Exp → Term Exp′
  • Exp′ → $ | + Exp
  • Term → id
  • (Use $ for end-of-input symbol.)

Grammar is LL(1): Exp and Term have only one
production; Exp′ has two productions but only
one is applicable at any time.
41
Nonrecursive predictive parsing
  • Maintain a stack explicitly, rather than
    implicitly via recursive calls
  • Key problem during predictive parsing
    determining the production to be applied for a
    non-terminal

42
Nonrecursive predictive parsing
  • Algorithm. Nonrecursive predictive parsing
  • Set ip to point to the first symbol of w$.
  • repeat
  •   Let X be the top of the stack symbol and a the
      symbol pointed to by ip
  •   if X is a terminal or $ then
  •     if X = a then
  •       pop X from the stack and advance ip
  •     else error()
  •   else // X is a nonterminal
  •     if M[X,a] = X → Y1 Y2 … Yk then
  •       pop X from the stack
  •       push Yk, Yk−1, …, Y1 onto the stack with Y1 on
          top
  •       (push nothing if Y1 Y2 … Yk is ε)
  •       output the production X → Y1 Y2 … Yk
  •     else error()
  • until X = $
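
The loop above can be sketched as a small table-driven parser. This is an illustrative sketch only: the grammar E → T E′, E′ → + T E′ | ε, T → id and its hand-written table are assumptions chosen for brevity (they are not the deck's example), the class name PredictiveParser is made up, and the stack is assumed to start as $ with the start symbol on top.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PredictiveParser {
    static final Set<String> NONTERMINALS = Set.of("E", "E'", "T");
    // M[X, a] keyed as "X a"; the value is the right-hand side (empty list = epsilon).
    static final Map<String, List<String>> TABLE = Map.of(
        "E id", List.of("T", "E'"),
        "E' +", List.of("+", "T", "E'"),
        "E' $", List.<String>of(),
        "T id", List.of("id"));

    // Returns the productions of a leftmost derivation, or throws on a syntax error.
    static List<String> parse(List<String> tokens) {
        List<String> input = new ArrayList<>(tokens);
        input.add("$");                        // the input is w$
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");
        stack.push("E");                       // start symbol on top
        List<String> output = new ArrayList<>();
        int ip = 0;
        while (true) {
            String x = stack.peek();
            String a = input.get(ip);
            if (!NONTERMINALS.contains(x)) {   // X is a terminal or $
                if (!x.equals(a)) throw new IllegalArgumentException("unexpected " + a);
                if (x.equals("$")) return output;   // stack and input both exhausted
                stack.pop();
                ip++;
            } else {                           // X is a nonterminal
                List<String> rhs = TABLE.get(x + " " + a);
                if (rhs == null) throw new IllegalArgumentException("unexpected " + a);
                stack.pop();
                // Push Yk ... Y1 so that Y1 ends up on top.
                for (int i = rhs.size() - 1; i >= 0; i--) stack.push(rhs.get(i));
                output.add(x + " -> " + (rhs.isEmpty() ? "epsilon" : String.join(" ", rhs)));
            }
        }
    }

    public static void main(String[] args) {
        parse(List.of("id", "+", "id")).forEach(System.out::println);
    }
}
```

On the input id + id the sketch outputs five productions, starting with E -> T E' and ending with E' -> epsilon; an input such as id id is rejected because M[E′, id] is an error entry.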

43
LL(1) grammars
  • No left recursion
  • A → Aα: If this production is chosen, parse
    makes no progress.
  • No common prefixes
  • A → αβ | αγ
  • Can fix by left factoring
  • A → αA′
  • A′ → β | γ

44
LL(1) grammars (cont.)
  • No ambiguity
  • Precise definition requires that production to
    choose be unique (choose function M very hard
    to calculate otherwise)

45
Top-down Parsing
[Figure: the start symbol L is the root of the parse tree, with children E0 … En, above the input tokens <t0, t1, …, ti, ...>; a second panel shows the tree over the remaining input tokens <ti, ...>.]
From left to right, grow the parse tree
downwards
46
Checking LL(1)-ness
  • For any sequence of grammar symbols α, define set
    FIRST(α) ⊆ Σ to be
  • FIRST(α) = { a | α ⇒* aβ for some β }

47
LL(1) definition
  • Define: Grammar G = (N, Σ, P, S) is LL(1) iff
    whenever there are two left-most derivations (in
    which the leftmost non-terminal is always
    expanded first)
  • S ⇒* wAα ⇒ wβα ⇒* wtx
  • S ⇒* wAα ⇒ wγα ⇒* wty
  • it follows that β = γ
  • In other words, given
  • 1. a string wAα in V* and
  • 2. t, the first terminal symbol to be derived
    from Aα
  • there is at most one production that can be
    applied to A to
  • yield a derivation of any terminal string
    beginning with wt
  • FIRST sets can often be calculated by inspection

48
FIRST Sets
Exp → Term Exp′    Exp′ → $ | + Exp    Term → id
(Use $ for end-of-input symbol)
FIRST($) = {$}    FIRST(+ Exp) = {+}
FIRST($) ∩ FIRST(+ Exp) = ∅ ⇒ grammar is LL(1)
49
FIRST Sets
L → E L | E    E → a | b

FIRST(E L) = {a, b} = FIRST(E)
FIRST(E L) ∩ FIRST(E) ≠ ∅ ⇒ grammar not LL(1).
50
Computing FIRST Sets
  • Algorithm. Compute FIRST(X) for all grammar
    symbols X
  • forall X ∈ V do FIRST(X) = {}
  • forall X ∈ Σ (X is a terminal) do FIRST(X) = {X}
  • forall productions X → ε do FIRST(X) = FIRST(X)
    ∪ {ε}
  • repeat
  •   c: forall productions X → Y1Y2…Yk do
  •     forall i ∈ [1,k] do
  •       FIRST(X) = FIRST(X) ∪ (FIRST(Yi) − {ε})
  •       if ε ∉ FIRST(Yi) then continue c
  •     FIRST(X) = FIRST(X) ∪ {ε}
  • until no more terminals or ε are added to any
    FIRST set
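
The fixed-point computation above can be sketched as follows. The representation is an assumption made for illustration: a grammar is a map from each nonterminal to its alternatives, any symbol that is not a key is treated as a terminal, an empty alternative is an ε-production, and the string "eps" stands for ε. The class name FirstSets is made up.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FirstSets {
    static final String EPS = "eps";   // marker for epsilon (an assumption)

    static Map<String, Set<String>> first(Map<String, List<List<String>>> grammar) {
        Map<String, Set<String>> first = new LinkedHashMap<>();
        for (String nt : grammar.keySet()) first.put(nt, new TreeSet<>());
        boolean changed = true;
        while (changed) {                          // repeat until no set grows
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : grammar.entrySet()) {
                Set<String> fx = first.get(e.getKey());
                for (List<String> rhs : e.getValue()) {
                    boolean allNullable = true;    // true while Y1..Yi all derive epsilon
                    for (String y : rhs) {
                        if (!grammar.containsKey(y)) {   // terminal: FIRST(y) = {y}
                            changed |= fx.add(y);
                            allNullable = false;
                            break;
                        }
                        // Copy first, since y may be the nonterminal being updated.
                        for (String t : new TreeSet<>(first.get(y))) {
                            if (!t.equals(EPS)) changed |= fx.add(t);
                        }
                        if (!first.get(y).contains(EPS)) {
                            allNullable = false;
                            break;
                        }
                    }
                    if (allNullable) changed |= fx.add(EPS);
                }
            }
        }
        return first;
    }

    public static void main(String[] args) {
        // A -> T x | T y,  T -> w | epsilon
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("A", List.of(List.of("T", "x"), List.of("T", "y")));
        g.put("T", List.of(List.of("w"), List.<String>of()));
        System.out.println(first(g));
    }
}
```

For the grammar A → T x | T y, T → w | ε, the sketch computes FIRST(A) = {w, x, y} and FIRST(T) = {w, ε}.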

51
FIRST Sets of Strings of Symbols
  • FIRST(X1X2…Xn) is the union of FIRST(X1) and all
    FIRST(Xi) such that ε ∈ FIRST(Xk) for k = 1, 2,
    …, i−1 (ε itself excluded)
  • FIRST(X1X2…Xn) contains ε iff ε ∈ FIRST(Xk) for k =
    1, 2, …, n

52
FIRST Sets do not Suffice
  • Given the productions
  • A → T x
  • A → T y    T → w    T → ε
  • T → w should be applied when the next input token
    is w.
  • T → ε should be applied whenever the next terminal
    is either x or y

53
FOLLOW Sets
  • For any nonterminal X, define the set FOLLOW(X) ⊆
    Σ as
  • FOLLOW(X) = { a | S ⇒* αXaβ }

54
Computing the FOLLOW Set
  • Algorithm. Compute FOLLOW(X) for all nonterminals
    X
  • FOLLOW(S) = {$}
  • forall productions A → αBβ do FOLLOW(B) = FOLLOW(B)
    ∪ (FIRST(β) − {ε})
  • repeat
  •   forall productions A → αB or A → αBβ with ε ∈
      FIRST(β) do
  •     FOLLOW(B) = FOLLOW(B) ∪ FOLLOW(A)
  • until all FOLLOW sets remain the same
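
A sketch of this computation follows. To stay short it assumes the FIRST sets of the nonterminals are already available and passed in, and it folds the initial FIRST(β) pass into the repeat loop, which gives the same result. The class name FollowSets, the map-based grammar representation, and the "eps" marker for ε are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FollowSets {
    static final String EPS = "eps";   // marker for epsilon (an assumption)

    // FIRST of a string of symbols, given FIRST of each nonterminal.
    static Set<String> firstOf(List<String> symbols, Map<String, Set<String>> first) {
        Set<String> result = new TreeSet<>();
        for (String y : symbols) {
            Set<String> fy = first.containsKey(y) ? first.get(y) : Set.of(y);
            for (String t : fy) if (!t.equals(EPS)) result.add(t);
            if (!fy.contains(EPS)) return result;
        }
        result.add(EPS);               // every symbol can vanish, or the string is empty
        return result;
    }

    static Map<String, Set<String>> follow(Map<String, List<List<String>>> grammar,
                                           Map<String, Set<String>> first, String start) {
        Map<String, Set<String>> follow = new LinkedHashMap<>();
        for (String nt : grammar.keySet()) follow.put(nt, new TreeSet<>());
        follow.get(start).add("$");
        boolean changed = true;
        while (changed) {                                     // until all sets are stable
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : grammar.entrySet()) {
                for (List<String> rhs : e.getValue()) {
                    for (int i = 0; i < rhs.size(); i++) {
                        String b = rhs.get(i);
                        if (!grammar.containsKey(b)) continue;   // only nonterminals
                        Set<String> fb = firstOf(rhs.subList(i + 1, rhs.size()), first);
                        for (String t : fb) {
                            if (!t.equals(EPS)) changed |= follow.get(b).add(t);
                        }
                        if (fb.contains(EPS)) {               // beta is empty or nullable
                            changed |= follow.get(b).addAll(new TreeSet<>(follow.get(e.getKey())));
                        }
                    }
                }
            }
        }
        return follow;
    }

    public static void main(String[] args) {
        // A -> T x | T y,  T -> w | epsilon
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("A", List.of(List.of("T", "x"), List.of("T", "y")));
        g.put("T", List.of(List.of("w"), List.<String>of()));
        Map<String, Set<String>> first = Map.of("A", Set.of("w", "x", "y"), "T", Set.of("w", EPS));
        System.out.println(follow(g, first, "A"));
    }
}
```

For the grammar A → T x | T y, T → w | ε, this gives FOLLOW(A) = {$} and FOLLOW(T) = {x, y}, which is exactly the information needed to decide when T → ε applies.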

55
Construction of a predictive parsing table
  • Algorithm. Construction of a predictive parsing
    table
  • M[:, :] = {}
  • forall productions A → α do
  •   forall terminals a ∈ FIRST(α) do
  •     M[A,a] = M[A,a] ∪ {A → α}
  •   if ε ∈ FIRST(α) then
  •     forall b ∈ FOLLOW(A) do
  •       M[A,b] = M[A,b] ∪ {A → α}
  • Make all empty entries of M be error
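
The table construction can be sketched as below. To keep the example short, FIRST(α) for each production and the FOLLOW sets are supplied by hand rather than computed; the grammar E → T E′, E′ → + T E′ | ε, T → id, the class names ParseTable and Production, and the "eps" marker are all assumptions for illustration. Cells absent from the map play the role of the error entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class ParseTable {
    static final String EPS = "eps";   // marker for epsilon (an assumption)

    // A production A -> alpha together with a precomputed FIRST(alpha).
    static final class Production {
        final String lhs;
        final String rhs;
        final Set<String> first;
        Production(String lhs, String rhs, Set<String> first) {
            this.lhs = lhs;
            this.rhs = rhs;
            this.first = first;
        }
        @Override public String toString() { return lhs + " -> " + rhs; }
    }

    // Builds M as a map from "A,a" to the productions entered in that cell.
    static Map<String, List<Production>> build(List<Production> productions,
                                               Map<String, Set<String>> follow) {
        Map<String, List<Production>> m = new TreeMap<>();
        for (Production p : productions) {
            for (String a : p.first) {             // terminals in FIRST(alpha)
                if (!a.equals(EPS)) {
                    m.computeIfAbsent(p.lhs + "," + a, k -> new ArrayList<>()).add(p);
                }
            }
            if (p.first.contains(EPS)) {           // alpha can vanish: use FOLLOW(A)
                for (String b : follow.get(p.lhs)) {
                    m.computeIfAbsent(p.lhs + "," + b, k -> new ArrayList<>()).add(p);
                }
            }
        }
        return m;
    }

    // No cell may hold more than one production if the table is to drive an LL(1) parser.
    static boolean isLL1(Map<String, List<Production>> m) {
        return m.values().stream().allMatch(cell -> cell.size() == 1);
    }

    public static void main(String[] args) {
        List<Production> ps = List.of(
            new Production("E", "T E'", Set.of("id")),
            new Production("E'", "+ T E'", Set.of("+")),
            new Production("E'", "eps", Set.of(EPS)),
            new Production("T", "id", Set.of("id")));
        Map<String, Set<String>> follow = Map.of(
            "E", Set.of("$"), "E'", Set.of("$"), "T", Set.of("+", "$"));
        Map<String, List<Production>> m = build(ps, follow);
        m.forEach((cell, prods) -> System.out.println("M[" + cell + "] = " + prods));
        System.out.println("LL(1): " + isLL1(m));
    }
}
```

Feeding in grammar H instead (L → E L | E, both with FIRST = {a, b}) puts two productions into M[L, a] and M[L, b], which is the table-level view of why H is not LL(1).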

56
Another Definition of LL(1)
  • Define: Grammar G is LL(1) if for every A ∈ N
    with productions A → α1 | … | αn
  • FIRST(αi FOLLOW(A)) ∩ FIRST(αj FOLLOW(A)) = ∅
    for all i ≠ j

57
Regular Languages
  • Definition. A regular grammar is one whose
    productions are all of the type
  • A → aB
  • A → a
  • A Regular Expression is either
  • a
  • R1 | R2
  • R1 R2
  • R*

58
Nondeterministic Finite State Automaton
[Figure: NFA with a start arrow, states 0 to 3, and transitions labeled a and b.]
59
Regular Languages
  • Theorem. The classes of languages
  • Generated by a regular grammar
  • Expressed by a regular expression
  • Recognized by a NDFS automaton
  • Recognized by a DFS automaton
  • coincide.

60
Deterministic Finite Automaton
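[Figure: DFA for a scanner with states START, NUM, KEYWORD, and OPERATOR. Edge labels: space, tab, new line; digit; letter; and the operator characters -, /, (, ) (the first two operator labels are missing from the transcript).]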
61
Scanner code
  • state = start
  • loop
  •   if no input character buffered then read
      one, and add it to the accumulated token
  •   case state of
  •     start:
  •       case input_char of
  •         A..Z, a..z: state = id
  •         0..9: state = num
  •         else ...
  •       end
  •     id:
  •       case input_char of
  •         A..Z, a..z: state = id
  •         0..9: state = id
  •         else ...
  •       end
  •     num:
  •       case input_char of
  •         0..9: ...
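
The state machine above can be sketched in Java. The slide elides the else branches, so this sketch fills them with assumptions: whitespace separates tokens, a non-matching character ends the current token, and any other character is an error. The class name TinyScanner and the "id:" and "num:" token prefixes are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class TinyScanner {
    enum State { START, ID, NUM }

    // Splits the input into identifier and number tokens.
    static List<String> scan(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder token = new StringBuilder();
        State state = State.START;
        for (int i = 0; i <= text.length(); i++) {
            // A trailing blank acts as a sentinel that flushes the last token.
            char c = i < text.length() ? text.charAt(i) : ' ';
            boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            boolean digit = c >= '0' && c <= '9';
            switch (state) {
                case START:
                    if (letter) { state = State.ID; token.append(c); }
                    else if (digit) { state = State.NUM; token.append(c); }
                    else if (!Character.isWhitespace(c)) {
                        throw new IllegalArgumentException("bad char " + c);
                    }
                    break;
                case ID:
                    if (letter || digit) {
                        token.append(c);
                    } else {               // token ends; re-read c in the START state
                        tokens.add("id:" + token);
                        token.setLength(0);
                        state = State.START;
                        i--;
                    }
                    break;
                case NUM:
                    if (digit) {
                        token.append(c);
                    } else {               // token ends; re-read c in the START state
                        tokens.add("num:" + token);
                        token.setLength(0);
                        state = State.START;
                        i--;
                    }
                    break;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("count 42 x1"));
    }
}
```

Each case of the switch corresponds to one state of the DFA, and each branch on the character class corresponds to one outgoing edge.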

62
Table-driven DFA
63
Language Classes
[Figure: language classes, listed as L0, CSL, CFL (NPA), LR(1), LL(1), RL (DFA = NFA).]
64
Question
  • Are regular expressions, as provided by Perl or
    other languages, sufficient for parsing nested
    structures, e.g. XML files?

65
Recursive Descent Parser
  • stat → var = expr ;
  • expr → term [+ expr]
  • term → factor [* factor]
  • factor → ( expr ) | var | constant
  • var → identifier

66
Scanner
    public class Scanner {
        private StreamTokenizer input;
        private Type lastToken;
        public enum Type { INVALID_CHAR, NO_TOKEN, PLUS,
            // etc. for remaining tokens, then
            EOF }
        public Scanner(Reader r) {
            input = new StreamTokenizer(r);
            input.resetSyntax();
            input.eolIsSignificant(false);
            input.wordChars('a', 'z');
            input.wordChars('A', 'Z');
            // The next three character literals were lost in the transcript;
            // '+', '*' and '=' are assumed here.
            input.ordinaryChar('+');
            input.ordinaryChar('*');
            input.ordinaryChar('=');
            input.ordinaryChar('(');
            // ... (the slide ends here)

67
Scanner
    public Type nextToken() {
        Type token;
        try {
            switch (input.nextToken()) {
            case StreamTokenizer.TT_EOF:
                token = Type.EOF;
                break;
            case StreamTokenizer.TT_WORD:
                if (input.sval.equalsIgnoreCase("false"))
                    token = Type.FALSE;
                else if (input.sval.equalsIgnoreCase("true"))
                    token = Type.TRUE;
                else
                    token = Type.VARIABLE;
                break;
            case '+':
                token = Type.PLUS;
                break;
            // ... (remaining cases, catch block and return are not on the slide)

68
Parser
    public class Parser {
        private LexicalAnalyzer lexer;
        private Type token;
        public Expr parse(Reader r) throws SyntaxException {
            lexer = new LexicalAnalyzer(r);
            nextToken(); // assigns token
            Statement stat = statement();
            expect(LexicalAnalyzer.EOF);
            return stat;
        }

69
Statement
    // stat ::= variable '=' expr ';'
    private Statement stat() throws SyntaxException {
        Expr var = variable();
        expect(LexicalAnalyzer.ASSIGN);
        Expr exp = expr();
        Statement stat = new Statement(var, exp);
        expect(LexicalAnalyzer.SEMICOLON);
        return stat;
    }

70
Expr
    // expr ::= term ['+' expr]
    private Expr expr() throws SyntaxException {
        Expr exp = term();
        while (token == LexicalAnalyzer.PLUS) {
            nextToken();
            exp = new Exp(exp, expression());
        }
        return exp;
    }

71
Term
    // term ::= factor ['*' term]
    private Expr term() throws SyntaxException {
        Expr exp = factor();
        // Rest of body left as an exercise.
    }

72
Factor
    // factor ::= '(' expr ')' | var
    private Expr factor() throws SyntaxException {
        Expr exp = null;
        if (token == LexicalAnalyzer.LEFT_PAREN) {
            nextToken();
            exp = expression();
            expect(LexicalAnalyzer.RIGHT_PAREN);
        } else {
            exp = variable();
        }
        return exp;
    }

73
Variable
    // variable ::= identifier
    private Expr variable() throws SyntaxException {
        if (token == LexicalAnalyzer.ID) {
            Expr exp = new Variable(lexer.getString());
            nextToken();
            return exp;
        }
        // otherwise: report a syntax error (not shown on the slide)
    }

74
Constant
    private Expr constantExpression() throws SyntaxException {
        Expr exp = null;
        // Handle the various cases for constant
        // expressions: left as an exercise.
        return exp;
    }

75
Utilities
    private void expect(Type t) throws SyntaxException {
        if (token != t) {
            // throw SyntaxException...
        }
        nextToken();
    }
    private void nextToken() {
        token = lexer.nextToken();
    }