Title: Parsing
1Parsing
Giuseppe Attardi Università di Pisa
2Parsing
- Calculate the grammatical structure of a program, like diagramming sentences, where
- Tokens = words
- Programs = sentences
For further information: Aho, Sethi, Ullman, Compilers: Principles, Techniques, and Tools (a.k.a. the Dragon Book)
3Outline of coverage
- Context-free grammars
- Parsing
- Tabular Parsing Methods
- One pass
- Top-down
- Bottom-up
- Yacc
4Parser extracts grammatical structure of program
[Figure: parse tree of a function definition. function-def has children name (main), arguments, and stmt-list; the stmt is an expression built from the variable cout, the operator <<, and the string "hello, world\n"]
5Context-free languages
- Grammatical structure defined by context-free grammar
- statement → labeled-statement | expression-statement | compound-statement
- labeled-statement → ident : statement | case constant-expression : statement
- compound-statement → { declaration-list statement-list }
Context-free = only one non-terminal in left-part
Legend: terminal, non-terminal
6Parse trees
- Parse tree = tree labeled with grammar symbols, such that
- If a node is labeled A, and its children are labeled x1...xn, then there is a production A → x1...xn
- Parse tree from A = root labeled with A
- Complete parse tree = all leaves labeled with tokens
7Parse trees and sentences
- Frontier of tree = labels on leaves (in left-to-right order)
- Frontier of tree from S is a sentential form
- Frontier of a complete tree from S is a sentence
8Example
- G: L → L ; E | E,  E → a | b
- Syntax trees from start symbol (L)
Sentential forms
9Derivations
- Alternate definition of sentence
- Given α, β in V*, say α ⇒ β is a derivation step if α = α'Aα'' and β = α'γα'', where A → γ is a production
- α is a sentential form iff there exists a derivation (sequence of derivation steps) S ⇒ … ⇒ α (alternatively, we say that S ⇒* α)
Two definitions are equivalent, but note that there are many derivations corresponding to each parse tree
10Another example
[Figure: parse trees with interior nodes L and E and leaves a and b]
11Ambiguity
- For some purposes, it is important to know whether a sentence can have more than one parse tree
- A grammar is ambiguous if there is a sentence with more than one parse tree
- Example: E → E+E | E*E | id
12Notes
- If e then if b then d else f
- int x y 0
- A.b.c d
- Id -> s | s.id
- E -> E + T -> E + T + T -> T + T + T -> id + T + T -> id + T * id + T -> id + id * id + T -> id + id * id + id
13Ambiguity
- Ambiguity is a function of the grammar rather than the language
- Certain ambiguous grammars may have equivalent unambiguous ones
14Grammar Transformations
- Grammars can be transformed without affecting the language generated
- Three transformations are discussed next
- Eliminating Ambiguity
- Eliminating Left Recursion (i.e. productions of the form A → Aα)
- Left Factoring
15Eliminating Ambiguity
- Sometimes an ambiguous grammar can be rewritten to eliminate ambiguity
- For example, expressions involving additions and products can be written as follows
- E → E + T | T
- T → T * id | id
- The language generated by this grammar is the same as that generated by the grammar in slide "Ambiguity". Both generate id(+id|*id)*
- However, this grammar is not ambiguous
16Eliminating Ambiguity (Cont.)
- One advantage of this grammar is that it represents the precedence between operators. In the parse tree, products appear nested within additions
17Eliminating Ambiguity (Cont.)
- An example of ambiguity in a programming language is the dangling else
- Consider
- S → if b then S else S | if b then S | a
18Eliminating Ambiguity (Cont.)
- When there are two nested ifs and only one else, the else can be attached to either if, so the sentence has two parse trees
19Eliminating Ambiguity (Cont.)
- In most languages (including C and Java), each else is assumed to belong to the nearest if that is not already matched by an else. This association is expressed in the following (unambiguous) grammar:
- S → Matched | Unmatched
- Matched → if b then Matched else Matched | a
- Unmatched → if b then S | if b then Matched else Unmatched
20Eliminating Ambiguity (Cont.)
- Ambiguity is a property of the grammar
- It is undecidable whether a context-free grammar is ambiguous
- The proof is done by reduction from Post's correspondence problem
- Although there is no general algorithm, it is possible to isolate certain constructs in productions which lead to ambiguous grammars
21Eliminating Ambiguity (Cont.)
- For example, a grammar containing the productions A → AA | α would be ambiguous, because the substring ααα has two parses
[Figure: the two parse trees of ααα, one grouping the first two α's under an inner A, the other grouping the last two]
- This ambiguity disappears if we use the productions
- A → AB | B and B → α
- or the productions
- A → BA | B and B → α
22Eliminating Ambiguity (Cont.)
- Examples of ambiguous productions
- A → AaA
- A → aA | Ab
- A → aA | aAbA
- A CF language is inherently ambiguous if it has no unambiguous CFG
- An example of such a language is
- L = { a^i b^j c^m | i = j or j = m }, which can be generated by the grammar
- S → AB | DC
- A → aA | ε    C → cC | ε
- B → bBc | ε   D → aDb | ε
23Elimination of Left Recursion
- A grammar is left recursive if it has a nonterminal A and a derivation A ⇒+ Aα for some string α.
- Top-down parsing methods cannot handle left-recursive grammars, so a transformation to eliminate left recursion is needed
- Immediate left recursion (productions of the form A → Aα) can be easily eliminated
- 1. Group the A-productions as
- A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
- where no βi begins with A
- 2. Replace the A-productions by
- A → β1A' | β2A' | … | βnA'
- A' → α1A' | α2A' | … | αmA' | ε
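The two steps above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name `ImmediateLeftRecursion`, the method `eliminate`, the representation of an alternative as a list of symbol strings, and the spelling `eps` for ε are all hypothetical choices, not part of the slides.

```java
import java.util.ArrayList;
import java.util.List;

class ImmediateLeftRecursion {
    /**
     * Rewrites the alternatives of nonterminal a, assuming at least one of
     * them is immediately left recursive. Each alternative is a list of
     * symbols; an empty list stands for epsilon.
     * Returns two rule lines: one for a and one for the fresh nonterminal a'.
     */
    static List<String> eliminate(String a, List<List<String>> alternatives) {
        String fresh = a + "'";
        List<String> newA = new ArrayList<>();     // alternatives beta_i A'
        List<String> newFresh = new ArrayList<>(); // alternatives alpha_i A'
        for (List<String> alt : alternatives) {
            if (!alt.isEmpty() && alt.get(0).equals(a)) {
                // A -> A alpha   becomes   A' -> alpha A'
                List<String> alpha = new ArrayList<>(alt.subList(1, alt.size()));
                alpha.add(fresh);
                newFresh.add(String.join(" ", alpha));
            } else {
                // A -> beta   becomes   A -> beta A'
                List<String> beta = new ArrayList<>(alt);
                beta.add(fresh);
                newA.add(String.join(" ", beta));
            }
        }
        newFresh.add("eps"); // A' -> epsilon ends the repetition
        return List.of(
            a + " -> " + String.join(" | ", newA),
            fresh + " -> " + String.join(" | ", newFresh));
    }
}
```

For the productions E → E + T | T, calling `eliminate("E", List.of(List.of("E", "+", "T"), List.of("T")))` yields the two rules `E -> T E'` and `E' -> + T E' | eps`.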
24Elimination of Left Recursion (Cont.)
- The previous transformation, however, does not eliminate left recursion involving two or more steps
- For example, consider the grammar
- S → Aa | b
- A → Ac | Sd | ε
- S is left-recursive because S ⇒ Aa ⇒ Sda, but it is not immediately left recursive
25Elimination of Left Recursion (Cont.)
- Algorithm. Eliminate left recursion
- Arrange nonterminals in some order A1, A2, …, An
- for i = 1 to n {
-   for j = 1 to i - 1 {
-     replace each production of the form Ai → Aj γ
-     by the productions Ai → δ1 γ | δ2 γ | … | δk γ
-     where Aj → δ1 | δ2 | … | δk are all the current Aj-productions
-   }
-   eliminate the immediate left recursion among the Ai-productions
- }
26Elimination of Left Recursion (Cont.)
- To show that the previous algorithm actually works, notice that iteration i only changes productions with Ai on the left-hand side, and that afterwards m > i in all productions of the form Ai → Am α
- Induction proof
- Clearly true for i = 1
- If it is true for all i < k, then when the outer loop is executed for i = k, the inner loop will remove all productions Ai → Am α with m < i
- Finally, with the elimination of self recursion, m in the Ai → Am α productions is forced to be > i
- At the end of the algorithm, all derivations of the form Ai ⇒ Am α will have m > i and therefore left recursion would not be possible
27Left Factoring
- Left factoring helps transform a grammar for predictive parsing
- For example, if we have the two productions
- S → if b then S else S
-   | if b then S
- on seeing the input token if, we cannot immediately tell which production to choose to expand S
- In general, if we have A → αβ1 | αβ2 and the input begins with a nonempty string derived from α, we do not know (without looking further) which production to use to expand A
28Left Factoring (Cont.)
- However, we may defer the decision by expanding A to αA'
- Then after seeing the input derived from α, we may expand A' to β1 or to β2
- Left-factored, the original productions become
- A → α A'
- A' → β1 | β2
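A minimal Java sketch of left factoring for exactly two alternatives is given below. The names `LeftFactoring` and `factor`, the list-of-symbols representation, and the spelling `eps` for ε are assumptions made for illustration; a full implementation would handle any number of alternatives and repeat until no common prefix remains.

```java
import java.util.List;

class LeftFactoring {
    /**
     * Factors the longest common prefix alpha out of two alternatives of a.
     * Returns the rewritten rules; if there is no common prefix, the
     * original rule is returned unchanged.
     */
    static List<String> factor(String a, List<String> alt1, List<String> alt2) {
        int n = 0; // length of the common prefix alpha
        while (n < alt1.size() && n < alt2.size() && alt1.get(n).equals(alt2.get(n))) {
            n++;
        }
        if (n == 0) {
            return List.of(a + " -> " + join(alt1) + " | " + join(alt2));
        }
        String fresh = a + "'";
        return List.of(
            // A -> alpha A'
            a + " -> " + join(alt1.subList(0, n)) + " " + fresh,
            // A' -> beta1 | beta2
            fresh + " -> " + join(alt1.subList(n, alt1.size()))
                  + " | " + join(alt2.subList(n, alt2.size())));
    }

    private static String join(List<String> symbols) {
        return symbols.isEmpty() ? "eps" : String.join(" ", symbols);
    }
}
```

Applied to the two if-productions of S, the common prefix is `if b then S`, and the result is `S -> if b then S S'` together with `S' -> else S | eps`.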
29Non-Context-Free Language Constructs
- Examples of non-context-free languages are
- L1 = { wcw | w is of the form (a|b)* }
- L2 = { a^n b^m c^n d^m | n ≥ 1 and m ≥ 1 }
- L3 = { a^n b^n c^n | n ≥ 0 }
- Languages similar to these that are context free
- L1' = { wcw^R | w is of the form (a|b)* } (w^R stands for w reversed)
- This language is generated by the grammar
- S → aSa | bSb | c
- L2' = { a^n b^m c^m d^n | n ≥ 1 and m ≥ 1 }
- This language is generated by the grammar
- S → aSd | aAd
- A → bAc | bc
30Non-Context-Free Language Constructs (Cont.)
- L2'' = { a^n b^n c^m d^m | n ≥ 1 and m ≥ 1 }
- is generated by the grammar
- S → AB
- A → aAb | ab
- B → cBd | cd
- L3' = { a^n b^n | n ≥ 1 }
- is generated by the grammar
- S → aSb | ab
- This language is not definable by any regular expression
31Non-Context-Free Language Constructs (Cont.)
- Suppose we could construct a DFSM D accepting the language L3' = { a^n b^n | n ≥ 1 }.
- D must have a finite number of states, say k.
- Consider the sequence of states s0, s1, s2, …, sk entered by D having read ε, a, aa, …, a^k.
- Since D only has k states, two of the states in the sequence have to be equal. Say, si = sj (i ≠ j).
- From si, a sequence of i b's leads to an accepting (final) state. Therefore, the same sequence of i b's will also lead to an accepting state from sj. Therefore D would accept a^j b^i, which means that the language accepted by D is not identical to L3'. A contradiction.
32Parsing
- The parsing problem is: given a string of tokens w, find a parse tree whose frontier is w. (Equivalently, find a derivation of w)
- A parser for a grammar G reads a list of tokens and finds a parse tree if they form a sentence (or reports an error otherwise)
- Two classes of algorithms for parsing
- Top-down
- Bottom-up
33Parser generators
- A parser generator is a program that reads a grammar and produces a parser
- The best known parser generator is yacc. It produces bottom-up parsers
- Most parser generators, including yacc, do not work for every CFG; they accept a restricted class of CFGs that can be parsed efficiently using the method employed by that parser generator
34Top-down parsing
- Starting from a parse tree containing just S, build the tree down toward the input. Expand the left-most non-terminal.
- Algorithm (next slide)
35Top-down parsing (cont.)
- Let input = a1a2...an
- current sentential form (csf) = S
- loop
-   suppose csf = a1…ak A α
-   based on ak+1…, choose production A → β
-   csf becomes a1…ak β α
36Top-down parsing example
- Grammar H: L → E ; L | E,  E → a | b
- Input: a;b
- Sentential form | Input (the parse tree column of the slide is a figure)
- L | a;b
- E;L | a;b
- a;L | a;b
37Top-down parsing example (cont.)
- Sentential form | Input (the parse tree column of the slide is a figure)
- a;E | a;b
- a;b | a;b
38LL(1) parsing
- Efficient form of top-down parsing
- Use only the first symbol of the remaining input (ak+1) to choose the next production. That is, employ a function M: Σ × N → P in the "choose production" step of the algorithm.
- When this is possible, the grammar is called LL(1)
39LL(1) examples
- Example 1
- H: L → E ; L | E,  E → a | b
- Given input a;b, so next symbol is a.
- Which production to use? Can't tell.
- ⇒ H not LL(1)
40LL(1) examples
- Example 2
- Exp → Term Exp'
- Exp' → $ | + Exp
- Term → id
- (Use $ for end-of-input symbol.)
Grammar is LL(1): Exp and Term have only one production; Exp' has two productions but only one is applicable at any time.
41Nonrecursive predictive parsing
- Maintain a stack explicitly, rather than implicitly via recursive calls
- Key problem during predictive parsing: determining the production to be applied for a non-terminal
42Nonrecursive predictive parsing
- Algorithm. Nonrecursive predictive parsing
- Set ip to point to the first symbol of w$.
- repeat
-   Let X be the top of the stack symbol and a the symbol pointed to by ip
-   if X is a terminal or $ then
-     if X = a then
-       pop X from the stack and advance ip
-     else error()
-   else // X is a nonterminal
-     if M[X,a] = X → Y1 Y2 … Yk then
-       pop X from the stack
-       push Yk, Yk-1, …, Y1 onto the stack with Y1 on top
-       (push nothing if Y1 Y2 … Yk is ε)
-       output the production X → Y1 Y2 … Yk
-     else error()
- until X = $
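The loop above can be sketched in Java for a small expression grammar. Several things here are assumptions rather than part of the slides: the class name `PredictiveParser`, the hand-written table `M`, the initial stack contents (the start symbol on top of `$`), and the use of the variant grammar Exp → Term Exp', Exp' → + Exp | ε, Term → id, in which `$` is only an end marker.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

class PredictiveParser {
    // Parsing table M[X][a]; a right-hand side is a list of symbols, empty for epsilon.
    static final Map<String, Map<String, List<String>>> M = Map.of(
        "Exp",  Map.of("id", List.of("Term", "Exp'")),
        "Exp'", Map.of("+", List.of("+", "Exp"), "$", List.<String>of()),
        "Term", Map.of("id", List.of("id")));

    /** Returns the productions applied, in order; throws on a syntax error. */
    static List<String> parse(List<String> tokens) {
        List<String> input = new ArrayList<>(tokens);
        input.add("$");                      // w$
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");
        stack.push("Exp");                   // start symbol on top of $
        List<String> output = new ArrayList<>();
        int ip = 0;
        String x;
        do {
            x = stack.peek();
            String a = input.get(ip);
            if (!M.containsKey(x)) {         // X is a terminal or $
                if (!x.equals(a)) throw new IllegalArgumentException("unexpected " + a);
                stack.pop();
                ip++;
            } else {                         // X is a nonterminal
                List<String> rhs = M.get(x).get(a);
                if (rhs == null) throw new IllegalArgumentException("unexpected " + a);
                stack.pop();
                // Push Yk ... Y1 so that Y1 ends up on top.
                for (int i = rhs.size() - 1; i >= 0; i--) stack.push(rhs.get(i));
                output.add(x + " -> " + (rhs.isEmpty() ? "eps" : String.join(" ", rhs)));
            }
        } while (!x.equals("$"));
        return output;
    }
}
```

For the token list `id + id` the sketch applies six productions, starting with `Exp -> Term Exp'` and ending with `Exp' -> eps`; an input such as `id id` is rejected because the table has no entry for Exp' on `id`.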
43LL(1) grammars
- No left recursion
- A → Aα: If this production is chosen, parse makes no progress.
- No common prefixes
- A → αβ | αγ
- Can fix by left factoring
- A → αA'
- A' → β | γ
44LL(1) grammars (cont.)
- No ambiguity
- Precise definition requires that production to
choose be unique (choose function M very hard
to calculate otherwise)
45Top-down Parsing
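[Figure: two snapshots of the parse tree. The root is the start symbol L with children E0 … En; the first snapshot is drawn over the input tokens <t0, t1, …, ti, ...>, the second over the remaining tokens <ti, ...>]
From left to right, grow the parse tree downwards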
46Checking LL(1)-ness
- For any sequence of grammar symbols α, define the set FIRST(α) ⊆ Σ to be
- FIRST(α) = { a | α ⇒* aβ for some β }
47LL(1) definition
- Define: Grammar G = (N, Σ, P, S) is LL(1) iff whenever there are two left-most derivations (in which the leftmost non-terminal is always expanded first)
- S ⇒* wAγ ⇒ wαγ ⇒* wtx
- S ⇒* wAγ ⇒ wβγ ⇒* wty
- it follows that α = β
- In other words, given
- 1. a string wAγ in V* and
- 2. t, the first terminal symbol to be derived from Aγ
- there is at most one production that can be applied to A to yield a derivation of any terminal string beginning with wt
- FIRST sets can often be calculated by inspection
48FIRST Sets
Exp → Term Exp'    Exp' → $ | + Exp    Term → id    (Use $ for end-of-input symbol)
FIRST($) = {$}    FIRST(+ Exp) = {+}
FIRST($) ∩ FIRST(+ Exp) = {} ⇒ grammar is LL(1)
49FIRST Sets
L → E ; L | E    E → a | b
FIRST(E ; L) = {a, b} = FIRST(E)
FIRST(E ; L) ∩ FIRST(E) ≠ {} ⇒ grammar not LL(1).
50Computing FIRST Sets
- Algorithm. Compute FIRST(X) for all grammar symbols X
- forall X ∈ V do FIRST(X) = {}
- forall X ∈ Σ (X is a terminal) do FIRST(X) = {X}
- forall productions X → ε do FIRST(X) = FIRST(X) ∪ {ε}
- repeat
-   c: forall productions X → Y1Y2…Yk do
-     forall i ∈ [1,k] do
-       FIRST(X) = FIRST(X) ∪ (FIRST(Yi) - {ε})
-       if ε ∉ FIRST(Yi) then continue c
-     FIRST(X) = FIRST(X) ∪ {ε}
- until no more terminals or ε are added to any FIRST set
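The fixed-point computation above might look as follows in Java. The class name `FirstSets`, the encoding of a production as a list `[lhs, Y1, ..., Yk]`, the spelling `eps` for ε, and the rule that any symbol never appearing as a left-hand side is a terminal are assumptions of this sketch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FirstSets {
    static final String EPS = "eps";

    /**
     * Each production is [lhs, Y1, ..., Yk]; just [lhs] means lhs -> epsilon.
     * Symbols that never occur as a left-hand side are treated as terminals.
     */
    static Map<String, Set<String>> compute(List<List<String>> productions) {
        Set<String> nonterminals = new TreeSet<>();
        for (List<String> p : productions) nonterminals.add(p.get(0));

        Map<String, Set<String>> first = new HashMap<>();
        for (List<String> p : productions) {
            for (String sym : p) {
                first.putIfAbsent(sym, new TreeSet<>());
                if (!nonterminals.contains(sym)) first.get(sym).add(sym); // FIRST(a) = {a}
            }
        }

        boolean changed = true;
        while (changed) {                       // repeat ... until nothing is added
            changed = false;
            for (List<String> p : productions) {
                Set<String> target = first.get(p.get(0));
                boolean allNullable = true;
                for (String y : p.subList(1, p.size())) {
                    // Copy before iterating, in case y is the left-hand side itself.
                    for (String t : List.copyOf(first.get(y))) {
                        if (!t.equals(EPS)) changed |= target.add(t);
                    }
                    if (!first.get(y).contains(EPS)) {
                        allNullable = false;    // "continue c" in the pseudocode
                        break;
                    }
                }
                if (allNullable) changed |= target.add(EPS);
            }
        }
        return first;
    }
}
```

For the grammar Exp → Term Exp', Exp' → ε | + Exp, Term → id, the sketch gives FIRST(Exp) = {id}, FIRST(Exp') = {+, eps}, and FIRST(Term) = {id}.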
51FIRST Sets of Strings of Symbols
- FIRST(X1X2…Xn) is the union of FIRST(X1) and all FIRST(Xi) such that ε ∈ FIRST(Xk) for k = 1, 2, …, i-1 (with ε itself left out of each FIRST(Xi))
- FIRST(X1X2…Xn) contains ε iff ε ∈ FIRST(Xk) for k = 1, 2, …, n
52FIRST Sets do not Suffice
- Given the productions
- A → T x
- A → T y
- T → w
- T → ε
- T → w should be applied when the next input token is w.
- T → ε should be applied whenever the next terminal is either x or y
53FOLLOW Sets
- For any nonterminal X, define the set FOLLOW(X) ⊆ Σ as
- FOLLOW(X) = { a | S ⇒* αXaβ }
54Computing the FOLLOW Set
- Algorithm. Compute FOLLOW(X) for all nonterminals X
- FOLLOW(S) = {$}
- forall productions A → αBβ do FOLLOW(B) = FOLLOW(B) ∪ (FIRST(β) - {ε})
- repeat
-   forall productions A → αB or A → αBβ with ε ∈ FIRST(β) do
-     FOLLOW(B) = FOLLOW(B) ∪ FOLLOW(A)
- until all FOLLOW sets remain the same
55Construction of a predictive parsing table
- Algorithm. Construction of a predictive parsing table
- M[:, :] = {}
- forall productions A → α do
-   forall terminals a ∈ FIRST(α) do
-     M[A,a] = M[A,a] ∪ {A → α}
-   if ε ∈ FIRST(α) then
-     forall b ∈ FOLLOW(A) do
-       M[A,b] = M[A,b] ∪ {A → α}
- Make all empty entries of M be error
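The table construction can be sketched in Java as below. To keep it short, the FIRST set of each right-hand side and the FOLLOW sets are supplied by the caller rather than computed; the names `LL1Table`, `Production`, `build`, and `isLL1`, and the spelling `eps` for ε, are hypothetical. Missing entries play the role of error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class LL1Table {
    static final String EPS = "eps";

    /** A production plus the FIRST set of its right-hand side (given by the caller). */
    static final class Production {
        final String lhs;
        final String rhs;
        final Set<String> firstOfRhs;

        Production(String lhs, String rhs, Set<String> firstOfRhs) {
            this.lhs = lhs;
            this.rhs = rhs;
            this.firstOfRhs = firstOfRhs;
        }
    }

    /** Builds M[A][a] as a list of productions; a cell with two entries is a conflict. */
    static Map<String, Map<String, List<String>>> build(
            List<Production> productions, Map<String, Set<String>> follow) {
        Map<String, Map<String, List<String>>> m = new TreeMap<>();
        for (Production p : productions) {
            Set<String> lookaheads = new TreeSet<>(p.firstOfRhs);
            // If the right-hand side can derive epsilon, FOLLOW(A) selects it too.
            if (lookaheads.remove(EPS)) lookaheads.addAll(follow.get(p.lhs));
            for (String a : lookaheads) {
                m.computeIfAbsent(p.lhs, k -> new TreeMap<>())
                 .computeIfAbsent(a, k -> new ArrayList<>())
                 .add(p.lhs + " -> " + p.rhs);
            }
        }
        return m;
    }

    /** With correct FIRST and FOLLOW inputs, no cell may hold two productions. */
    static boolean isLL1(Map<String, Map<String, List<String>>> m) {
        for (Map<String, List<String>> row : m.values()) {
            for (List<String> cell : row.values()) {
                if (cell.size() > 1) return false;
            }
        }
        return true;
    }
}
```

For grammar H (L → E ; L | E, E → a | b) both L-productions land in M[L, a], so `isLL1` returns false, which matches the earlier observation that H is not LL(1).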
56Another Definition of LL(1)
- Define: Grammar G is LL(1) if for every A ∈ N with productions A → α1 | . . . | αn
- FIRST(αi FOLLOW(A)) ∩ FIRST(αj FOLLOW(A)) = ∅ for all i ≠ j
57Regular Languages
- Definition. A regular grammar is one whose productions are all of the type
- A → aB
- A → a
- A Regular Expression is either
- a
- R1 | R2
- R1 R2
- R*
58Nondeterministic Finite State Automaton
[Figure: nondeterministic finite state automaton with start state 0 and states 1, 2, 3; transitions labeled a and b]
59Regular Languages
- Theorem. The classes of languages
- Generated by a regular grammar
- Expressed by a regular expression
- Recognized by a NDFS automaton
- Recognized by a DFS automaton
- coincide.
60Deterministic Finite Automaton
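[Figure: scanner automaton with states START, NUM, KEYWORD and OPERATOR; edge labels: space, tab, new line; digit; digit; letter; operator characters (-, /, (, ) and others lost in extraction)]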
61Scanner code
- state = start
- loop
-   if no input character buffered then read one, and add it to the accumulated token
-   case state of
-     start:
-       case input_char of
-         A..Z, a..z : state = id
-         0..9 : state = num
-         else ...
-       end
-     id:
-       case input_char of
-         A..Z, a..z : state = id
-         0..9 : state = id
-         else ...
-       end
-     num:
-       case input_char of
-         0..9 : ...
62Table-driven DFA
63Language Classes
[Figure: nested language classes, from outermost to innermost: L0, CSL, CFL (NPA), LR(1), LL(1), RL (DFA, NFA)]
64Question
- Are regular expressions, as provided by Perl or
other languages, sufficient for parsing nested
structures, e.g. XML files?
65Recursive Descent Parser
- stat → var = expr ;
- expr → term [+ expr]
- term → factor [* factor]
- factor → ( expr ) | var | constant
- var → identifier
66Scanner
import java.io.Reader;
import java.io.StreamTokenizer;

public class Scanner {
    private StreamTokenizer input;
    private Type lastToken;

    public enum Type { INVALID_CHAR, NO_TOKEN, PLUS,
        // etc. for remaining tokens, then
        EOF
    }

    public Scanner(Reader r) {
        input = new StreamTokenizer(r);
        input.resetSyntax();
        input.eolIsSignificant(false);
        input.wordChars('a', 'z');
        input.wordChars('A', 'Z');
        // The next three character literals were lost in extraction;
        // '+', '*' and '=' are assumed here.
        input.ordinaryChar('+');
        input.ordinaryChar('*');
        input.ordinaryChar('=');
        input.ordinaryChar('(');
        // ... (the constructor continues beyond the slide)
    }
67Scanner
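    // Returns the next token; the return type is Type so that the
    // parser can store the result in its Type-valued field.
    public Type nextToken() {
        Type token;
        try {
            switch (input.nextToken()) {
                case StreamTokenizer.TT_EOF:
                    token = Type.EOF;
                    break;
                case StreamTokenizer.TT_WORD:
                    if (input.sval.equalsIgnoreCase("false"))
                        token = Type.FALSE;
                    else if (input.sval.equalsIgnoreCase("true"))
                        token = Type.TRUE;
                    else
                        token = Type.VARIABLE;
                    break;
                case '+':
                    token = Type.PLUS;
                    break;
                // ... (remaining cases, the catch clause and the return are beyond the slide)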
68Parser
public class Parser {
    private LexicalAnalyzer lexer;
    private Type token;

    // Parses one statement followed by end of input.
    public Statement parse(Reader r) throws SyntaxException {
        lexer = new LexicalAnalyzer(r);
        nextToken(); // assigns token
        Statement stat = statement();
        expect(LexicalAnalyzer.EOF);
        return stat;
    }
69Statement
    // stat ::= variable '=' expr ';'
    private Statement statement() throws SyntaxException {
        Expr var = variable();
        expect(LexicalAnalyzer.ASSIGN);
        Expr exp = expr();
        Statement stat = new Statement(var, exp);
        expect(LexicalAnalyzer.SEMICOLON);
        return stat;
    }
70Expr
    // expr ::= term ['+' expr]
    private Expr expr() throws SyntaxException {
        Expr exp = term();
        while (token == LexicalAnalyzer.PLUS) {
            nextToken();
            exp = new Exp(exp, expr());
        }
        return exp;
    }
71Term
    // term ::= factor ['*' term]   (operator lost in extraction; '*' assumed)
    private Expr term() throws SyntaxException {
        Expr exp = factor();
        // Rest of body left as an exercise.
        return exp;
    }
72Factor
    // factor ::= '(' expr ')' | var
    private Expr factor() throws SyntaxException {
        Expr exp = null;
        if (token == LexicalAnalyzer.LEFT_PAREN) {
            nextToken();
            exp = expr();
            expect(LexicalAnalyzer.RIGHT_PAREN);
        } else {
            exp = variable();
        }
        return exp;
    }
73Variable
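    // variable ::= identifier
    private Expr variable() throws SyntaxException {
        if (token == LexicalAnalyzer.ID) {
            Expr exp = new Variable(lexer.getString());
            nextToken();
            return exp;
        }
        // The slide omits the failure case; a String constructor
        // on SyntaxException is assumed here.
        throw new SyntaxException("identifier expected");
    }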
74Constant
    private Expr constantExpression() throws SyntaxException {
        Expr exp = null;
        // Handle the various cases for constant
        // expressions: left as an exercise.
        return exp;
    }
75Utilities
    // Checks that the current token is t, then advances.
    private void expect(Type t) throws SyntaxException {
        if (token != t) { // throw SyntaxException...
        }
        nextToken();
    }

    private void nextToken() {
        token = lexer.nextToken();
    }
}