Title: Parsing
1Parsing
- Compiler
- Baojian Hua
- bjhua_at_ustc.edu.cn
2Front End
source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> semantic analyzer -> IR
3Parsing
- The parser translates the source program into abstract syntax trees
- Token sequence
  - from the lexer
- Abstract syntax trees
  - check validity of programs
  - cook compiler internal data structures for programs
- Must take into account the program syntax
4Conceptually
[Diagram: the parser takes a token sequence and the language syntax as input and produces an abstract syntax tree.]
5Syntax: Context-free Grammar
- Context-free grammars are (often) given by BNF expressions (Backus-Naur Form)
  - read Dragon sec 2.2
- More powerful than RE in theory
- Good for defining language syntax
6Context-free Grammar (CFG)
- A CFG consists of 4 components
- a set of terminals (tokens) T
- a set of nonterminals N
- a set of production rules P
  - s -> t1 t2 ... tn
  - with s ∈ N, and t1, ..., tn ∈ (T ∪ N)
- a unique start nonterminal S
7Example
// Recall the min-ML language in code3
// (simplified)
N = {decs, dec, exp}
T = {SEMICOLON, VAL, ID, ASSIGN, NUM}
S = decs
decs -> dec SEMICOLON decs
      | ε
dec  -> VAL ID ASSIGN exp
exp  -> ID
      | NUM
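The four components can be written down directly as data. The following is only a sketch: the `Grammar` class, its field names and the `check` method are illustrative assumptions, not part of the course code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    """Hypothetical container for the four components of a CFG."""
    nonterminals: frozenset
    terminals: frozenset
    productions: tuple   # pairs (head, body); body is a tuple of symbols
    start: str

    def check(self):
        """True if every rule is s -> t1 ... tn with s in N and each ti in T or N."""
        symbols = self.nonterminals | self.terminals
        heads_ok = all(head in self.nonterminals for head, _ in self.productions)
        bodies_ok = all(sym in symbols for _, body in self.productions for sym in body)
        disjoint = not (self.nonterminals & self.terminals)
        return heads_ok and bodies_ok and disjoint and self.start in self.nonterminals


MIN_ML = Grammar(
    nonterminals=frozenset({"decs", "dec", "exp"}),
    terminals=frozenset({"SEMICOLON", "VAL", "ID", "ASSIGN", "NUM"}),
    productions=(
        ("decs", ("dec", "SEMICOLON", "decs")),
        ("decs", ()),                              # empty body
        ("dec", ("VAL", "ID", "ASSIGN", "exp")),
        ("exp", ("ID",)),
        ("exp", ("NUM",)),
    ),
    start="decs",
)

print(MIN_ML.check())
```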
8Derivation
- A derivation
  - starts with the unique start nonterminal S
  - repeatedly replaces a nonterminal s on the right-hand side by the body of one of the production rules of s
  - stops when the right-hand side consists only of terminals
- The final string consists of terminals only and is called a sentence (program)
9Example
decs -> dec SEMICOLON decs
      | ε
dec  -> VAL ID ASSIGN exp
exp  -> ID
      | NUM
Derive me: val x = 5; val y = x;
decs -> ... (a choice)
10Example
decs -> dec SEMICOLON decs
      | ε
dec  -> VAL ID ASSIGN exp
exp  -> ID
      | NUM
Derive me: val x = 5; val y = x;
decs -> dec SEMICOLON decs
     -> VAL ID ASSIGN exp SEMICOLON decs
     -> VAL ID ASSIGN NUM SEMICOLON decs
     -> VAL ID ASSIGN NUM SEMICOLON dec SEMICOLON decs
     -> ...
     -> VAL ID ASSIGN NUM SEMICOLON VAL ID ASSIGN ID SEMICOLON decs
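A derivation like this one can be replayed mechanically. The sketch below assumes the grammar is stored as a dictionary (the `GRAMMAR` name and the list of rule choices are illustrative) and always rewrites the leftmost nonterminal; it also takes the last step, decs -> ε.

```python
GRAMMAR = {
    "decs": [["dec", "SEMICOLON", "decs"], []],   # second alternative is empty
    "dec": [["VAL", "ID", "ASSIGN", "exp"]],
    "exp": [["ID"], ["NUM"]],
}


def leftmost_derivation(start, choices):
    """Replay a derivation; choices[i] selects the alternative used at step i."""
    form = [start]
    steps = [list(form)]
    for choice in choices:
        # Locate the leftmost nonterminal of the current sentential form.
        pos = next(i for i, sym in enumerate(form) if sym in GRAMMAR)
        form[pos:pos + 1] = GRAMMAR[form[pos]][choice]
        steps.append(list(form))
    return steps


steps = leftmost_derivation("decs", [0, 0, 1, 0, 0, 0, 1])
for form in steps:
    print("->", " ".join(form))
```

The last line printed is the token sequence of `val x = 5; val y = x;`, which contains terminals only.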
11Another Way to Derive the same Program
decs -> dec SEMICOLON decs
      | ε
dec  -> VAL ID ASSIGN exp
exp  -> ID
      | NUM
Derive me: val x = 5; val y = x;
decs -> dec SEMICOLON decs
     -> dec SEMICOLON dec SEMICOLON decs
     -> ...
12Derivation
- For the same string, there may exist many derivations
  - left-most derivation
  - right-most derivation
- Parsing is the problem of taking a string of terminals and figuring out whether it could be derived from a CFG
  - error-detection
13Parse Trees
- Derivations can also be represented as trees
  - useful to understand AST (discussed later)
- Idea
  - each internal node is labeled with a nonterminal
  - each leaf node is labeled with a terminal
  - each use of a rule in a derivation explains how to generate the children of a node in the parse tree from their parent
14Example
decs -> dec SEMICOLON decs
      | ε
dec  -> VAL ID ASSIGN exp
exp  -> ID
      | NUM
Derive me: val x = 5; val y = x;
Parse tree (children are indented under their parent):
decs
  dec
    VAL
    ID (x)
    ASSIGN
    exp
      NUM (5)
  SEMI
  decs
    dec (similar case)
    SEMI
    decs
15Different Derivations, same Tree
Derive me: val x = 5; val y = x;
decs -> dec SEMICOLON decs -> VAL ID ASSIGN exp SEMICOLON decs -> ...
decs -> dec SEMICOLON decs -> dec SEMICOLON dec SEMICOLON decs -> ...
[Figure: both derivations build the same parse tree, the one shown on slide 14.]
16Parse Tree has Meanings: post-order traversal
Derive me: val x = 5; val y = x;
decs -> dec SEMICOLON decs -> VAL ID ASSIGN exp SEMICOLON decs -> ...
decs -> dec SEMICOLON decs -> dec SEMICOLON dec SEMICOLON decs -> ...
[Figure: the parse tree of slide 14, to be read by a post-order traversal.]
17Ambiguous Grammars
- A grammar is ambiguous if the same sequence of
tokens can give rise to two or more different
parse trees
18Example
exp -> num
     | id
     | exp + exp
     | exp * exp
Derive me: 3 + 4 * 5
exp -> exp + exp -> 3 + exp -> 3 + exp * exp -> 3 + 4 * exp -> 3 + 4 * 5
exp -> exp * exp -> exp + exp * exp -> 3 + exp * exp -> 3 + 4 * exp -> 3 + 4 * 5
19Example
exp -> num
     | id
     | exp + exp
     | exp * exp
exp -> exp + exp -> 3 + exp -> 3 + exp * exp -> 3 + 4 * exp -> 3 + 4 * 5
Parse tree (children are indented under their parent):
exp
  exp
    3
  +
  exp
    exp
      4
    *
    exp
      5
exp -> exp * exp -> exp + exp * exp -> 3 + exp * exp -> 3 + 4 * exp -> 3 + 4 * 5
Parse tree:
exp
  exp
    exp
      3
    +
    exp
      4
  *
  exp
    5
20Ambiguous Grammars
- Problem: compilers make use of parse trees to interpret the meaning of parsed programs
  - different parse trees have different meanings
  - e.g., 4 + 5 * 6 is not (4 + 5) * 6
  - languages with ambiguous grammars are DISASTROUS: the meaning of programs isn't well-defined! You can't tell what your program might do!
- Solution: rewrite the grammar to an equivalent unambiguous form
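The difference in meaning is easy to observe by evaluating both trees for 4 + 5 * 6 with a post-order traversal. This is a sketch under a simplifying assumption: the trees keep only operators and numbers (nested tuples), not every nonterminal node of the parse tree.

```python
# A tree is either an int leaf or a tuple (operator, left, right).
def evaluate(tree):
    """Post-order traversal: evaluate both children, then apply the operator."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    lhs, rhs = evaluate(left), evaluate(right)
    return lhs + rhs if op == "+" else lhs * rhs


def leaves(tree):
    """In-order walk that recovers the token sequence of the tree."""
    if isinstance(tree, int):
        return [str(tree)]
    op, left, right = tree
    return leaves(left) + [op] + leaves(right)


tree_a = ("+", 4, ("*", 5, 6))   # 4 + (5 * 6)
tree_b = ("*", ("+", 4, 5), 6)   # (4 + 5) * 6

print(" ".join(leaves(tree_a)), "=", evaluate(tree_a))
print(" ".join(leaves(tree_b)), "=", evaluate(tree_b))
```

Both trees have the same leaves, yet they evaluate to 34 and 54 respectively.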
21Eliminating ambiguity
- In programming language syntax, ambiguity often arises from missing operator precedence or associativity
  - * is of higher precedence than +
  - both + and * are left-associative
  - Why or why not?
- Rewrite the grammar to take account of this
22Example
Left grammar:
exp -> num
     | id
     | exp + exp
     | exp * exp
Right grammar:
exp    -> exp + term
        | term
term   -> term * factor
        | factor
factor -> num
        | id
Q: is the right grammar ambiguous? Why or why not?
23Parser
- A program to check whether a program is derivable from a given grammar
  - expensive in general
  - must be fast
    - to compile 2000k lines of kernel code
    - even for small application code
- Theorists have developed specialized kinds of grammars which may be parsed efficiently
  - LL(k) and LR(k)
24Predictive parsing
- A.K.A. recursive descent parsing, top-down parsing
  - simple to code by hand
  - efficient
  - can parse a large set of grammars
- Key idea
  - one (recursive) function for each nonterminal
  - one clause for each production rule (right-hand side)
25Example
decs -> dec SEMICOLON decs
      | ε
dec  -> VAL ID ASSIGN exp
exp  -> ID
      | NUM

(* step 1: represent tokens *)
datatype token
  = Val | Id of string | Num of int | Assign | Semicolon | Eof

(* step 2: connect with lexer *)
val current = ref (getToken ())
fun advance () = current := getToken ()
fun eat (t: token) =
  if !current = t
  then advance ()
  else error ("want ", t, ", but got ", !current)
26Example (continued)
decs -> dec SEMICOLON decs
      | ε
dec  -> VAL ID ASSIGN exp
exp  -> ID
      | NUM

(* step 1: represent tokens *)
datatype token
  = Val | Id of string | Num of int | Assign | Semi | Eof

(* step 2: connect with lexer *)
val current = ref (getToken ())
fun advance () = current := getToken ()
fun eat (t: token) = ...

(* step 3: build the parser *)
fun parseDecs () =
  case !current
   of Val => (parseDec (); eat Semi; parseDecs ())
    | Eof => ()
    | _   => error "want VAL or EOF"
and parseDec () = ...
and parseExp () = ...
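For readers who prefer to run the idea, here is the same three-step structure in Python. It is a sketch, not the course's parser: tokens are plain strings, and the bodies of `parse_dec` and `parse_exp` (left open on the slide) are filled in as an assumption directly from the grammar rules.

```python
class ParseError(Exception):
    pass


class Parser:
    def __init__(self, tokens):
        # tokens: list of token kinds such as "VAL"; "EOF" is appended as a sentinel.
        self.tokens = list(tokens) + ["EOF"]
        self.pos = 0

    @property
    def current(self):
        return self.tokens[self.pos]

    def eat(self, kind):
        if self.current != kind:
            raise ParseError(f"want {kind}, but got {self.current}")
        self.pos += 1

    def parse_decs(self):
        # decs -> dec SEMICOLON decs | (empty)
        if self.current == "VAL":
            self.parse_dec()
            self.eat("SEMICOLON")
            self.parse_decs()
        elif self.current == "EOF":
            return
        else:
            raise ParseError(f"want VAL or EOF, but got {self.current}")

    def parse_dec(self):
        # dec -> VAL ID ASSIGN exp
        self.eat("VAL")
        self.eat("ID")
        self.eat("ASSIGN")
        self.parse_exp()

    def parse_exp(self):
        # exp -> ID | NUM
        if self.current in ("ID", "NUM"):
            self.eat(self.current)
        else:
            raise ParseError(f"want ID or NUM, but got {self.current}")


def accepts(tokens):
    """True if the token sequence is derivable from decs."""
    try:
        Parser(tokens).parse_decs()
        return True
    except ParseError:
        return False


print(accepts("VAL ID ASSIGN NUM SEMICOLON VAL ID ASSIGN ID SEMICOLON".split()))
print(accepts("VAL ID ASSIGN NUM".split()))
```

The first call accepts the tokens of `val x = 5; val y = x;`; the second is rejected because the SEMICOLON is missing.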
27Moral
- The key point in predictive parsing is to determine the production rule to use (the recursive function to call)
  - must know the start symbols of each rule
  - start symbols must not overlap
  - ex: exp -> NUM | ID
- This motivates the idea of first and follow sets
28Moral
S -> w1
   | w2
   | ...
   | wn
- Current nonterminal is S, and the current input token is t
  - if wk starts with t, then choose wk, or
  - if wk derives the empty string, and the string following S starts with t, then choose wk
- First symbol sets of wi (1 <= i <= n) don't overlap, to avoid backtracking
29Nullable, First and Follow sets
- To use predictive parsing, we must compute
  - Nullable: nonterminals that derive the empty string
  - First(γ): set of terminals that can begin any string derivable from γ
  - Follow(X): set of terminals that can immediately follow any string derivable from nonterminal X
- Read Dragon sec 4.4.2 and Tiger sec 3.2
- Fixpoint algorithms
30Nullable, First and Follow sets
Z -> d
   | X Y Z
Y -> c
   | ε
X -> Y
   | a
- Which of the symbols X, Y and Z can derive the empty string?
- What terminals may the strings derived from X, Y and Z begin with?
- What terminals may follow X, Y and Z?
31Nullable
- X can derive the empty string iff
  - base case
    - X -> ε
  - inductive case
    - X -> Y1 ... Yn
    - Y1, ..., Yn are n nonterminals and may all derive the empty string
32Computing Nullable
Nullable <- {}
while (Nullable still changes)
  for (each production X -> γ)
    switch (γ)
    case ε:
      Nullable ∪= {X}
      break
    case Y1 ... Yn:
      if (Y1 ∈ Nullable && ... && Yn ∈ Nullable)
        Nullable ∪= {X}
      break
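The fixpoint loop translates almost line by line into Python. The sketch below assumes productions are stored as (head, body) pairs and uses the grammar Z -> d | X Y Z, Y -> c | ε, X -> Y | a; the names `PRODUCTIONS` and `compute_nullable` are illustrative.

```python
# Each production is (head, body); an empty body stands for the empty string.
PRODUCTIONS = [
    ("Z", ["d"]),
    ("Z", ["X", "Y", "Z"]),
    ("Y", ["c"]),
    ("Y", []),
    ("X", ["Y"]),
    ("X", ["a"]),
]


def compute_nullable(productions):
    nullable = set()
    changed = True
    while changed:                      # iterate until nothing is added in a round
        changed = False
        for head, body in productions:
            # all() over an empty body is True, which covers the base case;
            # a terminal is never in the set, so bodies with terminals fail.
            if head not in nullable and all(sym in nullable for sym in body):
                nullable.add(head)
                changed = True
    return nullable


print(sorted(compute_nullable(PRODUCTIONS)))
```

The loop stops after the first round in which nothing changes, which is what "fixpoint" means here.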
33Example Nullables
Z -> d | X Y Z      Y -> c | ε      X -> Y | a
Round      0    1    2
Nullable   {}
34Example Nullables
Z -> d | X Y Z      Y -> c | ε      X -> Y | a
Round      0    1        2
Nullable   {}   {Y, X}
35Example Nullables
Z -> d | X Y Z      Y -> c | ε      X -> Y | a
Round      0    1        2
Nullable   {}   {Y, X}   {Y, X}
36First(X)
- Set of terminals that X begins with
  - X =>* a ...
- Rules
  - base case
    - X -> a ...
    - First(X) ∪= {a}
  - inductive case
    - X -> Y1 Y2 ... Yn
    - First(X) ∪= First(Y1)
    - if Y1 ∈ Nullable, First(X) ∪= First(Y2)
    - if Y1, Y2 ∈ Nullable, First(X) ∪= First(Y3)
    - ...
37Computing First
// Suppose Nullable has been computed
First(X) <- {}   // for each X
while (First still changes)
  for (each production X -> γ)
    switch (γ)
    case a ...:
      First(X) ∪= {a}
      break
    case Y1 ... Yn:
      First(X) ∪= First(Y1)
      if (Y1 ∉ Nullable)
        break
      First(X) ∪= First(Y2)
      ...   // similar to the above, for Y3 ... Yn
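A runnable version of this loop might look as follows. It is a sketch under the same assumptions as the pseudocode: Nullable is already known (it is given here as a constant), and the example grammar is Z -> d | X Y Z, Y -> c | ε, X -> Y | a. The names are illustrative.

```python
PRODUCTIONS = [
    ("Z", ["d"]),
    ("Z", ["X", "Y", "Z"]),
    ("Y", ["c"]),
    ("Y", []),
    ("X", ["Y"]),
    ("X", ["a"]),
]
NULLABLE = {"X", "Y"}   # assumed to have been computed beforehand


def compute_first(productions, nullable):
    nonterminals = {head for head, _ in productions}
    first = {x: set() for x in nonterminals}
    changed = True
    while changed:
        changed = False
        for head, body in productions:
            for sym in body:
                # A terminal contributes itself; a nonterminal its First set.
                addition = first[sym] if sym in nonterminals else {sym}
                if not addition <= first[head]:
                    first[head] |= addition
                    changed = True
                if sym not in nullable:
                    break               # later symbols cannot begin the string
    return first


first = compute_first(PRODUCTIONS, NULLABLE)
for name in sorted(first):
    print(f"First({name}) =", sorted(first[name]))
```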
38Example First
Z -> d | X Y Z      Y -> c | ε      X -> Y | a
Nullable = {X, Y}
Round      0    1        2           3
First(Z)   {}
First(Y)   {}
First(X)   {}
39Example First
Z -> d | X Y Z      Y -> c | ε      X -> Y | a
Nullable = {X, Y}
Round      0    1        2           3
First(Z)   {}   {d}
First(Y)   {}   {c}
First(X)   {}   {c, a}
40Example First
Z -> d | X Y Z      Y -> c | ε      X -> Y | a
Nullable = {X, Y}
Round      0    1        2           3
First(Z)   {}   {d}      {d, c, a}
First(Y)   {}   {c}      {c}
First(X)   {}   {c, a}   {c, a}
41Example First
Z -> d | X Y Z      Y -> c | ε      X -> Y | a
Nullable = {X, Y}
Round      0    1        2           3
First(Z)   {}   {d}      {d, c, a}   {d, c, a}
First(Y)   {}   {c}      {c}         {c}
First(X)   {}   {c, a}   {c, a}      {c, a}
42Parsing with First
Production     First of the right-hand side
Z -> d         {d}
Z -> X Y Z     {a, c, d}
Y -> c         {c}
Y -> ε         {}
X -> Y         {c}
X -> a         {a}
Now consider this string: d
Suppose we choose the production Z -> X Y Z. But we get stuck at X -> Y | a: neither can accept d! Why?
First(Z) = {d, c, a}
First(Y) = {c}
First(X) = {c, a}
Nullable = {X, Y}
43Follow(X)
- Set of terminals that may follow X
  - S =>* ... X a ...
- Rules
  - base case
    - Follow(X) = {}
  - inductive case
    - Y -> β1 X β2
    - Follow(X) ∪= First(β2)
    - if β2 is Nullable, Follow(X) ∪= Follow(Y)
44Computing Follow(X)
Follow(X) <- {}   // for each X
while (Follow still changes)
  for (each production Y -> β1 X β2)
    Follow(X) ∪= First(β2)
    if (β2 is Nullable)
      Follow(X) ∪= Follow(Y)
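The Follow loop can be sketched in the same style. The assumptions are stated as constants: `NULLABLE` and `FIRST` hold the results for the grammar Z -> d | X Y Z, Y -> c | ε, X -> Y | a, and `first_of_string` is a hypothetical helper that extends First from single symbols to strings. As in the rules above, no end-of-file marker is added to Follow of the start symbol.

```python
PRODUCTIONS = [
    ("Z", ["d"]),
    ("Z", ["X", "Y", "Z"]),
    ("Y", ["c"]),
    ("Y", []),
    ("X", ["Y"]),
    ("X", ["a"]),
]
NULLABLE = {"X", "Y"}
FIRST = {"Z": {"d", "c", "a"}, "Y": {"c"}, "X": {"c", "a"}}


def first_of_string(symbols):
    """Return (First set, is-nullable) for a string of grammar symbols."""
    result = set()
    for sym in symbols:
        result |= FIRST.get(sym, {sym})     # a terminal's First set is itself
        if sym not in NULLABLE:
            return result, False
    return result, True                     # every symbol was nullable


def compute_follow(productions):
    follow = {head: set() for head, _ in productions}
    changed = True
    while changed:
        changed = False
        for head, body in productions:
            for i, sym in enumerate(body):
                if sym not in follow:       # terminals have no Follow set
                    continue
                rest_first, rest_nullable = first_of_string(body[i + 1:])
                addition = rest_first | (follow[head] if rest_nullable else set())
                if not addition <= follow[sym]:
                    follow[sym] |= addition
                    changed = True
    return follow


follow = compute_follow(PRODUCTIONS)
for name in sorted(follow):
    print(f"Follow({name}) =", sorted(follow[name]))
```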
45Example Follow
Z -> d | X Y Z      Y -> c | ε      X -> Y | a
Nullable = {X, Y}
     First        Follow, round 0   1           2           3
Z    {d, c, a}    {}
Y    {c}          {}
X    {c, a}       {}
46Example Follow
Z -> d | X Y Z      Y -> c | ε      X -> Y | a
Nullable = {X, Y}
     First        Follow, round 0   1           2           3
Z    {d, c, a}    {}                {}
Y    {c}          {}                {d, c, a}
X    {c, a}       {}                {d, c, a}
47Example Follow
Z -> d | X Y Z      Y -> c | ε      X -> Y | a
Nullable = {X, Y}
     First        Follow, round 0   1           2           3
Z    {d, c, a}    {}                {}          {}
Y    {c}          {}                {d, c, a}   {d, c, a}
X    {c, a}       {}                {d, c, a}   {d, c, a}
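48Predictive Parsing Table
- With Nullable, First(), and Follow(), we can make a parsing table P(N, T)
  - each entry contains a set of productions
[Table sketch: the columns are the terminals t1, t2, t3, t4 (EOF); the rows are the nonterminals N1, N2, N3; a few entries hold productions such as ri, rk, rj.]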
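49Predictive Parsing Table
- For each rule X -> γ
  - for each a ∈ First(γ), add X -> γ to P(X, a)
  - if γ is nullable, add X -> γ to P(X, b) for each b ∈ Follow(X)
  - all other entries are error
[Table sketch: the columns are the terminals t1, t2, t3, t4 (EOF); the rows are the nonterminals N1, N2, N3; a few entries hold productions such as r1, rk, ri.]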
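The table construction can be sketched as follows, assuming Nullable, First and Follow are already available (they are given as constants for the grammar Z -> d | X Y Z, Y -> c | ε, X -> Y | a). The helper names and the dictionary layout of the table are illustrative choices.

```python
from collections import defaultdict

PRODUCTIONS = [
    ("Z", ["d"]),
    ("Z", ["X", "Y", "Z"]),
    ("Y", ["c"]),
    ("Y", []),
    ("X", ["Y"]),
    ("X", ["a"]),
]
NULLABLE = {"X", "Y"}
FIRST = {"Z": {"d", "c", "a"}, "Y": {"c"}, "X": {"c", "a"}}
FOLLOW = {"Z": set(), "Y": {"a", "c", "d"}, "X": {"a", "c", "d"}}


def first_of_string(symbols):
    """Return (First set, is-nullable) for a string of grammar symbols."""
    result = set()
    for sym in symbols:
        result |= FIRST.get(sym, {sym})
        if sym not in NULLABLE:
            return result, False
    return result, True


def build_table(productions):
    # (nonterminal, terminal) -> list of productions; a missing key is an error entry.
    table = defaultdict(list)
    for head, body in productions:
        body_first, body_nullable = first_of_string(body)
        lookaheads = body_first | (FOLLOW[head] if body_nullable else set())
        for terminal in lookaheads:
            table[(head, terminal)].append((head, body))
    return table


def has_conflict(table):
    """True if some entry holds more than one production."""
    return any(len(entries) > 1 for entries in table.values())


table = build_table(PRODUCTIONS)
for key in sorted(table):
    print(key, [" ".join(body) or "ε" for _, body in table[key]])
print("conflicts:", has_conflict(table))
```

For this grammar several entries, for example P(Z, d), receive two productions, so a predictive parser cannot decide which rule to apply from one token of lookahead.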
50Example Predictive Parsing Table
Z -> d | X Y Z      Y -> c | ε      X -> Y | a
Nullable = {X, Y}
     a                 c                 d
Z    Z -> X Y Z        Z -> X Y Z        Z -> d, Z -> X Y Z
Y    Y -> ε            Y -> c, Y -> ε    Y -> ε
X    X -> Y, X -> a    X -> Y            X -> Y

     First        Follow
X    {c, a}       {c, d, a}
Y    {c}          {c, d, a}
Z    {d, c, a}    {}
52LL(1)
- A context-free grammar is called LL(1) if it can be parsed this way
  - Left-to-right parsing
  - Leftmost derivation
  - 1 token lookahead
- This means that in the predictive parsing table, there is at most one production in every entry
53Speeding up set Construction
- All these sets (Nullable, First, Follow) can be computed simultaneously
  - see Tiger algorithm 3.13
- Order the computation
  - What's the optimal order to compute these sets?
54Example Speeding up set Construction
Z -> d | X Y Z      Y -> c | ε      X -> Y | a
Nullable = {X, Y}
Round      0    1    2    3
First(Z)
First(Y)
First(X)
Q1: What's a reasonable order here?
Q2: How to set this order?
55Directed Graph Model
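Z -> d | X Y Z      Y -> c | ε      X -> Y | a
Nullable = {X, Y}
[Figure: directed graph over the nodes Z, X and Y, annotated with First(Z) = {d, c, a}, First(X) = {c, a}, First(Y) = {c}.]
Q1: What's a reasonable order here?
Q2: How to set this order?
Order: Y X Z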
56Reverse Topological Sort
- Quasi-topological sort the directed graph
  - Quasi: topo-sorting a general directed graph is impossible (it may contain cycles)
  - also known as reverse depth-first ordering
- Reverse: information (First) flows from successors to predecessors
- Refer to your favorite algorithm book
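One way to obtain such an order is a depth-first search that emits each node only after all of its successors. The sketch below assumes a dependency graph for the grammar Z -> d | X Y Z, Y -> c | ε, X -> Y | a in which an edge from A to B means "First(A) needs First(B)"; the edge list is read off the grammar for illustration and is not taken from the slide's figure.

```python
# Edge A -> B: computing First(A) needs First(B).
DEPENDS_ON = {
    "Z": ["X", "Y", "Z"],   # Z -> X Y Z, where X and Y are nullable
    "X": ["Y"],             # X -> Y
    "Y": [],
}


def depth_first_postorder(graph, roots):
    """List nodes so that successors come before predecessors where possible."""
    order, visited = [], set()

    def visit(node):
        if node in visited:     # also stops at cycles such as Z -> Z
            return
        visited.add(node)
        for successor in graph[node]:
            visit(successor)
        order.append(node)      # emitted only after all its successors

    for root in roots:
        visit(root)
    return order


print(depth_first_postorder(DEPENDS_ON, ["Z"]))   # ['Y', 'X', 'Z']
```

Processing the nonterminals in this order lets each First set see the final value of the sets it depends on, except along cycles, where the fixpoint iteration is still needed.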
57Problem
- LL(1) can only be used with grammars in which the production rules for a nonterminal all start with different terminals
- Unfortunately, many grammars don't have this perfect property
58Example
Left grammar:
exp -> num
     | id
     | exp + exp
     | exp * exp
Right grammar:
exp    -> exp + term
        | term
term   -> term * factor
        | factor
factor -> num
        | id
Q: is the right grammar LL(1)? Why or why not?
59Solutions
- Left-recursion elimination
- Left-factoring
- Read
  - Dragon sec 4.3.2, 4.3.3, 4.3.4
  - Tiger sec 3.2
60Example
Rewritten grammar:
exp    -> term exp'
exp'   -> + term exp'
        | ε
term   -> factor term'
term'  -> * factor term'
        | ε
factor -> num
        | id
Original grammar:
exp    -> exp + term
        | term
term   -> term * factor
        | factor
factor -> num
        | id
Q: is the right grammar LL(1)? Are those two grammars equivalent?
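The rewrite from the left-recursive rules to the primed ones follows a fixed pattern for immediate left recursion: A -> A α | β becomes A -> β A' and A' -> α A' | ε. A minimal sketch of that pattern, with a hypothetical function name and a list-based representation of rule bodies:

```python
def eliminate_left_recursion(head, bodies):
    """Rewrite A -> A a1 | ... | b1 | ... as A -> b1 A' | ..., A' -> a1 A' | ... | (empty)."""
    recursive = [body[1:] for body in bodies if body and body[0] == head]
    others = [body for body in bodies if not body or body[0] != head]
    if not recursive:
        return {head: bodies}           # nothing to do
    fresh = head + "'"                  # new nonterminal, e.g. exp'
    return {
        head: [body + [fresh] for body in others],
        fresh: [tail + [fresh] for tail in recursive] + [[]],   # [] is the empty body
    }


exp_rules = eliminate_left_recursion("exp", [["exp", "+", "term"], ["term"]])
term_rules = eliminate_left_recursion("term", [["term", "*", "factor"], ["factor"]])
print(exp_rules)
print(term_rules)
```

This handles only immediate left recursion (a rule whose body starts with its own head); indirect left recursion needs the more general algorithm described in the textbooks.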
61LL(k)
- LL(1) can be further generalized to LL(k)
- Left-to-right parsing
- Leftmost derivation
- k token lookahead
- Q: table size? Other problems with this approach?
62Summary
- Context-free grammar is a math tool for specifying language syntax
  - and others
- Writing parsers for general grammars is hard and costly
  - LL(k) and LR(k)
- LL(1) grammars can be implemented efficiently
  - table-driven algorithms (again!)