Title: Compilers Modern Compiler Design
1Compilers: Modern Compiler Design
- Introduction to parsing methods
- Creating a top-down parser manually
- Creating a top-down parser automatically
- Creating a bottom-up parser automatically
- Parser generator tools
NCYU C. H. Wang
2Introduction
- Context-free grammar
- The syntax of programming language constructs can
be described by a context-free grammar.
- Important aspects
- A grammar serves to impose a structure on the
linear sequence of tokens that is the program.
- Using techniques from the field of formal
languages, a parser can be constructed
automatically from a grammar.
- Grammars help programmers write syntactically
correct programs and provide answers to detailed
questions about the syntax.
3Definitions of CFG
- A context-free grammar consists of terminals,
nonterminals, a start symbol and productions.
- Terminals are the basic symbols from which
strings are formed.
- Nonterminals are syntactic variables that denote
sets of strings.
- In a grammar, one nonterminal is distinguished as
the start symbol, and the set of strings it
denotes is the language defined by the grammar.
- The productions of a grammar specify the manner
in which the terminals and nonterminals can be
combined to form strings. Each production
consists of a nonterminal, followed by an arrow,
followed by a string of nonterminals and terminals.
4The role of the parser
5Two approaches
- Deterministic left-to-right top-down
- LL method
- Deterministic left-to-right bottom-up
- LR method
- Left-to-right
- The sequence of tokens is processed from left to
right.
- Deterministic
- No searching is involved: each token brings the
parser one step closer to the goal of
constructing the syntax tree.
6Speed issue
- The deterministic parsing methods require an
amount of time that is a linear function of the
length of the input: they are linear-time methods.
- A grammar copied as is from a language manual
has a very small chance of leading to a
deterministic method, unless of course the
language designer has taken pains to make the
grammar match such a method.
- Allowing some searching to take place
- The algorithm can then handle all grammars.
- These algorithms are no longer linear-time.
7Non-ambiguous
- A grammar for which a deterministic parser can be
generated is guaranteed to be non-ambiguous.
- Since an arbitrary grammar will often fail to
match one of the standard parsing methods, it is
important to have techniques to transform the
grammar to non-ambiguous form.
- We will assume that the grammar of the
programming language is non-ambiguous.
- That implies that to each input program there
belongs either one syntax tree or no syntax tree
(the program contains one or more errors).
8Two classes of parsing methods
9Pre-order and post-order (1)
- The top-down method constructs the syntax tree in
pre-order.
- The bottom-up method constructs the syntax tree
in post-order.
10Pre-order and post-order (2)
11Principles of top-down parsing
- The main task of a top-down parser is to choose
the correct alternatives for known non-terminals
12Principles of bottom-up parsing
- The main task of a bottom-up parser is to
repeatedly find the first node all of whose
children have already been constructed.
13Error detection and error recovery
- The position at which the error is detected may be
unrelated to the position of the actual error the
user made.
- Example
- x = a(p+q( - b(r-s)
14Error recovery
- Two strategies
- Error correction
- Modifies the input token stream and/or the
parser's internal state so that parsing can
continue.
- Non-correcting error recovery
- Does not modify the input stream, but rather
discards all parser information and continues
parsing the rest of the program with a grammar
for the rest of the program (called a suffix grammar).
15Creating a top-down parser manually
- Recursive descent parsing
- Simplest way but has its limitations
16Recursive descent parsing program (1)
17Recursive descent parsing program (2)
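The program shown on these two slides did not survive extraction. As a stand-in, here is a minimal sketch of a recursive descent recognizer for the running expression grammar (expression → term rest_expression, term → IDENTIFIER | '(' expression ')', rest_expression → '+' expression | ε). The one-character token encoding ('i' for IDENTIFIER, the string terminator for EOF) and the names `token` and `parse_input` are assumptions made for this illustration, not the code from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed token encoding: one char per token, 'i' = IDENTIFIER,
   the terminating '\0' plays the role of EOF. */
static const char *input;            /* points at the current token */

/* Consume the current token if it is the expected one. */
static int token(char expected) {
    if (*input != expected) return 0;
    if (expected != '\0') input++;   /* never step past EOF */
    return 1;
}

static int expression(void);

/* parenthesized_expression -> '(' expression ')' */
static int parenthesized_expression(void) {
    return token('(') && expression() && token(')');
}

/* term -> IDENTIFIER | parenthesized_expression */
static int term(void) {
    return token('i') || parenthesized_expression();
}

/* rest_expression -> '+' expression | empty */
static int rest_expression(void) {
    if (token('+')) return expression();
    return 1;                        /* the empty alternative always succeeds */
}

/* expression -> term rest_expression */
static int expression(void) {
    return term() && rest_expression();
}

/* input -> expression EOF; returns 1 if text is accepted, 0 otherwise. */
int parse_input(const char *text) {
    assert(text != NULL);
    input = text;
    return expression() && token('\0');
}
```

Each non-terminal becomes one routine that tries its alternatives in order, which is exactly why the order of alternatives matters in the drawbacks discussed next. A call such as `parse_input("i+(i+i)")` returns 1, while `parse_input("i+")` returns 0.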
18Drawbacks
- Three drawbacks
- There is still some searching through the
alternatives.
- The method often fails to produce a correct
parser.
- Error handling leaves much to be desired.
19Second problems (1)
- Example 1
- Index_element will never be tried
- IDENTIFIER
20Second problems (2)
- Example 2
- The recognizer will not recognize ab
21Second problems (3)
- Example 3
- Recursive descent parsers cannot handle
left-recursive grammars
22Creating a top-down parser automatically
- The principles of constructing a top-down parser
automatically derive from those of writing one by
hand, by applying precomputation.
- Grammars which allow the construction of a
top-down parser to be performed are called LL(1)
grammars.
23LL(1) parsing
- FIRST set
- The sets of first tokens produced by all
alternatives in the grammar.
- We have to precompute the FIRST sets of all
non-terminals.
- The FIRST sets of the terminals are obvious.
- Finding FIRST(α) is trivial when α starts with a
terminal.
- FIRST(N) is the union of the FIRST sets of its
alternatives.
24Predictive recursive descent parser
- The FIRST sets can be used in the construction of
a predictive parser, so called because it predicts
the presence of a given alternative without trying
to find out if it is there.
25Closure algorithm for computing the FIRST set (1)
26Closure algorithm for computing the FIRST set (2)
27Closure algorithm for computing the FIRST set (3)
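The algorithm figures for these slides were lost in extraction. The sketch below shows one common way to phrase the closure computation: start with the obvious FIRST sets of the terminals and repeat the inference "FIRST(N) includes FIRST of each alternative of N" until nothing changes. The grammar encoding, the bit-set representation and the names `first_of_sequence` and `compute_first` are assumptions chosen for this illustration.

```c
#include <assert.h>

enum { IDENT, LPAREN, RPAREN, PLUS, END, N_TERM,          /* terminals */
       INPUT = N_TERM, EXPR, TERM, PAREN, REST, N_SYM };  /* non-terminals */
#define BIT(t)  (1u << (t))
#define EPS     BIT(N_TERM)     /* marker bit: "can produce the empty string" */
#define MAX_RHS 3

struct rule { int lhs; int len; int rhs[MAX_RHS]; };

/* The running expression grammar, one entry per alternative. */
static const struct rule grammar[] = {
    { INPUT, 2, { EXPR, END } },
    { EXPR,  2, { TERM, REST } },
    { TERM,  1, { IDENT } },
    { TERM,  1, { PAREN } },
    { PAREN, 3, { LPAREN, EXPR, RPAREN } },
    { REST,  2, { PLUS, EXPR } },
    { REST,  0, { 0 } },                    /* the empty alternative */
};
#define N_RULES ((int)(sizeof grammar / sizeof grammar[0]))

unsigned first[N_SYM];          /* FIRST set of every symbol, as a bit set */

/* FIRST of a sequence of symbols, using the sets computed so far. */
unsigned first_of_sequence(const int *syms, int len) {
    unsigned set = 0;
    assert(len >= 0 && len <= MAX_RHS);
    for (int i = 0; i < len; i++) {
        set |= first[syms[i]] & ~EPS;
        if (!(first[syms[i]] & EPS)) return set;  /* symbol is not nullable */
    }
    return set | EPS;           /* every symbol (or none at all) was nullable */
}

/* Closure: apply the inference rule until no FIRST set changes any more. */
void compute_first(void) {
    int changed = 1;
    for (int s = 0; s < N_SYM; s++) first[s] = 0;
    for (int t = 0; t < N_TERM; t++) first[t] = BIT(t);   /* initialization */
    while (changed) {
        changed = 0;
        for (int r = 0; r < N_RULES; r++) {
            unsigned add = first_of_sequence(grammar[r].rhs, grammar[r].len);
            if ((first[grammar[r].lhs] | add) != first[grammar[r].lhs]) {
                first[grammar[r].lhs] |= add;
                changed = 1;
            }
        }
    }
}
```

After `compute_first()`, `first[EXPR]` holds IDENTIFIER and '(' while `first[REST]` holds '+' and the ε marker, matching the FIRST(rest_expr) set used by the predictive parser later in the deck.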
28FIRST sets example(1)
29FIRST sets example(2)
30FIRST sets example(3)
31The predictive parser (1)
32The predictive parser (2)
33Practice
- Find the FIRST sets of all alternatives of the
following grammar.
- E → T E'
- E' → + T E' | ε
- T → F T'
- T' → * F T' | ε
- F → ( E ) | id
34Nullable alternatives
- A complication arises with the case label for the
empty alternative (e.g. in rest_expression). Since it
does not itself start with any token, how can we
decide whether it is the correct alternative?
35FOLLOW sets
- FOLLOW sets
- Determining the set of tokens that can
immediately follow a given non-terminal N.
- LL(1) parser
- LL because the parser works from Left to right,
identifying the nodes in what is called Leftmost
derivation order.
- (1) because all choices are based on a one-token
look-ahead.
36Closure algorithm for computing the FOLLOW sets
37The first and follow sets
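Since the algorithm figure and the table of sets were lost, here is a minimal sketch of how FOLLOW sets can be computed by closure for the running expression grammar. It assumes the FIRST sets are already known and simply hard-codes them; the grammar encoding and the names `follow` and `compute_follow` are assumptions for this illustration.

```c
#include <assert.h>

enum { IDENT, LPAREN, RPAREN, PLUS, END, N_TERM,          /* terminals */
       INPUT = N_TERM, EXPR, TERM, PAREN, REST, N_SYM };  /* non-terminals */
#define BIT(t)  (1u << (t))
#define EPS     BIT(N_TERM)     /* marker bit: symbol can produce epsilon */
#define MAX_RHS 3

struct rule { int lhs; int len; int rhs[MAX_RHS]; };

static const struct rule grammar[] = {
    { INPUT, 2, { EXPR, END } },
    { EXPR,  2, { TERM, REST } },
    { TERM,  1, { IDENT } },
    { TERM,  1, { PAREN } },
    { PAREN, 3, { LPAREN, EXPR, RPAREN } },
    { REST,  2, { PLUS, EXPR } },
    { REST,  0, { 0 } },
};
#define N_RULES ((int)(sizeof grammar / sizeof grammar[0]))

/* FIRST sets, assumed to have been computed beforehand. */
static const unsigned first[N_SYM] = {
    [IDENT] = BIT(IDENT), [LPAREN] = BIT(LPAREN), [RPAREN] = BIT(RPAREN),
    [PLUS]  = BIT(PLUS),  [END]    = BIT(END),
    [INPUT] = BIT(IDENT) | BIT(LPAREN),
    [EXPR]  = BIT(IDENT) | BIT(LPAREN),
    [TERM]  = BIT(IDENT) | BIT(LPAREN),
    [PAREN] = BIT(LPAREN),
    [REST]  = BIT(PLUS) | EPS,
};

unsigned follow[N_SYM];         /* only the non-terminal entries are used */

void compute_follow(void) {
    int changed = 1;
    for (int s = 0; s < N_SYM; s++) follow[s] = 0;
    while (changed) {
        changed = 0;
        for (int r = 0; r < N_RULES; r++) {
            const struct rule *p = &grammar[r];
            /* trailer = tokens that can follow the symbol at position i */
            unsigned trailer = follow[p->lhs];
            assert(p->len <= MAX_RHS);
            for (int i = p->len - 1; i >= 0; i--) {
                int sym = p->rhs[i];
                if (sym >= N_TERM && (follow[sym] | trailer) != follow[sym]) {
                    follow[sym] |= trailer;
                    changed = 1;
                }
                if (first[sym] & EPS)
                    trailer |= first[sym] & ~EPS;  /* sym may vanish */
                else
                    trailer = first[sym];          /* sym hides what follows */
            }
        }
    }
}
```

For this grammar the sketch yields FOLLOW(rest_expression) = { EOF, ')' }, the set that labels the empty alternative in the predictive parser on the next slide.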
38Recall the predictive parser
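rest_expression → '+' expression | ε
FIRST(rest_expr) = { '+', ε }
FOLLOW(rest_expr) = { EOF, ')' }
void rest_expression(void) {
    switch (Token.class) {
    case '+':               /* FIRST of the non-empty alternative */
        token('+'); expression(); break;
    case EOF: case ')':     /* FOLLOW(rest_expr): take the empty alternative */
        break;
    default:
        error();
    }
}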
39LL(1) conflicts
40LL(1) conflicts
- FIRST/FIRST conflict
- term → IDENTIFIER
- | IDENTIFIER '[' expression ']'
- | '(' expression ')'
41LL(1) conflicts
- FIRST/FOLLOW conflict
- S → A a b      FIRST set: { a }
- A → a | ε      FIRST set: { a, ε }   FOLLOW set: { a }
42LL(1) conflicts
- left recursion
- expression → expression - term | term
- Look-ahead token
- The LL(1) method predicts the alternative Ak for a
non-terminal N when the look-ahead token is in
- FIRST(Ak) ∪ (if Ak is nullable then FOLLOW(N))
- LL(1) grammar
- No FIRST/FIRST conflicts
- No FIRST/FOLLOW conflicts
- No multiple nullable alternatives
- No non-terminal can have more than one nullable
alternative.
43Solve the LL(1) conflicts
- Two options
- Use a stronger parser
- Make the grammar LL(1)
44Making a grammar LL(1)
- manual labour
- rewrite grammar
- adjust semantic actions
- three rewrite methods
- left factoring
- substitution
- left-recursion removal
45Left-factoring
- term → IDENTIFIER
- | IDENTIFIER '[' expression ']'
- factor out common prefix
- term → IDENTIFIER after_identifier
- after_identifier → '[' expression ']' | ε
- '[' ∉ FOLLOW(after_identifier)
46Substitution
- A → a | B c | ε
- S → p A q
- replace non-terminal by its alternatives
- S → p a q | p B c q | p q
- Example
- S → A a b
- A → a | ε
- replace non-terminal by its alternatives
- S → a a b | a b
47Left-recursion removal
- Three types of left-recursion
- Direct left-recursion
- N → N α
- Indirect left-recursion
- Chain structure
- N → A
- A → B
- ...
- Z → N
- Hidden left-recursion
- N → α N (α can produce ε)
48Left-recursion removal
- N → N α | β
- replace by
- N → β M
- M → α M | ε
- example
- expression → expression - term | term
(here N = expression, α = - term, β = term)
- expression → term expression_tail_option
- expression_tail_option → - term expression_tail_option | ε
49Practice
- make the following grammar LL(1)
- expression → expression + term | expression - term | term
- term → term * factor | term / factor | factor
- factor → '(' expression ')' | func-call | identifier | constant
- func-call → identifier '(' expr-list? ')'
- expr-list → expression (',' expression)*
50Answers
- substitution
- F → '(' E ')' | ID '(' expr-list? ')' | ID | constant
- left factoring
- E → E ( + | - ) T | T
- T → T ( * | / ) F | F
- F → '(' E ')' | ID ( '(' expr-list? ')' )? | constant
- left recursion removal
- E → T ( ( + | - ) T )*
- T → F ( ( * | / ) F )*
51Undoing the semantic effects of grammar
transformations
- While it is often possible to transform our
grammar into a new grammar that is acceptable to
a parser generator and that generates the same
language, the new grammar usually assigns a
different structure to strings in the language
than our original grammar did.
- Fortunately, in many cases we are not really
interested in the structure but rather in the
semantics implied by it.
52Semantics
Non-left-recursive equivalent
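The figure for this slide is missing, so the following is only a sketch of the idea: after left-recursion removal the parser recurses to the right, yet the left-associative meaning of `-` can be kept by passing the value computed so far into the tail routine as a parameter. Single-digit terms and the function names are assumptions made to keep the example short.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *input;       /* points at the current character */

/* term -> DIGIT (a single digit keeps the sketch short) */
static int term(int *value) {
    if (!isdigit((unsigned char)*input)) return 0;
    *value = *input++ - '0';
    return 1;
}

/* expression_tail_option -> '-' term expression_tail_option | empty
   'left' carries the value of everything parsed so far, so the
   subtraction is still applied from left to right. */
static int expression_tail_option(int left, int *value) {
    int right;
    if (*input != '-') {        /* empty alternative: the result is complete */
        *value = left;
        return 1;
    }
    input++;
    if (!term(&right)) return 0;
    return expression_tail_option(left - right, value);
}

/* expression -> term expression_tail_option
   Returns 1 and stores the result in *value if text is a valid expression. */
int expression(const char *text, int *value) {
    int left;
    assert(text != NULL && value != NULL);
    input = text;
    return term(&left)
        && expression_tail_option(left, value)
        && *input == '\0';
}
```

With this arrangement `9-3-2` evaluates as (9-3)-2 = 4, the same semantics the original left-recursive grammar implies, even though the transformed grammar groups the operators to the right.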
53Automatic conflict resolution (1)
- There are two ways in which LL parsers can be
strengthened
- By increasing the look-ahead
- Distinguishing alternatives not by their first
token but by their first two tokens is called
LL(2).
- Disadvantage: the parser code can get much
bigger.
- By allowing dynamic conflict resolvers
- When a conflict arises during parsing, some
conditions are evaluated to resolve it.
- The parser generator LLgen requires a conflict
resolver to be placed on the first of two
conflicting alternatives.
54Automatic conflict resolution (2)
- If-else statement in C
- else_tail_option: both the FIRST set and the
FOLLOW set contain the token else
- Conflict resolver
55The LL(1) push-down automaton
- Transition table for an LL(1) parser
56Push-down automaton (PDA)
- Type of moves
- Prediction move
- Top of the prediction stack is a non-terminal N.
- N is removed from the stack.
- Look up the prediction table.
- Push the alternative of N onto the prediction
stack.
- Match move
- Top of the prediction stack is a terminal.
- Termination
- Parsing terminates when the prediction stack is
exhausted.
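The moves above can be sketched as a small table-driven parser for the running expression grammar. The transition table below was derived by hand from the FIRST and FOLLOW sets of that grammar; the one-character token encoding and the names `token_class` and `ll1_parse` are assumptions for this illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { IDENT, LPAREN, RPAREN, PLUS, END, N_TERM,          /* terminals */
       INPUT = N_TERM, EXPR, TERM, PAREN, REST, N_SYM };  /* non-terminals */
#define STACK_MAX 256

struct rule { int len; int rhs[3]; };
static const struct rule rules[] = {
    /* 0 */ { 2, { EXPR, END } },             /* input -> expression EOF */
    /* 1 */ { 2, { TERM, REST } },            /* expression -> term rest_expr */
    /* 2 */ { 1, { IDENT } },                 /* term -> IDENTIFIER */
    /* 3 */ { 1, { PAREN } },                 /* term -> parenthesized_expr */
    /* 4 */ { 3, { LPAREN, EXPR, RPAREN } },  /* '(' expression ')' */
    /* 5 */ { 2, { PLUS, EXPR } },            /* rest_expr -> '+' expression */
    /* 6 */ { 0, { 0 } },                     /* rest_expr -> empty */
};

#define X (-1)                                /* empty entry: syntax error */
static const int table[N_SYM - N_TERM][N_TERM] = {
    /*            i  (  )  +  EOF */
    /* INPUT */ { 0, 0, X, X, X },
    /* EXPR  */ { 1, 1, X, X, X },
    /* TERM  */ { 2, 3, X, X, X },
    /* PAREN */ { X, 4, X, X, X },
    /* REST  */ { X, X, 6, 5, 6 },            /* ')' and EOF come from FOLLOW */
};

/* Assumed encoding: one char per token, 'i' = IDENTIFIER, '\0' = EOF. */
static int token_class(char c) {
    switch (c) {
    case 'i': return IDENT;   case '(': return LPAREN;
    case ')': return RPAREN;  case '+': return PLUS;
    case '\0': return END;    default:  return -1;
    }
}

int ll1_parse(const char *text) {
    int stack[STACK_MAX], top = 0;
    assert(text != NULL);
    stack[top++] = INPUT;                     /* start symbol */
    while (top > 0) {                         /* stop when stack is exhausted */
        int look = token_class(*text);
        int sym = stack[--top];
        if (look < 0) return 0;
        if (sym < N_TERM) {                   /* match move */
            if (sym != look) return 0;
            if (look != END) text++;
        } else {                              /* prediction move */
            int r = table[sym - N_TERM][look];
            if (r < 0 || top + rules[r].len > STACK_MAX) return 0;
            for (int i = rules[r].len - 1; i >= 0; i--)
                stack[top++] = rules[r].rhs[i];  /* leftmost symbol on top */
        }
    }
    return 1;
}
```

The right-hand side is pushed in reverse so that its leftmost symbol ends up on top of the prediction stack, which is what makes the derivation leftmost.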
57Prediction move in an LL(1) PDA
58Match move in an LL(1) PDA
59Predictive parsing with an LL(1) PDA
60PDA example (1)
prediction stack: input
input: aap + ( noot + mies ) EOF
61PDA example (2)
prediction stack: input
input: aap + ( noot + mies ) EOF
replace non-terminal by transition entry
62PDA example (3)
prediction stack: expression EOF
input: aap + ( noot + mies ) EOF
63PDA example (4)
prediction stack: expression EOF
input: aap + ( noot + mies ) EOF
replace non-terminal by transition entry
64PDA example (5)
prediction stack: term rest-expr EOF
input: aap + ( noot + mies ) EOF
65PDA example (6)
prediction stack: term rest-expr EOF
input: aap + ( noot + mies ) EOF
replace non-terminal by transition entry
66PDA example (7)
- Please continue!!
- Example of parsing (i+i)+i
67LLgen
- LLgen is part of the Amsterdam Compiler Kit
- takes an LL(1) grammar plus semantic actions in C
and generates a recursive descent parser
- The non-terminals in the grammar can have
parameters, and rules can have local variables,
both again expressed in C.
- LLgen features
- repetition operators
- advanced error handling
- parameter passing
- control over semantic actions
- dynamic conflict resolvers
68LLgen
- start from LR(1) grammar
- make grammar LL(1)
- use repetition operators
%token DIGIT;
main : [ line ]+ ;
line : expr '\n' ;
expr : term [ '+' term ]* ;
term : factor [ '*' factor ]* ;
factor : '(' expr ')' | DIGIT ;
- add semantic actions
- attach parameters to grammar rules
- insert C-code between the symbols
LLgen
69Minimal non-left-recursive grammar for expressions
70LLgen code for a parser
Grammar
Semantics
71LLgen code for a parser
- The code from previous page resides in a file
called parser.g. LLgen converts the file to one
called parser.c, which contains a recursive
descent parser.
72LLgen interface to lexical analyzer
73LLgen interface to back-end
- LLgen handles syntax errors by inserting missing
tokens and deleting unexpected tokens.
- LLmessage() is invoked to notify the lexical
analyzer.
74Creating a bottom-up parser automatically
- Left-to-right parse, Rightmost derivation
- create a node when all children are present
- handle: nodes representing the right-hand side
of a production
75LR(0) Parsing
- Theoretically important but too weak to be
useful.
- running example: expression grammar
- input → expression EOF
- expression → expression + term | term
- term → IDENTIFIER | ( expression )
- short-hand notation
- Z → E $
- E → E + T | T
- T → i | ( E )
76LR(0) Parsing
- keep track of progress inside potential
handles when consuming input tokens
- LR items: N → α • β
- initial set
S0
Z → • E $
E → • E + T
E → • T
T → • i
T → • ( E )
77ε-closure algorithm for LR(0)
The important part is the inference rule: it
predicts new handle hypotheses from the
hypothesis that we are looking for a certain
non-terminal, and is sometimes called the
prediction rule. It corresponds to an ε move, in
that it allows the automaton to move to another
state without consuming input.
Reduce item: an item with the dot at the end.
Shift item: all the others.
78Transition Diagram
(Figure: part of the LR(0) transition diagram; edges are labelled T, i and E. Recoverable states:)
S1: T → i •
S2: E → T •
S4: E → E + • T   T → • i   T → • ( E )
S6: Z → E $ •
79LR(0) parsing example (1)
Z → E $   E → E + T   E → T   T → i   T → ( E )
stack: S0
input: i + i $
- shift input token (i) onto the stack
- compute new state
80LR(0) parsing example (2)
stack: S0 i S1
input: + i $
- reduce handle on top of the stack
- compute new state
81LR(0) parsing example (3)
stack: S0 T S2
input: + i $
- reduce handle on top of the stack
- compute new state
82LR(0) parsing example (4)
stack: S0 E S3
input: + i $
- shift input token on top of the stack
- compute new state
83LR(0) parsing example (5)
stack: S0 E S3 + S4
input: i $
- shift input token on top of the stack
- compute new state
84LR(0) parsing example (6)
stack: S0 E S3 + S4 i S1
input: $
- reduce handle on top of the stack
- compute new state
85LR(0) parsing example (7)
stack: S0 E S3 + S4 T S5
input: $
- reduce handle on top of the stack
- compute new state
86LR(0) parsing example (8)
stack: S0 E S3
input: $
- shift input token on top of the stack
- compute new state
87LR(0) parsing example (9)
stack: S0 E S3 $ S6
input: (empty)
- reduce handle on top of the stack
- compute new state
88LR(0) parsing example (10)
stack: S0 Z
input: (empty)
89Precomputing the item set (1)
90Precomputing the item set (2)
91Complete transition diagram
92The LR push-down automaton
- Two major moves and a minor move
- Shift move
- Removes the first token from the present input and
pushes it onto the stack.
- Reduce move
- N → α
- The symbols of α are removed from the stack.
- N is then pushed onto the stack.
- Termination
- The input has been parsed successfully when it
has been reduced to the start symbol.
93GOTO and ACTION tables
94LR(0) parsing of the input i+i
95LR comments
- Bottom-up parsing, unlike top-down parsing, has
no problems with left-recursion.
- On the other hand, bottom-up parsing has a slight
problem with right-recursion.
96LR(0) conflicts (1)
- shift-reduce conflict
- array indexing: T → i [ E ]
- T → i • [ E ] (shift)
- T → i • (reduce)
- ε-rule: RestExpr → ε
- Expr → Term • RestExpr (shift)
- RestExpr → • (reduce)
97LR(0) conflicts (2)
- reduce-reduce conflict
- assignment statement: Z → V := E $
- V → i • (reduce)
- T → i • (reduce)
- (Different reduce rules)
- a typical LR(0) table contains many conflicts
98Handling LR(0) conflicts
- Use a one-token look-ahead
- Use a two-dimensional ACTION table
- different construction of ACTION table
- SLR(1): Simple LR
- LR(1)
- LALR(1): Look-Ahead LR
99SLR(1) parsing
- A handle should not be reduced to a non-terminal
N if the look-ahead is a token that cannot follow
N.
- reduce N → α iff token ∈ FOLLOW(N)
- FOLLOW(N)
- FOLLOW(Z) = { $ }
- FOLLOW(E) = { +, ), $ }
- FOLLOW(T) = { +, ), $ }
100SLR(1) ACTION table
shift
101SLR(1) ACTION/GOTO table
1: Z → E $   2: E → T   3: E → E + T   4: T → i   5: T → ( E )
sn: shift to state n; rn: reduce by rule n
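The table itself did not survive extraction. The sketch below rebuilds an SLR(1) ACTION/GOTO table by hand for the five rules above and drives it with the shift and reduce moves of the LR push-down automaton. The state numbers follow the S0 to S9 labels as far as they can be recovered from the parsing example, the shift of $ into S6 is folded into an accept action, and the names `token_class` and `lr_parse` are assumptions for this illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { T_ID, T_PLUS, T_LP, T_RP, T_END, N_TOK };   /* i + ( ) $ */
enum { NT_E, NT_T, N_NT };
#define N_STATES  10
#define ACC       100           /* accept (stands in for shifting $ to S6) */
#define STACK_MAX 256

/* ACTION: n > 0 shifts to state n, -r reduces by rule r, 0 is an error.
   Reduce entries appear only under tokens in FOLLOW of the rule's
   left-hand side, which is the SLR(1) refinement over LR(0). */
static const int action[N_STATES][N_TOK] = {
    /*         i   +   (   )   $  */
    /* S0 */ { 1,  0,  7,  0,  0 },
    /* S1 */ { 0, -4,  0, -4, -4 },     /* T -> i .     */
    /* S2 */ { 0, -2,  0, -2, -2 },     /* E -> T .     */
    /* S3 */ { 0,  4,  0,  0, ACC },    /* Z -> E . $   */
    /* S4 */ { 1,  0,  7,  0,  0 },     /* E -> E + . T */
    /* S5 */ { 0, -3,  0, -3, -3 },     /* E -> E + T . */
    /* S6 */ { 0,  0,  0,  0,  0 },     /* unused: folded into ACC */
    /* S7 */ { 1,  0,  7,  0,  0 },     /* T -> ( . E ) */
    /* S8 */ { 0,  4,  0,  9,  0 },     /* T -> ( E . ) */
    /* S9 */ { 0, -5,  0, -5, -5 },     /* T -> ( E ) . */
};
static const int goto_table[N_STATES][N_NT] = {
    [0] = { 3, 2 }, [4] = { 0, 5 }, [7] = { 8, 2 },
};
/* Indexed by rule number; entries 0 and 1 are unused. */
static const int rule_len[] = { 0, 0, 1, 3, 1, 3 };
static const int rule_lhs[] = { 0, 0, NT_E, NT_E, NT_T, NT_T };

static int token_class(char c) {
    switch (c) {
    case 'i': return T_ID;   case '+': return T_PLUS;
    case '(': return T_LP;   case ')': return T_RP;
    case '\0': return T_END; default:  return -1;
    }
}

/* The stack holds states only; the grammar symbols are implied by them. */
int lr_parse(const char *text) {
    int stack[STACK_MAX], top = 0;
    assert(text != NULL);
    stack[0] = 0;
    for (;;) {
        int tok = token_class(*text);
        int act;
        if (tok < 0) return 0;
        act = action[stack[top]][tok];
        if (act == ACC) return 1;
        if (act == 0) return 0;
        if (act > 0) {                          /* shift move */
            if (top + 1 >= STACK_MAX) return 0;
            stack[++top] = act;
            text++;
        } else {                                /* reduce move */
            int r = -act;
            top -= rule_len[r];                 /* pop the handle */
            stack[top + 1] = goto_table[stack[top]][rule_lhs[r]];
            top++;
        }
    }
}
```

Tracing `lr_parse("i+i")` by hand visits S1, S2, S3, S4, S1, S5 and S3 in that order, the same sequence of states as the ten-step parsing example earlier in the deck.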
102Example of resolving conflicts (1)
1: Z → E $   2: E → T   3: E → E + T   4: T → i   5: T → ( E )   6: T → i [ E ]
103Example of resolving conflicts (2)
T → i •
T → i • [ E ]
104Unfortunately
- SLR(1) leaves many shift-reduce conflicts
unsolved
- problem: the FOLLOW(N) set is the union of the
look-aheads of all alternatives of N in all
states
- example
- S → A | x b
- A → a A b | B
- B → x
FOLLOW(S) = { $ }   FOLLOW(A) = { b, $ }   FOLLOW(B) = { b, $ }
105SLR(1) automaton
106LR(1) parsing
- The LR(1) technique does not rely on FOLLOW sets,
but rather keeps the specific look-ahead with
each item.
- LR(1) item: N → α • β {σ}
- ε-closure for LR(1) item sets
- if set S contains an item P → α • N β {σ} then
- for each production rule N → γ
- S must contain the item N → • γ {τ}
- where τ = FIRST(β {σ})
107Creating look-ahead sets
- Extended definition of FIRST sets
- If FIRST(β) does not contain ε, FIRST(β {σ}) is
just equal to FIRST(β); if β can produce ε,
FIRST(β {σ}) contains all the tokens in FIRST(β),
excluding ε, plus the tokens in σ.
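As a small illustration of this extended definition, the look-ahead set of a predicted LR(1) item can be computed with two bit-set operations. The token numbering, the ε marker bit and the name `first_with_lookahead` are assumptions for this sketch.

```c
#include <assert.h>

enum { T_I, T_PLUS, T_LP, T_RP, T_END, N_TOK };   /* hypothetical tokens */
#define BIT(t) (1u << (t))
#define EPS    BIT(N_TOK)       /* marker bit for the empty string */

/* Extended FIRST for an LR(1) item: first_beta is the FIRST set of the
   symbols after the dot, sigma is the look-ahead set of the item. */
unsigned first_with_lookahead(unsigned first_beta, unsigned sigma) {
    assert(!(sigma & EPS));     /* a look-ahead set holds tokens only */
    if (first_beta & EPS)
        return (first_beta & ~EPS) | sigma;   /* the rest can vanish */
    return first_beta;          /* the look-ahead can never show through */
}
```

For example, with FIRST of the remainder equal to { +, ε } and look-ahead set { ) }, the function returns { +, ) }; with FIRST equal to { + } alone, the look-ahead set is ignored.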
108LR(1) automaton
109LR(1) parsing comments
- The LR(1) automaton is more discriminating than the
SLR(1) automaton.
- In fact, it is so strong that any language that
can be parsed from left to right with a one-token
look-ahead in linear time can be parsed using
LR(1).
- LR(1) tables are big
- Combine equal sets by merging look-ahead sets:
LALR(1).
110LALR(1)
- S3 and S10 are similar in that they are equal if
one ignores the look-ahead sets, and so are S4
and S9, S6 and S11, and S8 and S12.
111LALR(1) automaton
112Practice
- Derive the LALR(1) ACTION/GOTO table for the
grammar in Fig. 2.95
113Making a grammar LR(1) or not
- Although the chances for a grammar to be LR(1)
are much larger than those of being SLR(1) or LL(1),
one often encounters a grammar that still is not
LR(1). The reason is generally that the grammar
is ambiguous.
- For example
- if_statement → if ( expression ) statement
- | if ( expression ) statement else statement
- statement → if_statement
- The statement: if (x > 0) if (y > 0) p = 0; else q = 0;
114Possible syntax trees (1)
115Possible syntax trees (2)
116Resolving shift-reduce conflicts (1)
- The longest possible sequence of grammar symbols
is taken for reduction.
- In a shift-reduce conflict, do shift.
- Another example
input: i + i + i
E → E • + E (shift)
E → E + E • (reduce)
117Resolving shift-reduce conflicts (2)
- The use of precedences between tokens
- Example: a shift-reduce conflict on t
- P → α • t β (shift item)
- Q → γ u R • {t} (reduce item)
- where R is either empty or one non-terminal.
- If the look-ahead is t, we perform one of the
following three actions
- If symbol u has a higher precedence than symbol
t, we reduce.
- If t has a higher precedence than symbol u, we
shift.
- If both have equal precedence, we also shift.
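The three-way rule above fits in a few lines of C. The precedence levels assigned to the operators and the names `precedence` and `resolve_conflict` are assumptions for this sketch; the equal-precedence case follows the slide and shifts.

```c
#include <assert.h>

enum lr_action { LR_SHIFT, LR_REDUCE };

/* Hypothetical precedence levels: a higher number binds tighter,
   0 means the token has no precedence at all. */
static int precedence(char token) {
    switch (token) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;
    }
}

/* u is the operator inside the reduce item, t is the look-ahead token. */
enum lr_action resolve_conflict(char u, char t) {
    assert(precedence(u) > 0 && precedence(t) > 0);
    if (precedence(u) > precedence(t))
        return LR_REDUCE;       /* u binds tighter: finish its handle first */
    return LR_SHIFT;            /* t binds tighter, or equal precedence */
}
```

With `E * E` on the stack and `+` as look-ahead the sketch reduces; with `E + E` on the stack and `*` as look-ahead it shifts. Yacc and bison refine the equal-precedence case further by consulting the declared associativity of the operator.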
118Bottom-up parser yacc/bison
- The most widely used parser generator is yacc
- Yacc is an LALR(1) parser generator
- A yacc look-alike called bison is provided by GNU
119A very high-level view of text analysis techniques
120Yacc code example (constructing parser tree)
121Yacc code example (auxiliary code)