Title: Parsing I
1Parsing I
Please read 4.1 to 4.3
- Parsing vs. Scanning
- Context Free Grammars
- Bottom-up and Top-down parsing
Token Stream → Parser
2Recall the Structure of a Compiler
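character stream
→ Lexical Analysis → token stream
→ Parsing → syntax tree
→ Semantic Analysis → syntax tree
→ Intermediate Code Generation → intermediate code
→ Optimization → intermediate code
→ Code Generation → target machine code
Front End: Lexical Analysis, Parsing, Semantic Analysis, Intermediate Code Generation
Back End: Optimization, Code Generation
Symbol Table: shared by the phases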
3Here's where we are now
Same compiler pipeline as on the previous slide. We are now at the Parsing phase: token stream in, syntax tree out.
4An Exercise
Can you create a regular expression to determine if brackets are matched?
[Figure: examples of good (matched) and bad (unmatched) bracket strings]
5Unbounded counting
Can you create a regular expression to determine if brackets are matched?
We would need to be able to count any number of parentheses to determine a match. A regular expression can't do unbounded counting. (Why not?)
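One way to see the counting requirement: a matcher has to remember how many brackets are currently open, and that number has no upper limit. A regular expression corresponds to a finite automaton with a fixed number of states, so it can only tell apart a bounded number of nesting depths. A minimal sketch of the counting approach follows; the function name `brackets_match` is our own and does not come from the slides.

```python
def brackets_match(text: str) -> bool:
    """Return True if every '(' is closed by a later ')' and nothing closes early."""
    depth = 0  # number of currently open brackets; this can grow without bound
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared with no matching '('
                return False
    return depth == 0  # everything that was opened must have been closed


print(brackets_match("(()())"))  # True
print(brackets_match("(()"))     # False
```

The single integer `depth` is exactly the state a finite automaton cannot keep for arbitrarily deep input.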
6Regular Expressions
Regular expressions lack the expressive power to specify nested syntax, so we can't use them to describe the syntax of a programming language.
7Context-free Grammars (CFGs)
- An important class of formal grammars
- Generally sufficient to express the syntax of modern programming languages
- Efficient implementations exist
  - O(n^3) worst-case time
  - O(n) time for most grammars used in practice
- A grammar is described using rules (productions), which then map to the nodes of syntax trees.
8A C Example
A simple C program fragment:
do i++ while (i < 1000)
What's a reasonable syntax tree?
9Example Syntax Tree
A simple C program fragment:
do i++ while (i < 1000)
Syntax tree:
do-while
    <id,1>          (the nested statement)
    <               (the loop condition)
        <id,1>
        1000
do statement while ( expression )
This is a way we might describe this type of statement (leaving out the details of the expression and the nested statement).
10Definition of a Context-free Grammar
1. A set of terminal symbols (tokens). These are the elementary symbols of the language defined by the grammar.
2. A set of nonterminal symbols (syntactic variables). Each nonterminal represents a set of strings of terminals.
3. A set of productions, where each production consists of a nonterminal called the head or left side of the production, an arrow, and a sequence of terminals and/or nonterminals called the body.
4. A designation of one nonterminal as the start symbol.
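The four parts of the definition map directly onto a small data structure. The sketch below is illustrative only: the class names `Production` and `Grammar` and the method `is_well_formed` are our own choices, and the example is a small grammar for balanced parentheses.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Production:
    head: str              # the nonterminal on the left side
    body: tuple[str, ...]  # terminals and/or nonterminals; () is the empty body


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset[str]
    nonterminals: frozenset[str]
    productions: tuple[Production, ...]
    start: str

    def is_well_formed(self) -> bool:
        """Check that the four parts are consistent with the definition."""
        symbols = self.terminals | self.nonterminals
        return (
            self.start in self.nonterminals
            and not (self.terminals & self.nonterminals)
            and all(
                p.head in self.nonterminals and all(s in symbols for s in p.body)
                for p in self.productions
            )
        )


# Example: S -> ( S ) S | (empty)
balanced = Grammar(
    terminals=frozenset({"(", ")"}),
    nonterminals=frozenset({"S"}),
    productions=(Production("S", ("(", "S", ")", "S")), Production("S", ())),
    start="S",
)
print(balanced.is_well_formed())  # True
```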
11Just suppose
What are the possible statements if this is all our language could do?
do i++ while (i < 1000)
What are the possible expressions?
Statements are lines of code. We'll consider a program to be statement statement statement etc. An expression is a piece of code that returns a value.
12Just suppose
What are the possible statements if this is all our language could do?
do i++ while (i < 1000)
What are the possible expressions?
Statements:
    do statement while ( expression )
    id++
Expressions:
    id < num
13Definition of a Context-free Grammar
1. A set of terminal symbols (tokens). These are the elementary symbols of the language defined by the grammar.
   do  while  (  )  <  ++  id  num
ε (epsilon) means nothing, that is, the empty string.
14Our start nonterminal
4. A designation of one nonterminal as the start
symbol.
program → ????
15Our start nonterminal
3. A set of productions, where each production consists of a nonterminal called the head or left side of the production, an arrow, and a sequence of terminals and/or nonterminals called the body.
program → statements    (a production)
Pretty easy, huh? What's the production for statements?
16Statements
program → statements
statements → statement statements | ε
There's no * like in regular expressions. So, we have either a statement followed by more statements or we have nothing. How does i++ j++ k++ fit into these productions?
17Statement
program → statements
statements → statement statements | ε
statement → do statement while ( expression ) | id++
statements
Expressions?
18Statement
program → statements
statements → statement statements | ε
statement → do statement while ( expression ) | id++
statements
expression → id < num
This is a complete grammar for this minimal language.
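A grammar this small can be turned into a recognizer almost mechanically, with one function per nonterminal. The sketch below assumes the productions program → statements, statements → statement statements | ε, statement → do statement while ( expression ) | id++, and expression → id < num, and it assumes the scanner has already produced token names such as "id", "num", and "++". The function name `accepts` is our own.

```python
def accepts(tokens: list[str]) -> bool:
    """Return True if the token list is a program in the minimal language."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok: str) -> None:
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at token {pos}, found {peek()!r}")
        pos += 1

    def statements() -> None:
        # statements -> statement statements | epsilon
        # Choose epsilon as soon as the next token cannot start a statement.
        while peek() in ("do", "id"):
            statement()

    def statement() -> None:
        if peek() == "do":
            # statement -> do statement while ( expression )
            expect("do")
            statement()
            expect("while")
            expect("(")
            expression()
            expect(")")
        else:
            # statement -> id++
            expect("id")
            expect("++")

    def expression() -> None:
        # expression -> id < num
        expect("id")
        expect("<")
        expect("num")

    try:
        statements()              # program -> statements
        return pos == len(tokens)  # all input must be consumed
    except SyntaxError:
        return False


# do i++ while (i < 1000), after scanning
print(accepts(["do", "id", "++", "while", "(", "id", "<", "num", ")"]))  # True
```

Each function mirrors one production, which is why top-down parsers of this kind are easy to write by hand.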
19Statement
program → statements
statements → statement statements | ε
statement → do statement while ( expression ) | expression
statements
expression → id < num | id++
20Balanced Parentheses CFG
program → S
S → ( S ) S | ε
If a grammar accepts a string, there exists at least one derivation of that string using the productions one at a time.
()      program ⇒ S ⇒ ( S ) S ⇒ ( ) S ⇒ ( )
(())    program ⇒ S ⇒ ( S ) S ⇒ ( ( S ) S ) S ⇒ ( ( ) ) S ⇒ ( ( ) )
()()
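Derivations like these can be produced mechanically. The sketch below recognizes a string with the grammar S → ( S ) S | ε and records the leftmost derivation one production at a time. It applies each ε step separately, so it prints a few more intermediate forms than a hand-written derivation might. The function name `derive` is our own.

```python
from typing import Optional


def derive(text: str) -> Optional[list[str]]:
    """Return a leftmost derivation of text from S, or None if text is rejected."""
    forms = ["S"]
    pos = 0

    def expand_leftmost(body: str) -> None:
        # Replace the leftmost S in the latest sentential form with the body.
        forms.append(forms[-1].replace("S", body, 1))

    def parse_s() -> bool:
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            expand_leftmost("(S)S")  # S -> ( S ) S
            pos += 1
            if not parse_s():
                return False
            if pos >= len(text) or text[pos] != ")":
                return False
            pos += 1
            return parse_s()
        expand_leftmost("")          # S -> epsilon
        return True

    if parse_s() and pos == len(text):
        return forms
    return None


print(" => ".join(derive("()")))  # S => (S)S => ()S => ()
print(derive("(()"))              # None
```

Calling `derive("()()")` gives a derivation for the third example string in the same way.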
21Expressions
What about more general expressions including +, *, and parentheses and any mix of numbers and ids?
a b a b b c d b c d b c d a (12
c) 5 ((a b) (c d) 4)
22Expressions
expression → expression + t | t
t → t * f | f
f → ( expression ) | id | num
Create a parse tree for each one. Each production becomes an interior node of the tree.
a b a b b c d b c d b c d a (12
c) 5 ((a b) (c d) 4)
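To check parse trees built by hand, the following sketch constructs one for the grammar expression → expression + t | t, t → t * f | f, f → ( expression ) | id | num. Every interior node is a tuple whose first element names the head of the production applied. Two assumptions to note: the left-recursive productions are handled with loops (which still yields the left-leaning trees the grammar describes), and leaves hold the source text of each id or num rather than the token name. The example inputs are our own, not the ones from the slide.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|[()+*])")
OPERATORS = {"(", ")", "+", "*"}


def tokenize(text: str) -> list[str]:
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text: str):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expression():
        node = ("e", term())                   # e -> t
        while peek() == "+":
            take()
            node = ("e", node, "+", term())    # e -> e + t
        return node

    def term():
        node = ("t", factor())                 # t -> f
        while peek() == "*":
            take()
            node = ("t", node, "*", factor())  # t -> t * f
        return node

    def factor():
        tok = take()
        if tok == "(":
            inner = expression()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return ("f", "(", inner, ")")      # f -> ( e )
        if tok is None or tok in OPERATORS:
            raise SyntaxError(f"unexpected token {tok!r}")
        return ("f", tok)                      # f -> id | num

    tree = expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse("a + b"))
# ('e', ('e', ('t', ('f', 'a'))), '+', ('t', ('f', 'b')))
```

In a tree for an input such as `b + c * d`, the `*` ends up below the `+`, which is how the grammar encodes precedence.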
23Is a regular language also a context-free language?
For any given regular expression, can we construct a context-free grammar that accepts the same strings?
24Yep, Regular Languages are Context-Free Languages
Regular expression    Grammar
ε                     s → ε
a                     s → a
a b                   s → a b
a | b                 s → a | b
a*                    s → s1,  s1 → a s1 | ε
Just apply these rules recursively and you can convert a regular expression to a context-free grammar. (What about parentheses?)
a ab (ab)
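Applying the rules recursively can be sketched in a few lines. The code below assumes the regular expression has already been parsed into nested tuples, where the node kinds "eps", "sym", "cat", "alt", and "star" are our own naming. Parentheses then need no rule of their own, because they only decide how the tuples nest. Each sub-expression gets a fresh nonterminal s1, s2, and so on.

```python
from itertools import count


def regex_to_cfg(regex):
    """Return (start_symbol, productions) for a regex given as nested tuples."""
    fresh = count(1)
    productions = []  # list of (head, body) pairs; an empty body means epsilon

    def build(node):
        head = f"s{next(fresh)}"
        kind = node[0]
        if kind == "eps":
            productions.append((head, []))
        elif kind == "sym":
            productions.append((head, [node[1]]))
        elif kind == "cat":
            left, right = build(node[1]), build(node[2])
            productions.append((head, [left, right]))
        elif kind == "alt":
            left, right = build(node[1]), build(node[2])
            productions.append((head, [left]))
            productions.append((head, [right]))
        elif kind == "star":
            inner = build(node[1])
            productions.append((head, [inner, head]))  # one more repetition
            productions.append((head, []))             # or stop
        else:
            raise ValueError(f"unknown regex node {kind!r}")
        return head

    return build(regex), productions


# (ab)* as a tree: star(cat(a, b))
start, rules = regex_to_cfg(("star", ("cat", ("sym", "a"), ("sym", "b"))))
for head, body in rules:
    print(head, "->", " ".join(body) or "epsilon")
```

For this input the start symbol is `s1`, with productions `s1 -> s2 s1`, `s1 -> epsilon`, and `s2 -> s3 s4` among the output.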
25Grammars to General Parsers
Inputs: Context-free Grammar G, Token stream s
Parser
Outputs: Yes, if s in L(G); No, otherwise. Error messages.
A general parser (syntax analyzer) indicates whether a token stream is in the language of a given grammar. It's an acceptor. Syntax trees are a (useful) side effect that is easy to add.
26Parsing Methods
Top-down
We start with the start symbol and expand from
there. We build the tree from the root down to
the leaves.
Bottom-up
We start from tokens (leaves of the tree) and
build the tree upward.
Top-down parsers are easy to write, but place
restrictions on the grammar. Bottom-up parsers
are usually machine generated, but can
accommodate a larger set of grammars.
27Bottom-up Parsing
e → e + t | t
t → t * f | f
f → ( e ) | id | num
(2 + 3) * 7
( num + num ) * num      Tokenized
28Bottom-up Parsing
(2 + 3) * 7
( num + num ) * num
( f + f ) * f
Apply f → num three times (from f → ( e ) | id | num)
29Bottom-up Parsing
(2 + 3) * 7
( num + num ) * num
( f + f ) * f
( t + f ) * f
Apply t → f (from t → t * f | f)
30Bottom-up Parsing
(2 + 3) * 7
( num + num ) * num
( f + f ) * f
( t + f ) * f
( e + f ) * f
Apply e → t (from e → e + t | t)
31Bottom-up Parsing
(2 + 3) * 7
( num + num ) * num
( f + f ) * f
( t + f ) * f
( e + f ) * f
( e + t ) * f
( e ) * f
f * f
t * f
t
e
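The reductions above can be automated with a stack: shift tokens onto it, and reduce whenever the top of the stack matches the body of a production. The sketch below hard-codes the shift/reduce decisions for this one grammar using a single token of lookahead, where a machine-generated bottom-up parser would use a table. It reduces as early as possible, so the order of its steps differs from the hand-worked sequence, but each step is still one production applied in reverse. The function name `shift_reduce` is our own.

```python
def shift_reduce(tokens: list[str]) -> list[str]:
    """Return the sentential forms of a bottom-up parse; raise SyntaxError on bad input."""
    stack: list[str] = []
    rest = list(tokens)
    forms = [" ".join(tokens)]

    def reduce(n: int, head: str) -> None:
        # Replace the top n stack symbols (a production body) with its head.
        del stack[len(stack) - n:]
        stack.append(head)
        forms.append(" ".join(stack + rest))

    def shift() -> None:
        stack.append(rest.pop(0))

    while True:
        top = stack[-1] if stack else None
        lookahead = rest[0] if rest else None
        if top in ("num", "id"):
            reduce(1, "f")                               # f -> id | num
        elif top == ")" and stack[-3:] == ["(", "e", ")"]:
            reduce(3, "f")                               # f -> ( e )
        elif top == "f" and stack[-3:] == ["t", "*", "f"]:
            reduce(3, "t")                               # t -> t * f
        elif top == "f":
            reduce(1, "t")                               # t -> f
        elif top == "t" and lookahead == "*":
            shift()                                      # the term continues
        elif top == "t" and stack[-3:] == ["e", "+", "t"]:
            reduce(3, "e")                               # e -> e + t
        elif top == "t":
            reduce(1, "e")                               # e -> t
        elif top == "e" and lookahead is None and stack == ["e"]:
            return forms                                 # accept
        elif top == "e" and lookahead in ("+", ")"):
            shift()
        elif top in (None, "(", "+", "*") and lookahead in ("(", "id", "num"):
            shift()
        else:
            raise SyntaxError(f"cannot continue at: {' '.join(stack + rest)}")


for form in shift_reduce("( num + num ) * num".split()):
    print(form)
```

The last form printed is `e`, the start symbol, which is what acceptance means for a bottom-up parser.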
32Top-down parsing
Begin with the start symbol. For each iteration, replace one nonterminal with the body of one of its productions. Keep going until we have constructed the input token stream.
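The replacement step itself is simple to express in code. The sketch below replays a leftmost derivation for the expression grammar e → e + t | t, t → t * f | f, f → ( e ) | id | num. The production choices are supplied by hand, which is an assumption of this example: a real top-down parser has to pick each production by looking at the input. The names `GRAMMAR`, `CHOICES`, and `leftmost_derivation` are our own.

```python
GRAMMAR = {
    "e": [["e", "+", "t"], ["t"]],
    "t": [["t", "*", "f"], ["f"]],
    "f": [["(", "e", ")"], ["id"], ["num"]],
}


def leftmost_derivation(start, choices):
    """Apply each chosen production to the leftmost nonterminal; return every form."""
    form = [start]
    forms = [" ".join(form)]
    for head, index in choices:
        position = next((k for k, sym in enumerate(form) if sym in GRAMMAR), None)
        if position is None or form[position] != head:
            raise ValueError(f"cannot apply a production for {head!r} to: {' '.join(form)}")
        form[position:position + 1] = GRAMMAR[head][index]  # replace head with body
        forms.append(" ".join(form))
    return forms


# (head, index of the production to apply) for the token stream ( num + num ) * num
CHOICES = [
    ("e", 1), ("t", 0), ("t", 1), ("f", 0), ("e", 0), ("e", 1),
    ("t", 1), ("f", 2), ("t", 1), ("f", 2), ("f", 2),
]
steps = leftmost_derivation("e", CHOICES)
print(steps[-1])  # ( num + num ) * num
```

Printing every entry of `steps` shows the sentential forms from the start symbol down to the token stream.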
33Top-down Parsing
e → e + t | t
t → t * f | f
f → ( e ) | id | num
(2 + 3) * 7
( num + num ) * num      Tokenized
e                        Begin with start symbol
34Top-down Parsing
(2 + 3) * 7
( num + num ) * num
e
t
Apply e → t (from e → e + t | t)
35Top-down Parsing
(2 + 3) * 7
( num + num ) * num
e
t
t * f
Apply t → t * f (from t → t * f | f)
37Top-down Parsing
(2 + 3) * 7
( num + num ) * num
e
t
t * f
f * f
( e ) * f
( e + t ) * f
( t + t ) * f
( f + t ) * f
( num + t ) * f
( num + f ) * f
( num + num ) * f
( num + num ) * num