COP4020 Programming Languages - PowerPoint PPT Presentation

1
COP4020 Programming Languages

Syntax
Prof. Robert van Engelen

2
Overview

Tokens and regular expressions
Syntax and context-free grammars
Grammar derivations
More about parse trees
Top-down and bottom-up parsing
Recursive descent parsing

3
Tokens

Tokens are the basic building blocks of a programming language
Keywords, identifiers, literal values, operators, punctuation
We saw that the first compiler phase (scanning) splits up a character stream into tokens
Tokens have a special role with respect to:
Free-format languages: source program is a sequence of tokens and horizontal/vertical position of a token on a page is unimportant (e.g. Pascal)
Fixed-format languages: indentation and/or position of a token on a page is significant (early Basic, Fortran, Haskell)
Case-sensitive languages: upper- and lowercase are distinct (C, C++, Java)
Case-insensitive languages: upper- and lowercase are identical (Ada, Fortran, Pascal)

4
Defining Token Patterns with Regular Expressions

The makeup of a token is described by a regular expression (RE)
A regular expression r is one of
A character (an element of the alphabet Σ), e.g. Σ = {a, b, c}: a
Empty, denoted by ε
Concatenation: a sequence of regular expressions r1 r2 r3 … rn
Alternation: regular expressions separated by a bar, r1 | r2
Repetition: a regular expression followed by a star (Kleene star), r*

5
Example Regular Definitions for Tokens

digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
unsigned_integer → digit digit*
signed_integer → (+ | - | ε) unsigned_integer
relop → < | <= | <> | > | >= | =
letter → a | b | … | z | A | B | … | Z
id → letter (letter | digit)*
Cannot use recursive definitions! For example, this is not allowed:
digits → digit digits | digit
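As a rough illustration, the regular definitions above can be transcribed into the syntax of Python's `re` module and combined into a small scanner. The `tokenize` helper, the token names used as group names, and the decision to skip whitespace are assumptions of this sketch, not something the slides prescribe.

```python
import re

# Regular definitions from the slide, written in Python regex syntax.
DIGIT = r"[0-9]"
LETTER = r"[a-zA-Z]"
TOKEN_SPEC = [
    ("signed_integer", rf"[+-]?{DIGIT}{DIGIT}*"),
    ("relop", r"<=|<>|>=|<|>|="),          # longer alternatives are listed first
    ("id", rf"{LETTER}(?:{LETTER}|{DIGIT})*"),
    ("skip", r"\s+"),                      # whitespace separates tokens (assumption)
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Split text into (token_name, lexeme) pairs; whitespace is dropped."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Calling `tokenize("x1 <= -42")` yields an `id`, a `relop`, and a `signed_integer` token. Note that each definition is non-recursive, in line with the restriction stated above.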

6
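Finite State Machines = Regular Expression Recognizers
relop → < | <= | <> | > | >= | =
[Transition diagram for relop]
start: state 0
state 0, on <: go to state 1
state 1, on =: state 2, return(relop, LE)
state 1, on >: state 3, return(relop, NE)
state 1, on other: state 4, return(relop, LT)
state 0, on =: state 5, return(relop, EQ)
state 0, on >: go to state 6
state 6, on =: state 7, return(relop, GE)
state 6, on other: state 8, return(relop, GT)
id → letter ( letter | digit )*
[Transition diagram for id]
start: state 9
state 9, on letter: go to state 10
state 10, on letter or digit: stay in state 10
state 10, on other: state 11, return(gettoken(), install_id())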
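A transition diagram such as the one for relop can be hand-coded as nested tests on the next character. The following is a minimal sketch: the function name `relop`, the tuple it returns, and the exact character-to-state mapping in the comments are assumptions inferred from the return actions (LE, NE, LT, EQ, GE, GT) on the slide.

```python
def relop(text, pos=0):
    """Recognize a relational operator at text[pos].

    Returns (token, attribute, next_pos), or None if no relop starts here.
    """
    def peek(i):
        return text[i] if i < len(text) else ""   # "" stands for end of input

    c = peek(pos)
    if c == "<":                                   # state 0 -> state 1
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("relop", "LE", pos + 2)        # state 2
        if nxt == ">":
            return ("relop", "NE", pos + 2)        # state 3
        return ("relop", "LT", pos + 1)            # state 4: "other" is not consumed
    if c == "=":
        return ("relop", "EQ", pos + 1)            # state 5
    if c == ">":                                   # state 0 -> state 6
        if peek(pos + 1) == "=":
            return ("relop", "GE", pos + 2)        # state 7
        return ("relop", "GT", pos + 1)            # state 8: "other" is not consumed
    return None
```

The "other" edges read one character ahead to decide, but leave it in the input so that it can start the next token.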
7
Non-Deterministic Finite State Automata

An NFA is a 5-tuple (S, Σ, δ, s0, F) where
S is a finite set of states
Σ is a finite set of symbols, the alphabet
δ is a mapping from S × Σ to a set of states
s0 ∈ S is the start state
F ⊆ S is the set of accepting (or final) states

Example: S = {0,1,2,3}, Σ = {a,b}, s0 = 0, F = {3}
[Figure: transition graph of the example NFA over states 0 to 3, with edges labeled a and b]
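An NFA can be simulated by tracking the set of states it could be in after each input symbol. The sketch below assumes that the example NFA is the classic automaton for (a|b)*abb, which fits the state set, alphabet, and accepting state given above; the transition table `DELTA` is therefore an assumption. The helper does not handle ε-transitions.

```python
def nfa_accepts(delta, start, accepting, text):
    """Return True if the NFA accepts text (no epsilon-transitions supported)."""
    current = {start}
    for symbol in text:
        # Follow every possible transition from every state we might be in.
        current = {t for s in current for t in delta.get((s, symbol), set())}
    return bool(current & accepting)


# Assumed transition function: delta maps (state, symbol) to a set of states.
DELTA = {
    (0, "a"): {0, 1},   # nondeterministic choice on a
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}

print(nfa_accepts(DELTA, 0, {3}, "aabb"))
```

The pair `(0, "a")` maps to two states, which is exactly what makes the automaton non-deterministic.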
8
From a Regular Expression to an NFA
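[Figure: NFA constructions for each form of regular expression, each with start state i and accepting state f]
ε: an ε-transition from i to f
a: a transition on a from i to f
r1 | r2: ε-transitions from i into N(r1) and N(r2), and ε-transitions from each of them to f
r1 r2: N(r1) followed by N(r2), running from i to f
r*: N(r) between i and f, with ε-transitions that allow N(r) to be skipped or repeated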
9
Grammars

Context-free grammar has four components:
T is a finite set of tokens (terminal symbols)
N is a finite set of nonterminals
P is a finite set of productions of the form α → β, where α ∈ (N∪T)* N (N∪T)* and β ∈ (N∪T)*; in a context-free grammar the left-hand side α is a single nonterminal (α ∈ N)
S ∈ N is a designated start symbol

10
Context Free Grammars: BNF Notation

Regular expressions cannot describe nested constructs, but context-free grammars can
Backus-Naur Form (BNF) notation for productions:
<nonterminal> ::= sequence of (non)terminals
where
Each terminal in the grammar is also a token
A <nonterminal> defines a syntactic category
The symbol | denotes alternative forms in a production
The special symbol ε denotes empty

11
Example
<Program> ::= program <id> ( <id> <More_ids> ) ; <Block> .
<Block> ::= <Variables> begin <Stmt> <More_Stmts> end
<More_ids> ::= , <id> <More_ids>
             | ε
<Variables> ::= var <id> <More_ids> : <Type> ; <More_Variables>
              | ε
<More_Variables> ::= <id> <More_ids> : <Type> ; <More_Variables>
                   | ε
<Stmt> ::= <id> := <Exp>
         | if <Exp> then <Stmt> else <Stmt>
         | while <Exp> do <Stmt>
         | begin <Stmt> <More_Stmts> end
<More_Stmts> ::= ; <Stmt> <More_Stmts>
               | ε
<Exp> ::= <num>
        | <id>
        | <Exp> + <Exp>
        | <Exp> - <Exp>
12
Extended BNF

Extended BNF simplifies grammar definitions
Extended BNF adds
Optional constructs with [ and ]
Repetitions with [ ]*
Some EBNF definitions also add [ ]+ for non-zero repetitions
Any extended BNF grammar can be rewritten into BNF

13
Example
<Program> ::= program <id> ( <id> [ , <id> ]* ) ; <Block> .
<Block> ::= [ <Variables> ] begin <Stmt> [ ; <Stmt> ]* end
<Variables> ::= var [ <id> [ , <id> ]* : <Type> ; ]+
<Stmt> ::= <id> := <Exp>
         | if <Exp> then <Stmt> else <Stmt>
         | while <Exp> do <Stmt>
         | begin <Stmt> [ ; <Stmt> ]* end
<Exp> ::= <num>
        | <id>
        | <Exp> + <Exp>
        | <Exp> - <Exp>
14
Derivations

From a grammar we can derive strings (= sequences of tokens)
The opposite process of parsing
Starting with the grammar's designated start symbol, in each derivation step a nonterminal is replaced by a right-hand side of a production for that nonterminal

15
Example Derivation
<expression> ::= identifier
               | unsigned_integer
               | - <expression>
               | ( <expression> )
               | <expression> <operator> <expression>
<operator> ::= + | - | * | /

Derivation, beginning at the start symbol; each step ⇒ replaces a nonterminal with one of its productions:
<expression>
⇒ <expression> <operator> <expression>
⇒ <expression> <operator> identifier
⇒ <expression> + identifier
⇒ <expression> <operator> <expression> + identifier
⇒ <expression> <operator> identifier + identifier
⇒ <expression> * identifier + identifier
⇒ identifier * identifier + identifier
Each line is a sentential form
The final string is the yield
16
Rightmost versus Leftmost Derivations

When the nonterminal on the far right (left) in a sentential form is replaced in each derivation step, the derivation is called right-most (left-most)

Rightmost derivation (the rightmost nonterminal is replaced in each step):
<expression> ⇒ <expression> <operator> <expression> ⇒ <expression> <operator> identifier
Leftmost derivation (the leftmost nonterminal is replaced in each step):
<expression> ⇒ <expression> <operator> <expression> ⇒ identifier <operator> <expression>
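The difference between the two derivation orders can be replayed mechanically. This is a small sketch assuming the expression grammar of the earlier derivation example (with operators + - * /); the `step` helper and the list-of-symbols representation of a sentential form are illustrative choices, not part of the slides.

```python
E, OP = "<expression>", "<operator>"

# Assumed grammar: the expression grammar used in the derivation example.
GRAMMAR = {
    E: [["identifier"], ["unsigned_integer"], ["-", E], ["(", E, ")"], [E, OP, E]],
    OP: [["+"], ["-"], ["*"], ["/"]],
}


def step(form, rhs, leftmost):
    """Replace the leftmost (or rightmost) nonterminal of a sentential form by rhs."""
    positions = [i for i, sym in enumerate(form) if sym in GRAMMAR]
    if not positions:
        raise ValueError("no nonterminal left to replace")
    i = positions[0] if leftmost else positions[-1]
    if rhs not in GRAMMAR[form[i]]:
        raise ValueError(f"{rhs} is not a production of {form[i]}")
    return form[:i] + rhs + form[i + 1:]


first = step([E], [E, OP, E], leftmost=True)           # same first step in both orders
right_form = step(first, ["identifier"], leftmost=False)
left_form = step(first, ["identifier"], leftmost=True)
print(" ".join(right_form))
print(" ".join(left_form))
```

Both orders apply the same production here; only the position at which it is applied differs.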
17
A Language Generated by a Grammar

A context-free grammar is a generator of a
context-free language
The language defined by a grammar G is the set of all strings w that can be derived from the start symbol S:
L(G) = { w | S ⇒* w }

<S> ::= a | ( <S> )
L(G) = set of all strings a, (a), ((a)), (((a))), …

<S> ::= <B> | <C>
<B> ::= <C> <C>
<C> ::= 0 | 1
L(G) = { 00, 01, 10, 11, 0, 1 }
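For a grammar without recursion, such as the second example, L(G) is finite and can be enumerated by expanding every production. The sketch below does this; the dictionary encoding of the grammar and the `language` helper are assumptions made for illustration, and the approach would not terminate on a recursive grammar such as the first example.

```python
from itertools import product

# The second example grammar: S ::= B | C, B ::= C C, C ::= 0 | 1.
GRAMMAR = {
    "S": [["B"], ["C"]],
    "B": [["C", "C"]],
    "C": [["0"], ["1"]],
}


def language(symbol):
    """Return all terminal strings derivable from symbol (non-recursive grammars only)."""
    if symbol not in GRAMMAR:              # a terminal derives only itself
        return {symbol}
    result = set()
    for rhs in GRAMMAR[symbol]:
        # Combine every choice of string for each symbol on the right-hand side.
        for parts in product(*(language(s) for s in rhs)):
            result.add("".join(parts))
    return result


print(sorted(language("S")))
```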
18
Parse Trees

A parse tree depicts the end result of a
derivation
The internal nodes are the nonterminals
The children of a node are the symbols (terminals
and nonterminals) on a right-hand side of a
production
The leaves are the terminals

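[Figure: parse tree for identifier * identifier + identifier. The root <expression> has children <expression> <operator> <expression>; the left child expands again to <expression> <operator> <expression>; the leaves are the three identifiers and the operators * and +]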

19
Parse Trees
<expression>
⇒ <expression> <operator> <expression>
⇒ <expression> <operator> identifier
⇒ <expression> + identifier
⇒ <expression> <operator> <expression> + identifier
⇒ <expression> <operator> identifier + identifier
⇒ <expression> * identifier + identifier
⇒ identifier * identifier + identifier
[Figure: the parse tree built step by step from this derivation; the root <expression> has children <expression> <operator> <expression>, and the left child expands again to <expression> <operator> <expression>]

20
Ambiguity

There is another parse tree for the same grammar and input: the grammar is ambiguous
This parse tree is not desired, since it appears that + has precedence over *

[Figure: alternative parse tree for identifier * identifier + identifier, in which the right operand of * is the subtree for identifier + identifier]
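Ambiguity can be checked for a given input by counting its parse trees. The sketch below does so for a deliberately reduced version of the expression grammar, keeping only the productions `<expression> ::= identifier | <expression> <operator> <expression>`; that restriction, the operator set, and the function name `count_parse_trees` are assumptions of this example.

```python
from functools import lru_cache

OPERATORS = {"+", "-", "*", "/"}   # assumed operator set


def count_parse_trees(tokens):
    """Count parse trees under <expression> ::= identifier | <expression> <operator> <expression>."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of distinct trees deriving tokens[lo:hi] from <expression>.
        if hi - lo == 1:
            return 1 if tokens[lo] == "identifier" else 0
        total = 0
        for k in range(lo + 1, hi - 1):    # k: position of the operator at the root
            if tokens[k] in OPERATORS:
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))


print(count_parse_trees(["identifier", "*", "identifier", "+", "identifier"]))
```

A count greater than one for any input means the grammar is ambiguous; each choice of root operator corresponds to one of the competing trees.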

21
Ambiguous Grammars

Ambiguous grammar: more than one distinct derivation of a string results in different parse trees
A programming language construct should have only one parse tree to avoid misinterpretation by a compiler
For expression grammars, associativity and precedence of operators are used to disambiguate

<expression> ::= <term> | <expression> <add_op> <term>
<term> ::= <factor> | <term> <mult_op> <factor>
<factor> ::= identifier | unsigned_integer | - <factor> | ( <expression> )
<add_op> ::= + | -
<mult_op> ::= * | /
22
Ambiguous if-then-else: the Dangling Else

A classical example of an ambiguous grammar is the set of grammar productions for if-then-else:
<stmt> ::= if <expr> then <stmt>
         | if <expr> then <stmt> else <stmt>
It is possible to hack this into unambiguous productions for the same syntax, but the fact that it is not easy indicates a problem in the programming language design
Ada uses different syntax to avoid ambiguity:
<stmt> ::= if <expr> then <stmt> end if
         | if <expr> then <stmt> else <stmt> end if

23
Linear-Time Top-Down and Bottom-Up Parsing

A parser is a recognizer for a context-free
language
A string (token sequence) is accepted by the
parser and a parse tree can be constructed if the
string is in the language
For any arbitrary context-free grammar parsing can take as much as O(n^3) time, where n is the size of the input
There are large classes of grammars for which we can construct parsers that take O(n) time:
Top-down LL parsers for LL grammars (LL = Left-to-right scanning of input, Left-most derivation)
Bottom-up LR parsers for LR grammars (LR = Left-to-right scanning of input, Right-most derivation)

24
Top-Down Parsers and LL Grammars

Top-down parser is a parser for LL class of grammars
Also called predictive parser
LL class is a strict subset of the larger LR class of grammars
LL grammars cannot contain left-recursive productions (but LR can), for example <X> ::= <X> <Y> (direct left recursion) and <X> ::= <Y> <Z>, <Y> ::= <X> (indirect left recursion)
LL(k) where k is lookahead depth; if k=1, cannot handle alternatives in productions with common prefixes, e.g. <X> ::= a b | a c
A top-down parser constructs a parse tree from the root down
Not too difficult to implement a predictive parser for an unambiguous LL(1) grammar in BNF by hand using recursive descent

25
Top-Down Parser in Action
<id_list> ::= id <id_list_tail>
<id_list_tail> ::= , id <id_list_tail>
                 | ;
26
Top-Down Predictive Parsing

Top-down parsing is called predictive parsing because the parser "predicts" what it is going to see:
As root, the start symbol of the grammar <id_list> is predicted
After reading A the parser predicts that <id_list_tail> must follow
After reading , and B the parser predicts that <id_list_tail> must follow
After reading , and C the parser predicts that <id_list_tail> must follow
After reading ; the parser stops
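The prediction steps above can be traced with a small recursive descent parser. This sketch assumes the grammar `<id_list> ::= id <id_list_tail>` and `<id_list_tail> ::= , id <id_list_tail> | ;`, that is, a list terminated by a semicolon (the terminator is an assumption), and it treats every token other than `,` and `;` as an identifier.

```python
def parse_id_list(tokens):
    """Predictive parser for <id_list>; returns the identifiers in the list."""
    pos = 0
    names = []

    def kind(token):
        return token if token in {",", ";"} else "id"

    def match(expected_kind):
        nonlocal pos
        if pos >= len(tokens) or kind(tokens[pos]) != expected_kind:
            raise SyntaxError(f"expected {expected_kind} at token {pos}")
        pos += 1
        return tokens[pos - 1]

    def id_list():
        names.append(match("id"))
        id_list_tail()                      # after an id, predict <id_list_tail>

    def id_list_tail():
        if pos < len(tokens) and tokens[pos] == ",":
            match(",")                      # predicted: , id <id_list_tail>
            names.append(match("id"))
            id_list_tail()
        else:
            match(";")                      # predicted: ; ends the list

    id_list()
    if pos != len(tokens):
        raise SyntaxError("trailing input after id list")
    return names


print(parse_id_list(["A", ",", "B", ",", "C", ";"]))
```

One token of lookahead is enough to choose between the two alternatives of `<id_list_tail>`, since they start with different tokens.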

27
An Ambiguous Non-LL Grammar for Language E

Consider a language E of simple expressions composed of +, -, *, /, (), id, and num
Need operator precedence rules

<expr> ::= <expr> + <expr>
         | <expr> - <expr>
         | <expr> * <expr>
         | <expr> / <expr>
         | ( <expr> )
         | <id>
         | <num>
28
An Unambiguous Non-LL Grammar for Language E
<expr> ::= <expr> + <term>
         | <expr> - <term>
         | <term>
<term> ::= <term> * <factor>
         | <term> / <factor>
         | <factor>
<factor> ::= ( <expr> )
           | <id>
           | <num>
29
An Unambiguous LL(1) Grammar for Language E
<expr> ::= <term> <term_tail>
<term> ::= <factor> <factor_tail>
<term_tail> ::= <add_op> <term> <term_tail>
              | ε
<factor> ::= ( <expr> )
           | <id>
           | <num>
<factor_tail> ::= <mult_op> <factor> <factor_tail>
                | ε
<add_op> ::= + | -
<mult_op> ::= * | /
30
Constructing Recursive Descent Parsers for LL(1)

Each nonterminal has a function that implements the production(s) for that nonterminal
The function parses only the part of the input described by the nonterminal
<expr> ::= <term> <term_tail>
procedure expr()
  term(); term_tail()
When more than one alternative production exists for a nonterminal, the lookahead token should help to decide which production to apply
<term_tail> ::= <add_op> <term> <term_tail>
              | ε
procedure term_tail()
  case (input_token())
    of '+' or '-': add_op(); term(); term_tail()
    otherwise: /* no op = ε */

31
Some Rules to Construct a Recursive Descent Parser

For every nonterminal with more than one production, find all the tokens that each of the right-hand sides can start with (called the FIRST set)
<X> ::= a            starts with a
      | b a <Z>      starts with b
      | <Y>          starts with c or d
      | <Z> f        starts with e or f
<Y> ::= c | d
<Z> ::= e | ε
Empty productions are coded as "skip" operations (nops)
If a nonterminal does not have an empty production, the function should generate an error if no token matches
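The FIRST sets of the right-hand sides can be computed mechanically. The sketch below encodes the example grammar as read from the slide (`<X> ::= a | b a <Z> | <Y> | <Z> f`, `<Y> ::= c | d`, `<Z> ::= e | ε`); the dictionary encoding and the helper `first_of_sequence` are assumptions, and the simple recursion only terminates for grammars without left recursion.

```python
EPS = "ε"

# Example grammar from the slide, one list per right-hand side.
GRAMMAR = {
    "X": [["a"], ["b", "a", "Z"], ["Y"], ["Z", "f"]],
    "Y": [["c"], ["d"]],
    "Z": [["e"], [EPS]],
}


def first_of_sequence(symbols):
    """FIRST set of a right-hand side; contains EPS if the whole sequence can be empty."""
    result = set()
    for sym in symbols:
        if sym == EPS:
            continue
        if sym not in GRAMMAR:             # terminal: it starts every derived string
            result.add(sym)
            return result
        sym_first = set()
        for rhs in GRAMMAR[sym]:           # nonterminal: union over its productions
            sym_first |= first_of_sequence(rhs)
        result |= sym_first - {EPS}
        if EPS not in sym_first:           # sym cannot vanish, so later symbols are hidden
            return result
    result.add(EPS)                        # every symbol could derive the empty string
    return result


for rhs in GRAMMAR["X"]:
    print(" ".join(rhs), "starts with", sorted(first_of_sequence(rhs)))
```

The case `<Z> f` shows why empty productions matter: because `<Z>` can derive ε, the token f also belongs to the FIRST set.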

32
Example for E
procedure factor()
  case (input_token())
    of '(': match('('); expr(); match(')')
    of identifier: match(identifier)
    of number: match(number)
    otherwise: error
procedure add_op()
  case (input_token())
    of '+': match('+')
    of '-': match('-')
    otherwise: error
procedure mult_op()
  case (input_token())
    of '*': match('*')
    of '/': match('/')
    otherwise: error
procedure expr()
  term(); term_tail()
procedure term_tail()
  case (input_token())
    of '+' or '-': add_op(); term(); term_tail()
    otherwise: /* no op = ε */
procedure term()
  factor(); factor_tail()
procedure factor_tail()
  case (input_token())
    of '*' or '/': mult_op(); factor(); factor_tail()
    otherwise: /* no op = ε */
33
Recursive Descent Parser's Call Graph = Parse Tree

The dynamic call graph of a recursive descent parser corresponds exactly to the parse tree
Call graph of input string 1+2*3
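To make the correspondence between calls and parse tree concrete, the recursive descent procedures for E can be written so that every call returns a node named after its procedure. This is a sketch under assumptions: the nested-tuple tree format, the regex-based scanner, the end marker `$`, and the sample input `1+2*3` are all choices made for this illustration.

```python
import re


def parse_expr(text):
    """Recursive descent for the LL(1) grammar of E; returns the call tree as nested tuples."""
    # Simple scanner; characters outside these patterns are silently skipped.
    tokens = re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", text) + ["$"]
    pos = 0

    def peek():
        return tokens[pos]

    def match(expected=None):
        nonlocal pos
        tok = tokens[pos]
        if tok == "$" or (expected is not None and tok != expected):
            raise SyntaxError(f"unexpected token {tok!r}")
        pos += 1
        return tok

    def expr():
        return ("expr", term(), term_tail())

    def term_tail():
        if peek() in ("+", "-"):
            return ("term_tail", ("add_op", match()), term(), term_tail())
        return ("term_tail",)                      # epsilon production

    def term():
        return ("term", factor(), factor_tail())

    def factor_tail():
        if peek() in ("*", "/"):
            return ("factor_tail", ("mult_op", match()), factor(), factor_tail())
        return ("factor_tail",)                    # epsilon production

    def factor():
        if peek() == "(":
            match("(")
            inner = expr()
            match(")")
            return ("factor", "(", inner, ")")
        if peek() in ("+", "-", "*", "/", ")"):
            raise SyntaxError(f"unexpected token {peek()!r}")
        return ("factor", match())                 # id or num

    tree = expr()
    if peek() != "$":
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse_expr("1+2*3"))
```

In the resulting tree the `*` appears below the `+`, inside a `factor_tail` node, which reflects the higher precedence of the multiplicative operators in this grammar.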

34
Example
<type> ::= <simple>
         | ^ id
         | array [ <simple> ] of <type>
<simple> ::= integer
           | char
           | num dotdot num
35
Example (contd)
The FIRST sets
<type> ::= <simple>                        FIRST = { integer, char, num }
         | ^ id                            FIRST = { ^ }
         | array [ <simple> ] of <type>    FIRST = { array }
<simple> ::= integer                       FIRST = { integer }
           | char                          FIRST = { char }
           | num dotdot num                FIRST = { num }
36
Example (contd)
procedure simple()
  case (input_token())
    of integer: match(integer)
    of char: match(char)
    of num: match(num); match(dotdot); match(num)
    otherwise: error
procedure match(t : token)
  if input_token() = t then
    nexttoken()
  else
    error
procedure type()
  case (input_token())
    of integer or char or num: simple()
    of '^': match('^'); match(id)
    of array: match(array); match('['); simple(); match(']'); match(of); type()
    otherwise: error
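The procedures for type and simple translate almost line by line into Python. The following sketch assumes that tokens arrive as a list of strings, that `^`, `[` and `]` are the punctuation tokens of the grammar, and that `$` marks the end of input; the returned trace of matched tokens is an addition for illustration.

```python
def parse_type(tokens):
    """Recursive descent for <type>; returns the tokens matched, in order."""
    tokens = list(tokens) + ["$"]          # "$" marks end of input (assumption)
    pos = 0
    trace = []

    def match(t):
        nonlocal pos
        if tokens[pos] != t:
            raise SyntaxError(f"expected {t!r}, found {tokens[pos]!r}")
        trace.append(t)
        pos += 1

    def simple():
        look = tokens[pos]
        if look == "integer":
            match("integer")
        elif look == "char":
            match("char")
        elif look == "num":
            match("num")
            match("dotdot")
            match("num")
        else:
            raise SyntaxError(f"unexpected {look!r} in simple")

    def type_():
        look = tokens[pos]
        if look in ("integer", "char", "num"):   # FIRST set of <simple>
            simple()
        elif look == "^":
            match("^")
            match("id")
        elif look == "array":
            match("array")
            match("[")
            simple()
            match("]")
            match("of")
            type_()
        else:
            raise SyntaxError(f"unexpected {look!r} in type")

    type_()
    match("$")
    return trace[:-1]                      # drop the end marker


print(parse_type("array [ num dotdot num ] of integer".split()))
```

The order of entries in the trace is the order in which `match` is called while parsing `array [ num dotdot num ] of integer`.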
37
Step 1
Check lookahead and call match
Call tree so far: type() calls match(array)
Input (with lookahead marker): array [ num dotdot num ] of integer
38
Step 2
Call tree so far: type() calls match(array), match('[')
Input (with lookahead marker): array [ num dotdot num ] of integer
39
Step 3
Call tree so far: type() calls match(array), match('['), simple(); simple() calls match(num)
Input (with lookahead marker): array [ num dotdot num ] of integer
40
Step 4
Call tree so far: type() calls match(array), match('['), simple(); simple() calls match(num), match(dotdot)
Input (with lookahead marker): array [ num dotdot num ] of integer
41
Step 5
Call tree so far: type() calls match(array), match('['), simple(); simple() calls match(num), match(dotdot), match(num)
Input (with lookahead marker): array [ num dotdot num ] of integer
42
Step 6
Call tree so far: type() calls match(array), match('['), simple(), match(']'); simple() calls match(num), match(dotdot), match(num)
Input (with lookahead marker): array [ num dotdot num ] of integer
43
Step 7
Call tree so far: type() calls match(array), match('['), simple(), match(']'), match(of); simple() calls match(num), match(dotdot), match(num)
Input (with lookahead marker): array [ num dotdot num ] of integer
44
Step 8
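Call tree: type() calls match(array), match('['), simple(), match(']'), match(of), type(); the first simple() calls match(num), match(dotdot), match(num); the nested type() calls simple(), which calls match(integer)
Input (with lookahead marker): array [ num dotdot num ] of integer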
45
Bottom-Up LR Parsing

Bottom-up parser is a parser for LR class of
grammars
Difficult to implement by hand
Tools (e.g. Yacc/Bison) exist that generate
bottom-up parsers for LALR grammars automatically
LR parsing is based on shifting tokens on a stack
until the parser recognizes a right-hand side of
a production which it then reduces to a left-hand
side (nonterminal) to form a partial parse tree
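The shift and reduce steps can be illustrated on a tiny left-recursive grammar. The sketch below is not a table-driven LR parser of the kind Yacc/Bison generate: it assumes the hypothetical grammar `E ::= E + n | n` and decides when to reduce simply by inspecting the top of the stack, which happens to be sufficient for this grammar.

```python
def shift_reduce(tokens):
    """Shift-reduce recognizer for E ::= E + n | n; returns the list of parser actions."""
    stack, actions = [], []
    for tok in list(tokens) + ["$"]:       # "$" marks end of input (assumption)
        # Reduce while the top of the stack is a right-hand side of a production.
        while True:
            if stack[-3:] == ["E", "+", "n"]:
                del stack[-3:]
                stack.append("E")
                actions.append("reduce E ::= E + n")
            elif stack == ["n"]:
                stack[:] = ["E"]
                actions.append("reduce E ::= n")
            else:
                break
        if tok == "$":
            break
        stack.append(tok)                  # shift the next token onto the stack
        actions.append(f"shift {tok}")
    if stack != ["E"]:
        raise SyntaxError(f"cannot reduce stack {stack} to E")
    return actions


for action in shift_reduce(["n", "+", "n", "+", "n"]):
    print(action)
```

The left-recursive production causes no difficulty here, in contrast to the top-down case; each reduction corresponds to completing one node of the partial parse tree.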