Compiler Design Chapter 2 - PowerPoint PPT Presentation

About This Presentation

Title:

Compiler Design Chapter 2

Description:

Rule priority For the longest initial substring, the first regular expression (in terms of the order in the list of rules) that can match determine token type. ... – PowerPoint PPT presentation

Slides: 43

Provided by: jian9

Transcript and Presenter's Notes

Title: Compiler Design Chapter 2

1
Compiler Design - Chapter 2
Lexical Analysis
2
Lexical Analysis
3
Analysis

Program translation: from one language into another
Analysis: pull the program apart to understand it (compiler front end)
Synthesis: put it together in a different way (compiler back end)

4
Stages of Analysis

Lexical analysis: breaking the input up into individual words/tokens
Syntax analysis: parsing the phrase structure of the program
Semantic analysis: calculating the program's meaning

5
Lexical Analyzer

Input is a stream of characters
Produces a stream of names, keywords, and punctuation marks
Discards white space and comments

6
Lexical Tokens
Lexical token: a sequence of characters that can be treated as a unit in the grammar of a programming language

Tokens such as IF, VOID, and RETURN are reserved words constructed from alphabetic characters
They cannot be used as identifiers

7
Non-Token Examples
8
Semantic Values in Lexical Analysis

Report semantic values attached to identifiers
and literals

Semantic values
Token types
9
Lexical Tokens

Description of the lexical tokens for C/Java identifiers:
An identifier is a sequence of letters and digits
The underscore _ counts as a letter
Upper- and lowercase letters are different
If the input stream has been parsed into tokens up to a given character, the next token is the longest string of characters that could possibly constitute a token
Blanks, tabs, newlines, and comments are ignored except when they separate tokens
Some white space is required to separate adjacent identifiers, keywords, and constants

10
Regular Expressions

A language is a set of strings
A string is a finite sequence of symbols
The symbols are taken from a finite alphabet (usually the ASCII character set)
Use regular expressions to specify lexical tokens
Use deterministic finite automata to implement the lexer

11
Regular Expressions

Each regular expression stands for a set of strings.
Symbol: for each symbol a in the alphabet, the regular expression a denotes the language containing just the string a
Alternation:
Given regular expressions M and N, M | N is a new regular expression
A string is in the language of M | N if it is in the language of M or in the language of N
Example: the language of a | b contains the strings a and b

12
Regular Expressions

Concatenation:
Given regular expressions M and N, M · N is a new regular expression
A string is in the language of M · N if it is the concatenation of two strings α and β such that α is in the language of M and β is in the language of N
Example: the language of (a | b) · a contains the strings aa and ba

13
Regular Expressions

Epsilon:
The regular expression ε represents a language whose only string is the empty string.
Example: (a · b) | ε represents the language {"", "ab"}
Repetition:
Given a regular expression M, its Kleene closure is M*
A string is in M* if it is the concatenation of zero or more strings, all of which are in M
Example: ((a | b) · a)* represents the infinite set {"", "aa", "ba", "aaaa", "baaa", "aaba", "baba", "aaaaaaaa", ...}
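To make the "set of strings" view concrete, here is a minimal sketch that enumerates a finite slice of a Kleene closure. The function name kleene_up_to and the cutoff max_reps are illustrative assumptions, not part of the slides; a real closure is infinite, so the sketch only builds the strings obtained from a bounded number of repetitions.

```python
from itertools import product

def kleene_up_to(base, max_reps):
    """Strings formed by concatenating 0..max_reps members of `base`.

    This is a finite approximation of the Kleene closure of `base`.
    """
    out = set()
    for n in range(max_reps + 1):
        for combo in product(sorted(base), repeat=n):
            out.add("".join(combo))
    return out

# Language of (a | b) · a, built from alternation and concatenation.
m = {x + "a" for x in {"a", "b"}}

# Zero, one, or two repetitions of a string from m.
print(sorted(kleene_up_to(m, 2), key=lambda s: (len(s), s)))
```

Zero repetitions contribute the empty string, which is why it always belongs to a Kleene closure.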

14
Regular Expressions

Using symbols, alternation, concatenation, epsilon, and Kleene closure, we can specify the sets of ASCII characters that correspond to the lexical tokens of a language. Examples:
(0 | 1)* · 0: binary numbers that are multiples of two
b*(abb*)*(a | ε): strings of a's and b's with no consecutive a's
(a | b)*aa(a | b)*: strings of a's and b's containing consecutive a's
Notation:
The concatenation symbol · or the epsilon is sometimes omitted
Kleene closure binds tighter than concatenation
Concatenation binds tighter than alternation
Examples:
ab | c means (a · b) | c
(a | ) means (a | ε)
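The three example languages can be checked mechanically. The sketch below transcribes them into Python's re syntax, where concatenation is implicit and (a | ε) is written a?. The helper in_language is a hypothetical name introduced here for illustration.

```python
import re

# The slide's examples in Python `re` syntax.
multiples_of_two = re.compile(r"[01]*0")          # (0 | 1)* · 0
no_consecutive_as = re.compile(r"b*(abb*)*a?")    # b*(abb*)*(a | ε)
has_consecutive_as = re.compile(r"[ab]*aa[ab]*")  # (a | b)*aa(a | b)*

def in_language(pattern, s):
    """True if the whole string s belongs to the language of pattern."""
    return pattern.fullmatch(s) is not None

print(in_language(multiples_of_two, "110"))     # binary 6
print(in_language(no_consecutive_as, "abab"))
print(in_language(has_consecutive_as, "baab"))
```

fullmatch is used rather than match or search because membership in a language is a statement about the entire string, not about a prefix or substring.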

15
Regular Expressions

More abbreviations:
[abcd] means (a | b | c | d)
[b-g] means [bcdefg]
[b-gM-Qkr] means [bcdefgMNOPQkr]
M? means (M | ε)
M+ means (M · M*)

16
Regular Expression Notation
comments
17
Disambiguation Rules

Does if8 match as a single identifier or as the
two tokens if and 8?
Does the string if 89 begin with an identifier or
a reserved word?

Longest match: the longest initial substring of the input that can match any regular expression is taken as the next token
Rule priority: for a particular longest initial substring, the first regular expression (in the order of the list of rules) that can match determines its token type.
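The two rules can be sketched as a small function. This is only an illustration under stated assumptions: the rule list RULES, its three token types, and the function next_token are hypothetical, and each pattern is simple enough that Python's re.match already returns the longest prefix it can match.

```python
import re

# Hypothetical rule list; its order encodes rule priority (IF before ID).
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[a-z][a-z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
]

def next_token(text, pos=0):
    """Return (token_type, lexeme) for the token starting at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Longest match: a strictly longer lexeme replaces the current best.
        # Rule priority: on a tie, the earlier rule is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("if8"))    # one identifier, by longest match
print(next_token("if 89"))  # reserved word, by rule priority
```

For the input if8 the ID rule matches three characters and wins on length; for if 89 both IF and ID match two characters, and IF wins because it is listed first.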

18
Finite Automata

A formalism that can be implemented as a computer program.
A finite automaton has:
A finite set of states
Edges leading from one state to another
A symbol labeling each edge
One state that is the start state
Possibly more than one final state

19
Finite Automata
20
Deterministic Finite Automata (DFA)

No two (or more) edges leading from the same state are labeled with the same symbol
A DFA accepts or rejects a string as follows:
Starting at the start state, for each character in the input string the automaton follows exactly one edge to get to the next state
The edge must be labeled with the input character
After making n transitions for an n-character string, if the automaton is in a final state, it accepts the string
If it is not in a final state, or if at any point there is no appropriately labeled edge to follow, it rejects the string
The language recognized by an automaton is the set of strings that it accepts.

21
Deterministic Finite Automata (DFA)
Accepts

Any string in the language recognized by automaton ID begins with a letter
Any single letter leads to state 2, a final state, so it is accepted
From state 2, any single letter or digit leads back to state 2
So a string consisting of a letter followed by any number of letters and digits is also accepted
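The acceptance procedure for the identifier automaton can be sketched directly. States 1 (start) and 2 (final) follow the description above; the function names step and accepts, and the restriction to ASCII letters and digits, are assumptions made for this illustration.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)
START, FINAL_STATES = 1, {2}

def step(state, ch):
    """Follow the edge labeled ch out of state, or return None if none exists."""
    if state == 1 and ch in LETTERS:
        return 2
    if state == 2 and (ch in LETTERS or ch in DIGITS):
        return 2
    return None

def accepts(s):
    """Run the DFA over s; accept only if it ends in a final state."""
    state = START
    for ch in s:
        state = step(state, ch)
        if state is None:  # no appropriately labeled edge: reject
            return False
    return state in FINAL_STATES

print(accepts("x1"), accepts("1x"), accepts(""))
```

Because the automaton is deterministic, step returns at most one next state, so no backtracking is ever needed.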

22
Combined Finite Automata

Each final state is labeled with the token type it accepts
State 2 can lead to ID or IF; rule priority resolves this
State 3 is labeled IF: this token must be a reserved word, not an identifier

23
Transition Matrix
State 0: dead state (no edges)
24
Recognizing the Longest Match

To find the longest match (the longest initial substring of the input that is a valid token), the lexical analyzer must:
Interpret transitions
Keep track of the longest match so far, and the position of that match
Keeping track of the longest match:
Remember the last time a final state was reached, using two variables (updated whenever a final state is reached)
Last-Final: the state number of the most recent final state
Input-Position-at-Last-Final
When a dead state (a nonfinal state with no output transitions) is reached, the variables tell which token was matched and where it ended
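A minimal sketch of this bookkeeping follows. The combined automaton here is hypothetical (five states recognizing IF, ID, and NUM, with 0 as the dead state) and is not the transition matrix from the slides; only the two tracking variables are taken from the text. Input is assumed to be ASCII.

```python
# Hypothetical combined DFA: state 1 is the start state, 0 is the dead state.
FINAL = {2: "ID", 3: "IF", 4: "ID", 5: "NUM"}

def trans(state, ch):
    """Next state for (state, ch); 0 means the dead state."""
    if state == 1:
        if ch == "i":
            return 2
        if ch.isalpha():
            return 4
        if ch.isdigit():
            return 5
    elif state == 2:
        if ch == "f":
            return 3
        if ch.isalnum():
            return 4
    elif state in (3, 4):
        if ch.isalnum():
            return 4
    elif state == 5:
        if ch.isdigit():
            return 5
    return 0

def longest_match(text, start=0):
    """Return (token_type, lexeme) for the longest token at start, or None."""
    state, pos = 1, start
    last_final = None            # Last-Final
    pos_at_last_final = start    # Input-Position-at-Last-Final
    while pos < len(text):
        state = trans(state, text[pos])
        if state == 0:           # dead state: stop scanning
            break
        pos += 1
        if state in FINAL:       # remember the most recent final state
            last_final, pos_at_last_final = state, pos
    if last_final is None:
        return None
    return FINAL[last_final], text[start:pos_at_last_final]

print(longest_match("if8 x"))
```

The scanner may read past the end of the token before it hits the dead state; the saved input position is what lets it resume the next token at the right place.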

25
Finding Longest Match
26
Finding Longest Match
27
Non-deterministic Finite Automata (NFA)

NFA: there may be a choice of edges labeled with the same symbol leading out of a state

Example
Another example (same language)

This NFA recognizes the set of all strings of a's whose length is a multiple of two or three

28
Converting a Regular Expression to an NFA
NFA with a tail (start edge) and a head (ending
state)
Regular Expression a
Regular Expression ab
Regular Expression M
29
Translation of Regular Expressions to NFAs
30
Four Regular Expressions Translated to NFA
31
NFA to DFA

Implementing an NFA as a computer program is harder than implementing a DFA
Computers cannot guess between alternatives!
To avoid guessing, we have to try every possibility!
Without eating the first character of the input, the only states reachable from state 1 are 1, 4, 9, and 14.

This set is the ε-closure of 1
32
Example: NFA on the string "in"
ε-closure of {1} = {1, 4, 9, 14} (without eating the first char)
eat char i: {2, 5, 15}
ε-closure of {5} = {5, 6, 8}, giving {2, 5, 6, 8, 15}
eat char n: {7}
ε-closure of {7} = {6, 7, 8}
final state 8 → ID(in)
33
Closure
edge(s, c): the set of all NFA states reachable by following a single edge from state s with label c
For a set of states S, closure(S) is the smallest set T such that T = S ∪ (union of edge(s, ε) over all s in T)
That is, the set of states that can be reached from a state in S without consuming any of the input, i.e., by going only through ε-edges.
34
Closure
Calculate T by iteration: start with T = S and keep adding edge(s, ε) for every s in T until T stops changing
T can only grow in each iteration (the final T includes S). The algorithm must terminate, since there are only a finite number of distinct states in the NFA.
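The iteration can be sketched as a simple fixed-point loop. The dictionary EPS below is an assumed fragment of ε-edges, chosen only so that the results agree with the closures listed in the worked example on the string "in"; the original NFA diagram is not available in this transcript.

```python
def closure(states, eps_edges):
    """Smallest set T that contains `states` and is closed under ε-edges."""
    t = set(states)
    changed = True
    while changed:               # repeat until T stops growing
        changed = False
        for s in list(t):
            for nxt in eps_edges.get(s, ()):
                if nxt not in t:
                    t.add(nxt)
                    changed = True
    return t

# Assumed ε-edges (state -> states reachable by one ε-edge).
EPS = {1: {4, 9, 14}, 5: {8}, 7: {8}, 8: {6}}

print(sorted(closure({1}, EPS)))
print(sorted(closure({5}, EPS)))
```

Each pass either adds at least one new state or ends the loop, so the number of passes is bounded by the number of NFA states.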
35
DFAedge(d, c)
By starting in a set of states d = {si, sk, sl} and eating the input symbol c, we reach a new set of NFA states:
DFAedge(d, c) = closure(union of edge(s, c) over all s in d)

s1: start state
input string: c1, ..., ck
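Simulating the NFA then amounts to starting from closure({s1}) and applying DFAedge once per input character. The sketch below does this on an assumed NFA fragment: the tables EPS and EDGES are hypothetical and were chosen only so that the sets match the worked trace on the string "in"; they are not a full transcription of the slide's automaton.

```python
import string

LOWER = set(string.ascii_lowercase)
DIGITS = set(string.digits)

# Assumed NFA fragment: ε-edges and labeled edges (source, test, target).
EPS = {1: {4, 9, 14}, 5: {8}, 7: {8}, 8: {6}}
EDGES = [
    (1, lambda c: c == "i", 2),
    (2, lambda c: c == "f", 3),
    (4, lambda c: c in LOWER, 5),
    (6, lambda c: c in LOWER or c in DIGITS, 7),
    (9, lambda c: c in DIGITS, 10),
    (14, lambda c: True, 15),    # "any character" edge
]

def edge(s, c):
    """NFA states reachable from s by a single edge labeled c."""
    return {dst for src, test, dst in EDGES if src == s and test(c)}

def closure(states):
    """States reachable from `states` through ε-edges only."""
    t, work = set(states), list(states)
    while work:
        for nxt in EPS.get(work.pop(), ()):
            if nxt not in t:
                t.add(nxt)
                work.append(nxt)
    return t

def dfa_edge(d, c):
    """closure of the union of edge(s, c) over all s in d."""
    return closure(set().union(*(edge(s, c) for s in d)))

def simulate(text, start=1):
    """Set of NFA states the automaton can be in after reading text."""
    d = closure({start})
    for c in text:
        d = dfa_edge(d, c)
    return d

print(sorted(simulate("in")))
```

Tracking a set of states replaces guessing: every alternative the NFA could have chosen is carried along in d.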

36
Example: NFA on the string "in"
ε-closure of {1} = {1, 4, 9, 14} (without eating the first char)
eat char i: {2, 5, 15}
ε-closure of {5} = {5, 6, 8}, giving {2, 5, 6, 8, 15}
eat char n: {7}
ε-closure of {7} = {6, 7, 8}
final state 8 → ID(in)
37
DFA Construction
if
DFA state 1: a set of NFA states

Each set of NFA states corresponds to one DFA state

The DFA has at most 2^n states, since the NFA has a finite number n of states
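A minimal sketch of this construction follows, generating only the DFA states that are actually reachable (usually far fewer than 2^n). The example NFA is an assumption: it accepts strings of a's whose length is a multiple of two or three, with a state numbering invented here because the original diagram is not in the transcript. The names build_dfa and dfa_accepts are likewise illustrative.

```python
def closure(states, eps):
    """States reachable from `states` through ε-edges, as a frozenset."""
    t, work = set(states), list(states)
    while work:
        for nxt in eps.get(work.pop(), ()):
            if nxt not in t:
                t.add(nxt)
                work.append(nxt)
    return frozenset(t)

def build_dfa(start, alphabet, delta, eps):
    """Subset construction. Returns (start_state, trans, states), where each
    DFA state is a frozenset of NFA states and trans maps (state, symbol)
    to the next DFA state. Only reachable states are generated."""
    d0 = closure({start}, eps)
    trans, seen, work = {}, {d0}, [d0]
    while work:
        d = work.pop()
        for c in alphabet:
            moved = set()
            for s in d:
                moved |= delta.get((s, c), set())
            e = closure(moved, eps)
            if not e:
                continue          # no edge: implicit dead state
            trans[(d, c)] = e
            if e not in seen:
                seen.add(e)
                work.append(e)
    return d0, trans, seen

# Assumed NFA: 0 is the start state; 1 and 3 are final.
EPS = {0: {1, 3}}
DELTA = {(1, "a"): {2}, (2, "a"): {1},
         (3, "a"): {4}, (4, "a"): {5}, (5, "a"): {3}}
NFA_FINAL = {1, 3}

def dfa_accepts(text):
    d, trans, _ = build_dfa(0, "a", DELTA, EPS)
    for c in text:
        d = trans.get((d, c))
        if d is None:
            return False
    return bool(d & NFA_FINAL)    # final if any NFA state in d is final

print(len(build_dfa(0, "a", DELTA, EPS)[2]))
```

Using frozenset for DFA states makes them hashable, so they can serve directly as dictionary keys in the transition table.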

38
State Tree
39
NFA converted to DFA

A state d is final in the DFA if any NFA state in states[d] is final in the NFA.
When several NFA states in states[d] are final, label d using rule priority.

40
Equivalent States
Two states s1 and s2 are equivalent when the machine starting in s1 accepts a string σ if and only if the machine starting in s2 accepts σ
Example:
{5, 6, 8, 15} and {6, 7, 8}; {10, 11, 13, 15} and {11, 12, 13}
How to find equivalent states?
s1 and s2 are equivalent if they are both final or both nonfinal and, for any symbol c, trans[s1, c] = trans[s2, c].
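The test above is easy to code, with the caveat that it is a sufficient condition only: two states can be equivalent even when it fails. The three-state table TRANS, the function names, and the merge step are hypothetical illustrations, not taken from the slides.

```python
def looks_equivalent(s1, s2, trans, final, alphabet):
    """Sufficient (not necessary) equivalence test from the text."""
    if (s1 in final) != (s2 in final):
        return False
    return all(trans.get((s1, c)) == trans.get((s2, c)) for c in alphabet)

def merge(keep, drop, trans):
    """Redirect every edge into `drop` to `keep` and delete `drop`'s edges."""
    return {(s, c): (keep if t == drop else t)
            for (s, c), t in trans.items() if s != drop}

# Hypothetical DFA: p and q both go to r on x; r loops on x and is final.
TRANS = {("p", "x"): "r", ("q", "x"): "r", ("r", "x"): "r"}
FINAL = {"r"}

print(looks_equivalent("p", "q", TRANS, FINAL, "x"))
print(merge("p", "q", TRANS))
```

A full minimization algorithm would refine partitions of states instead, which also finds equivalent pairs that this pairwise test misses.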
41
Equivalent States
trans[2, a] ≠ trans[4, a]
Are states 2 and 4 equivalent?
42
JavaCC
