Compiler Design Chapter 2 - PowerPoint PPT Presentation


Slides: 43
Provided by: jian9

Transcript and Presenter's Notes


1
Compiler Design - Chapter 2
Lexical Analysis
2
Lexical Analysis
3
Analysis
  • Program translation: from one language into
    another
  • Analysis: pull the program apart to understand
    it (compiler front end)
  • Synthesis: put it together in a different
    way (compiler back end)

4
Stages of Analysis
  • Lexical analysis: breaking the input up into
    individual words (tokens)
  • Syntax analysis: parsing the phrase structure
    of the program
  • Semantic analysis: calculating the program's
    meaning

5
Lexical Analyzer
  • Input is a stream of characters
  • Produces a stream of names, keywords, and
    punctuation marks
  • Discards white space and comments
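
As a rough sketch of this input/output behaviour, the following assumes a toy language with C-style comments. The token names, the keyword list, and the patterns are illustrative assumptions, not taken from the slides. Python's `re` alternation picks the first alternative that matches rather than the longest one, so the keyword rule relies on word boundaries to avoid splitting a name such as `if8`.

```python
import re

# Hypothetical token rules for a toy language (not from the slides).
TOKEN_RULES = [
    ("COMMENT", r"/\*.*?\*/"),               # discarded
    ("WHITESPACE", r"\s+"),                  # discarded
    ("KEYWORD", r"\b(?:if|void|return)\b"),
    ("NAME", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("NUMBER", r"[0-9]+"),
    ("PUNCT", r"[(){};=+\-*/<>,]"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES),
    re.DOTALL,  # lets a comment span several lines
)

def tokenize(text):
    """Turn a character stream into a list of (token_type, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        if m.lastgroup not in ("COMMENT", "WHITESPACE"):
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Calling `tokenize("if (x1) return 42; /* done */")` yields keyword, punctuation, name, and number tokens only; the white space and the comment leave no trace in the output.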

6
Lexical Tokens
Lexical token: a sequence of characters that can be
treated as a unit in the grammar of a programming
language
  • Tokens such as IF, VOID, RETURN are reserved
    words constructed from alphabetic characters
  • Reserved words cannot be used as identifiers

7
Examples of Non-Tokens
8
Semantic Values in Lexical Analysis
  • Report semantic values attached to identifiers
    and literals

Semantic values
Token types
9
Lexical Tokens
  • Description of the lexical tokens of C/Java
    identifiers
  • Identifier: a sequence of letters and digits;
    the first character must be a letter
  • The underscore _ counts as a letter
  • Upper- and lowercase letters are different
  • If the input stream has been parsed into tokens
    up to a given character, the next token is the
    longest string of characters that could possibly
    constitute a token
  • Blanks, tabs, newlines, and comments are ignored
    except where they separate tokens
  • Some white space is required to separate
    otherwise adjacent identifiers, keywords, and
    constants

10
Regular Expressions
  • A language is a set of strings
  • A string is a finite sequence of symbols
  • The symbols are taken from a finite alphabet
    (usually the ASCII character set)
  • Use regular expressions to specify lexical tokens
  • Use deterministic finite automata to implement
    the lexer

11
Regular Expressions
  • Each regular expression stands for a set of
    strings.
  • Symbol: for each symbol a in the alphabet,
    the regular expression a denotes the language
    containing just the string a
  • Alternation
  • Given regular expressions M and N
  • M | N is a new regular expression
  • A string is in the language of M | N if it is in
    the language of M or in the language of N
  • Example: the language of a | b contains the
    strings a and b

12
Regular Expressions
  • Concatenation
  • Given regular expressions M and N
  • M · N is a new regular expression
  • A string is in the language of M · N if it is
    the concatenation of two strings α and β such
    that α is in the language of M and β is in
    the language of N
  • Example: the language of (a | b) · a contains
    the strings aa and ba

13
Regular Expressions
  • Epsilon
  • The regular expression ε represents a language
    whose only string is the empty string.
  • Example: (a · b) | ε represents the language
    {"", "ab"}
  • Repetition
  • Given a regular expression M, its Kleene closure
    is M*
  • A string is in M* if it is the concatenation of
    zero or more strings, all of which are in M
  • Example: ((a | b) · a)* represents the infinite
    set {"", "aa", "ba", "aaaa", "baaa", "aaba",
    "baba", "aaaaaaaa", ...}

14
Regular Expressions
  • Using symbols, alternation, concatenation,
    epsilon, and Kleene closure, we can specify the
    sets of ASCII strings that form the tokens of a
    language. Examples:
  • (0 | 1)* · 0: binary numbers that are multiples
    of two
  • b*(abb*)*(a | ε): strings of a's and b's with no
    consecutive a's
  • (a | b)*aa(a | b)*: strings of a's and b's
    containing consecutive a's
  • Notation
  • The concatenation symbol · or the epsilon ε is
    sometimes omitted
  • Kleene closure binds tighter than
    concatenation
  • Concatenation binds tighter than alternation.
    Examples:
  • ab | c means (a · b) | c; (a | ) means (a | ε)
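
The three example expressions can be tried out directly. The sketch below assumes Python's `re` module as a stand-in for the slide notation: concatenation is written by juxtaposition, and the empty alternative in `(a|)` plays the role of ε. The constant and function names are made up for this illustration.

```python
import re

# Slide notation translated to Python regex syntax (names are illustrative).
EVEN_BINARY = re.compile(r"(0|1)*0")              # (0 | 1)* · 0
NO_CONSECUTIVE_AS = re.compile(r"b*(abb*)*(a|)")  # b*(abb*)*(a | ε)
HAS_CONSECUTIVE_AS = re.compile(r"(a|b)*aa(a|b)*")

def in_language(regex, s):
    """True if the whole string s belongs to the language of regex."""
    return regex.fullmatch(s) is not None

for s in ["0", "110", "101", "babab", "baab"]:
    print(s,
          in_language(EVEN_BINARY, s),
          in_language(NO_CONSECUTIVE_AS, s),
          in_language(HAS_CONSECUTIVE_AS, s))
```

`fullmatch` is used rather than `match` because membership in a language concerns the entire string, not just a prefix of it.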

15
Regular Expressions
  • More Abbreviations
  • [abcd] means (a | b | c | d)
  • [b-g] means [bcdefg]
  • [b-gM-Qkr] means [bcdefgMNOPQkr]
  • M? means (M | ε)
  • M+ means (M · M*)
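
One way to convince yourself that an abbreviation and its expansion describe the same language is to compare them on every short string. The helper below is a hypothetical checker written for this illustration; it only tests strings up to a fixed length, so agreement is evidence rather than proof.

```python
import re
from itertools import product

def same_language(p, q, alphabet, max_len=3):
    """Compare two patterns on every string over alphabet up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if (re.fullmatch(p, s) is None) != (re.fullmatch(q, s) is None):
                return False
    return True

print(same_language(r"[abcd]", r"(a|b|c|d)", "abcde"))
print(same_language(r"[b-g]", r"[bcdefg]", "abcdefgh"))
print(same_language(r"(ab)?", r"(ab|)", "ab"))        # M? versus (M | ε)
print(same_language(r"(ab)+", r"(ab)(ab)*", "ab", 4))  # M+ versus (M · M*)
```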

16
Regular Expression Notation
comments
17
Disambiguation Rules
  • Does if8 match as a single identifier or as the
    two tokens if and 8?
  • Does the string if 89 begin with an identifier or
    a reserved word?
  • Longest match: the longest initial substring of
    the input that can match any regular expression
    is taken as the next token
  • Rule priority: for a particular longest initial
    substring, the first regular expression (in the
    order of the list of rules) that can match
    determines its token type
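
The two rules can be sketched as a small function that tries every rule at the current position. The rule list below (IF before ID before NUM) is an assumed example, and the sketch leans on the fact that for these simple, greedy patterns `regex.match` already returns each rule's longest match; that is not true of arbitrary Python patterns.

```python
import re

# Hypothetical rule list; the order encodes rule priority.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[a-z][a-z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
]

def next_token(text, pos=0):
    """Return (token_type, lexeme) for the next token at pos, or None."""
    best = None  # (length, token_type) of the best candidate so far
    for token_type, regex in RULES:
        m = regex.match(text, pos)
        if m is None or m.end() == pos:
            continue
        length = m.end() - pos
        # Strictly longer wins (longest match); on a tie the earlier
        # rule is kept (rule priority).
        if best is None or length > best[0]:
            best = (length, token_type)
    if best is None:
        return None
    length, token_type = best
    return token_type, text[pos:pos + length]
```

With these rules, `next_token("if8")` gives `("ID", "if8")` because the identifier match is longer, while `next_token("if 89")` gives `("IF", "if")` because both IF and ID match two characters and IF is listed first.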

18
Finite Automata
  • A formalism that can be implemented as a computer
    program
  • A finite automaton has:
  • A finite set of states
  • Edges that lead from one state to another
  • Each edge is labeled with a symbol
  • One state is the start state
  • One or more states are final states

19
Finite Automata
20
Deterministic Finite Automata (DFA)
  • No two (or more) edges leading from the same
    state are labeled with the same symbol
  • A DFA accepts or rejects a string as follows:
  • Starting at the start state, for each character
    in the input string the automaton follows exactly
    one edge to get to the next state
  • The edge must be labeled with the input character
  • After making n transitions for an n-character
    string, if the automaton is in a final state,
    it accepts the string
  • If it is not in a final state, or if at any point
    there is no appropriately labeled edge to follow,
    it rejects the string
  • The language recognized by an automaton is the
    set of strings that it accepts.

21
Deterministic Finite Automata (DFA)
Accepts
  • Any string in the language recognized by
    automaton ID begins with a letter
  • Any single letter leads to state 2, which is
    final, so a single-letter string is accepted
  • From state 2, any single letter or digit leads
    back to state 2
  • A string consisting of a letter followed by any
    number of letters and digits is therefore also
    accepted
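
The acceptance procedure can be sketched on this identifier automaton. The sketch assumes the start state is numbered 1 and the single final state 2, and stores transitions in a dictionary keyed by (state, character); a missing key stands for "no edge". The helper names are invented for this example.

```python
import string

LETTERS = string.ascii_letters + "_"   # underscore counts as a letter
DIGITS = string.digits

def make_id_dfa():
    """Build the two-state identifier automaton (state numbers assumed)."""
    trans = {}
    for ch in LETTERS:
        trans[(1, ch)] = 2
        trans[(2, ch)] = 2
    for ch in DIGITS:
        trans[(2, ch)] = 2
    return trans, 1, {2}

def dfa_accepts(trans, start, finals, s):
    """Follow exactly one edge per character; reject if an edge is missing."""
    state = start
    for ch in s:
        if (state, ch) not in trans:
            return False
        state = trans[(state, ch)]
    return state in finals

TRANS, START, FINALS = make_id_dfa()
for word in ["x", "if8", "8if", ""]:
    print(repr(word), dfa_accepts(TRANS, START, FINALS, word))
```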

22
Combined Finite Automata
  • Each final state is labeled with the token type
    it accepts
  • State 2 can lead to ID or IF; rule priority
    resolves this
  • State 3 is labeled IF: this token must be a
    reserved word, not an identifier

23
Transition Matrix
State 0 is the dead state; it encodes the absence of an edge
24
Recognizing the Longest Match
  • To find the longest match (the longest initial
    substring of the input that is a valid token),
    the lexical analyzer must:
  • Interpret transitions
  • Keep track of the longest match so far, and the
    position of that match
  • Keeping track of the longest match
  • Remember the last time a final state was reached,
    using two variables (updated whenever a final
    state is reached)
  • Last-Final: the most recent final state reached
  • Input-Position-at-Last-Final: the input position
    at that moment
  • When a dead state (a nonfinal state with no
    output transitions) is reached, the variables
    tell which token was matched and where it ended
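
The bookkeeping above can be sketched with a hand-written combined automaton for IF, ID, and NUM. The state numbering and the `step` function are assumptions made for this example (state 1 is the start state, state 0 the dead state); only the two tracking variables come from the slide.

```python
import string

LETTERS = string.ascii_lowercase
DIGITS = string.digits
# Final states of the assumed combined automaton and their token types.
FINAL = {2: "ID", 3: "IF", 4: "ID", 5: "NUM"}

def step(state, ch):
    """Transition function; returns 0 (the dead state) when no edge exists."""
    alnum = ch in LETTERS or ch in DIGITS
    if state == 1:
        if ch == "i":
            return 2
        if ch in LETTERS:
            return 4
        if ch in DIGITS:
            return 5
    elif state == 2:
        if ch == "f":
            return 3
        if alnum:
            return 4
    elif state in (3, 4) and alnum:
        return 4
    elif state == 5 and ch in DIGITS:
        return 5
    return 0

def longest_match(text, pos):
    """Return (token_type, lexeme) for the longest token at pos, or None."""
    state = 1
    last_final = 0                      # Last-Final
    input_position_at_last_final = pos  # Input-Position-at-Last-Final
    i = pos
    while i < len(text):
        state = step(state, text[i])
        if state == 0:                  # dead state: stop scanning
            break
        i += 1
        if state in FINAL:
            last_final = state
            input_position_at_last_final = i
    if last_final == 0:
        return None
    return FINAL[last_final], text[pos:input_position_at_last_final]
```

On the input `if8 x` the automaton passes through final states after `i`, `if`, and `if8`, then dies on the blank, so the recorded variables report an ID ending after `if8`.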

25
Finding Longest Match
26
Finding Longest Match
27
Non-deterministic Finite Automata (NFA)
  • NFA: may have a choice of edges labeled with the
    same symbol leading out of a state

Example
Another Example (same language)
  • This NFA recognizes the set of all strings of
    a's whose length is a multiple of two or three

28
Converting a Regular Expression to an NFA
NFA with a tail (start edge) and a head (ending
state)
Regular Expression a
Regular Expression ab
Regular Expression M
29
Translation of Regular Expressions to NFAs
30
Four Regular Expressions Translated to NFA
31
NFA to DFA
  • Implementing an NFA as a computer program is
    harder than implementing a DFA
  • Computers cannot guess between alternatives!
  • To avoid guessing, we have to try every
    possibility!
  • Without eating the first character of the input,
    the only states reachable from state 1 are 1,
    4, 9, and 14.

ε-closure of 1
32
Example: NFA on the string "in"
ε-closure of {1} = {1, 4, 9, 14} (without eating the first char)
eat char i: {2, 5, 15}
ε-closure of {5} = {5, 6, 8}, giving {2, 5, 6, 8, 15}
eat char n: {7}
ε-closure of {7} = {6, 7, 8}
final state 8 → ID(in)
33
Closure
edge(s, c): the set of all NFA states reachable by
following a single edge from state s with label c.
For a set of states S, closure(S) is the smallest set T
such that $T = S \cup \bigcup_{s \in T} \mathrm{edge}(s, \epsilon)$,
that is, the set of states that can be reached from a
state in S without consuming any of the input, i.e., by
going only through ε-edges.
34
Closure
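Calculate T by iteration: start with T = S and keep
adding edge(s, ε) for every s in T until T stops changing.
T can only grow in each iteration (so the final T
includes S). The algorithm must terminate because there
are only a finite number of distinct states in the
NFA.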
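
A minimal sketch of the iteration follows. The ε-edges in `EPS_EDGES` are an assumption: they were chosen so that the results agree with the closures quoted on the slides ({1, 4, 9, 14}, {5, 6, 8}, and {6, 7, 8}), but the exact edges of the original figure are not in the transcript.

```python
# Assumed epsilon edges, consistent with the closures quoted in the slides.
EPS_EDGES = {
    1: {4, 9, 14},
    5: {8}, 8: {6}, 7: {8},
    10: {13}, 13: {11}, 12: {13},
}

def closure(S):
    """Smallest set T containing S and closed under epsilon edges."""
    T = set(S)
    while True:
        # One iteration: add everything one epsilon step away from T.
        T_next = T | {t for s in T for t in EPS_EDGES.get(s, ())}
        if T_next == T:        # no growth, so the fixed point is reached
            return T
        T = T_next

print(sorted(closure({1})))
print(sorted(closure({5})))
print(sorted(closure({7})))
```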
35
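DFAedge(d, c)
Starting in a set of NFA states d = {si, sk, sl}
and eating the input symbol c, we reach a new set
of NFA states:
$\mathrm{DFAedge}(d, c) = \mathrm{closure}\bigl(\bigcup_{s \in d} \mathrm{edge}(s, c)\bigr)$
To simulate the NFA, start with d = closure({s1}) and
replace d by DFAedge(d, ci) for each input symbol in turn.
  • s1: start state
  • Input string: c1, ..., ck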
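
The simulation can be sketched as follows. The NFA below is reconstructed so that it reproduces the state sets quoted on the slides for the input "in"; its exact edges, the lowercase-only alphabet, and the use of letters and digits as a stand-in for "any character" on the edge out of state 14 are all assumptions.

```python
import string

LETTERS = set(string.ascii_lowercase)
DIGITS = set(string.digits)

# Assumed NFA: (source state, set of edge labels, target state).
SYMBOL_EDGES = [
    (1, {"i"}, 2), (2, {"f"}, 3),               # IF
    (4, LETTERS, 5), (6, LETTERS | DIGITS, 7),  # ID
    (9, DIGITS, 10), (11, DIGITS, 12),          # NUM
    (14, LETTERS | DIGITS, 15),                 # error (stand-in for "any")
]
EPS_EDGES = {1: {4, 9, 14}, 5: {8}, 8: {6}, 7: {8},
             10: {13}, 13: {11}, 12: {13}}

def edge(s, c):
    """NFA states reachable from s by a single edge labeled c."""
    return {dst for src, labels, dst in SYMBOL_EDGES if src == s and c in labels}

def closure(S):
    """States reachable from S through epsilon edges only."""
    T = set(S)
    while True:
        T_next = T | {t for s in T for t in EPS_EDGES.get(s, ())}
        if T_next == T:
            return T
        T = T_next

def dfa_edge(d, c):
    """closure of the union of edge(s, c) over all s in d."""
    return closure({t for s in d for t in edge(s, c)})

def simulate(start, text):
    """Set of NFA states the machine may be in after reading text."""
    d = closure({start})
    for c in text:
        d = dfa_edge(d, c)
    return d

print(sorted(simulate(1, "i")))
print(sorted(simulate(1, "in")))
```

Tracking a whole set of states is what replaces guessing: every alternative is followed at once, and the input is accepted if the final set contains a final NFA state.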

36
Example: NFA on the string "in"
ε-closure of {1} = {1, 4, 9, 14} (without eating the first char)
eat char i: {2, 5, 15}
ε-closure of {5} = {5, 6, 8}, giving {2, 5, 6, 8, 15}
eat char n: {7}
ε-closure of {7} = {6, 7, 8}
final state 8 → ID(in)
37
DFA Construction
if
DFA state 1: a set of NFA states
  • Each set of NFA states corresponds to one DFA
    state
  • The DFA has at most 2^n states, since the NFA
    has a finite number n of states
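
A sketch of the construction, generating only the DFA states that are actually reachable, is given below. The NFA is the same reconstructed example as in the slides' "in" walkthrough; its edges, the restricted alphabet, and all helper names are assumptions for illustration.

```python
import string

LETTERS = set(string.ascii_lowercase)
DIGITS = set(string.digits)

# Assumed NFA: (source state, set of edge labels, target state).
SYMBOL_EDGES = [
    (1, {"i"}, 2), (2, {"f"}, 3),
    (4, LETTERS, 5), (6, LETTERS | DIGITS, 7),
    (9, DIGITS, 10), (11, DIGITS, 12),
    (14, LETTERS | DIGITS, 15),
]
EPS_EDGES = {1: {4, 9, 14}, 5: {8}, 8: {6}, 7: {8},
             10: {13}, 13: {11}, 12: {13}}

def closure(S):
    T = set(S)
    while True:
        T_next = T | {t for s in T for t in EPS_EDGES.get(s, ())}
        if T_next == T:
            return T
        T = T_next

def dfa_edge(d, c):
    return closure({dst for src, labels, dst in SYMBOL_EDGES
                    if src in d and c in labels})

def build_dfa(start, alphabet):
    """Return (states, trans): states[j] is the NFA-state set of DFA state j+1."""
    states = [frozenset(closure({start}))]
    number = {states[0]: 1}          # DFA states are numbered from 1
    trans = {}                       # missing key = dead state 0
    j = 0
    while j < len(states):
        for c in alphabet:
            e = frozenset(dfa_edge(states[j], c))
            if not e:
                continue             # no edge: leave it to the dead state
            if e not in number:
                states.append(e)
                number[e] = len(states)
            trans[(j + 1, c)] = number[e]
        j += 1
    return states, trans

DFA_STATES, DFA_TRANS = build_dfa(1, sorted(LETTERS | DIGITS))
for n, nfa_states in enumerate(DFA_STATES, start=1):
    print(n, sorted(nfa_states))
```

Although the bound is 2^n, only the sets reachable from closure({s1}) are ever created, which in practice is far fewer.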

38
State Tree
39
NFA converted to DFA
  • A state d is final in the DFA if any NFA state
    in states[d] is final in the NFA.
  • When several NFA states in states[d] are final,
    label d using rule priority: the token type of
    the earliest rule wins.
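
The labeling step can be sketched in a few lines. The mapping of final NFA states to token types (3 for IF, 8 for ID, 13 for NUM, 15 for error) and the priority order are assumptions that match the example automaton in these slides.

```python
# Assumed rule order (highest priority first) and final NFA states.
PRIORITY = ["IF", "ID", "NUM", "error"]
NFA_FINAL = {3: "IF", 8: "ID", 13: "NUM", 15: "error"}

def label(dfa_state):
    """Token type of a DFA state (a set of NFA states), or None if nonfinal."""
    found = {NFA_FINAL[s] for s in dfa_state if s in NFA_FINAL}
    for token_type in PRIORITY:
        if token_type in found:
            return token_type
    return None

print(label({3, 6, 7, 8}))       # contains final states for IF and ID
print(label({2, 5, 6, 8, 15}))   # contains final states for ID and error
print(label({1, 4, 9, 14}))      # contains no final state
```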

40
Equivalent States
Two states s1 and s2 are equivalent when the
machine starting in s1 accepts a string σ if and
only if, starting in s2, it accepts σ.
Example:
{5,6,8,15} and {6,7,8}; {10,11,13,15} and
{11,12,13}
How to find equivalent states?
s1 and s2 are equivalent if they are both final
or both nonfinal and, for any symbol c,
trans[s1, c] = trans[s2, c].
41
Equivalent States
trans[2, a] ≠ trans[4, a]
Are states 2 and 4 equivalent?
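
The transition-table test is easy to code, and a small made-up automaton shows why it can answer "no" for states that are in fact equivalent. The DFA below is hypothetical (it is not the one on the slide): states 2 and 4 each accept exactly the string a, but they reach different final states, so the test only succeeds after states 3 and 5 have been merged. Final states are treated as a plain set here; a lexer would also compare their token types.

```python
def same_by_criterion(s1, s2, trans, finals, alphabet):
    """Sufficient test: same finality and identical transitions."""
    if (s1 in finals) != (s2 in finals):
        return False
    # A missing entry means "no edge", encoded as dead state 0.
    return all(trans.get((s1, c), 0) == trans.get((s2, c), 0)
               for c in alphabet)

# Hypothetical DFA: 1 -a-> 2 -a-> 3 and 1 -b-> 4 -a-> 5; 3 and 5 are final.
TRANS = {(1, "a"): 2, (1, "b"): 4, (2, "a"): 3, (4, "a"): 5}
FINALS = {3, 5}
ALPHABET = "ab"

print(same_by_criterion(3, 5, TRANS, FINALS, ALPHABET))
print(same_by_criterion(2, 4, TRANS, FINALS, ALPHABET))

# Merge state 5 into state 3, then test states 2 and 4 again.
MERGED = {key: (3 if target == 5 else target) for key, target in TRANS.items()}
print(same_by_criterion(2, 4, MERGED, {3}, ALPHABET))
```

Repeating the test after each merge finds more equivalent pairs than a single pass, though this sketch makes no claim to find all of them.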
42
JavaCC