Title: Compiler Design Chapter 2
1Compiler Design - Chapter 2
Lexical Analysis
2Lexical Analysis
3Analysis
- Program translation: converting a program from one language into another
- Analysis: pull the program apart to understand it (compiler front end)
- Synthesis: put it together in a different way (compiler back end)
4Stages of Analysis
- Lexical analysis: breaking the input up into individual words/tokens
- Syntax analysis: parsing the phrase structure of the program
- Semantic analysis: calculating the program's meaning
5Lexical Analyzer
- Input is a stream of characters
- Produces a stream of names, keywords, and punctuation marks
- Discards white space and comments
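A minimal sketch of such a lexical analyzer in Python. The token classes, the keyword set, and the /* ... */ comment syntax are assumptions chosen for illustration; they are not taken from the slides.

```python
import re

# Hypothetical keyword set and token classes, for illustration only.
KEYWORDS = {"if", "void", "return"}

TOKEN_RE = re.compile(r"""
    (?P<SKIP>\s+|/\*.*?\*/)              # white space and comments (discarded)
  | (?P<NAME>[A-Za-z_][A-Za-z_0-9]*)     # names; keywords are separated out below
  | (?P<NUM>[0-9]+)                      # integer literals
  | (?P<PUNCT>[(){};,=+\-*/<>!])         # single-character punctuation marks
""", re.VERBOSE | re.DOTALL)


def tokenize(text):
    """Turn a character stream into a list of (token type, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "SKIP":
            continue                      # white space and comments produce no token
        if kind == "NAME" and lexeme in KEYWORDS:
            kind = lexeme.upper()         # reserved word gets its own token type
        tokens.append((kind, lexeme))
    return tokens
```

For example, tokenize("if (x1) /* note */ return 42;") yields the tokens IF, (, x1, ), RETURN, 42 and ; with the comment and the blanks dropped.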
6Lexical Tokens
Lexical token: a sequence of characters that can be treated as a unit in the grammar of a programming language
- Tokens such as IF, VOID, and RETURN are reserved words, constructed from alphabetic characters
- Reserved words cannot be used as identifiers
7Non Tokens Examples
8Semantic Values in Lexical Analysis
- Report semantic values attached to identifiers
and literals
(Figure: a token stream annotated with token types and semantic values)
9Lexical Tokens
- Description of the lexical tokens of C/Java: identifiers
- Identifier: a sequence of letters and digits; the first character must be a letter
- Underscore _ counts as a letter
- Upper- and lowercase letters are different
- If the input stream has been parsed into tokens up to a given character, the next token is the longest string of characters that can possibly constitute a token
- Blanks, tabs, newlines, and comments are ignored except where they separate tokens
- Some white space is required to separate adjacent identifiers, keywords, and constants
10Regular Expressions
- A language is a set of strings
- A string is a finite sequence of symbols
- The symbols are taken from a finite alphabet (usually the ASCII character set)
- Use regular expressions to specify lexical tokens
- Use deterministic finite automata to implement the lexer
11Regular Expressions
- Each regular expression stands for a set of strings.
- Symbol: for each symbol a in the alphabet, the regular expression a denotes the language containing just the string a
- Alternation
  - Given regular expressions M and N
  - M | N is a new regular expression
  - A string is in the language of M | N if it is in the language of M or in the language of N
  - Example: the language of a | b contains the strings a and b
12Regular Expressions
- Concatenation
  - Given regular expressions M and N
  - M · N is a new regular expression
  - A string is in the language of M · N if it is the concatenation of two strings α and β such that α is in the language of M and β is in the language of N
  - Example: the language of (a | b) · a contains the strings aa and ba
13Regular Expressions
- Epsilon
  - The regular expression ε represents a language whose only string is the empty string.
  - Example: (a · b) | ε represents the language {"", "ab"}
- Repetition
  - Given a regular expression M, its Kleene closure is M*
  - A string is in M* if it is the concatenation of zero or more strings, all of which are in M
  - Example: ((a | b) · a)* represents the infinite set {"", "aa", "ba", "aaaa", "baaa", "aaba", "baba", "aaaaaaaa", ...}
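One way to make these definitions concrete is to treat each language as a Python set of strings. The sketch below is only an illustration: the function names are made up here, and because a Kleene closure is infinite, star takes a hypothetical max_len bound and returns only the strings up to that length.

```python
def alt(M, N):
    """Alternation M | N: strings in M or in N."""
    return M | N


def concat(M, N):
    """Concatenation M · N: every string of M followed by every string of N."""
    return {a + b for a in M for b in N}


def star(M, max_len):
    """Kleene closure M*, truncated to strings of at most max_len characters."""
    result = {""}                  # zero repetitions give the empty string
    frontier = {""}
    while frontier:
        # Append one more string from M to everything found in the last round.
        frontier = {
            s + m for s in frontier for m in M if len(s + m) <= max_len
        } - result
        result |= frontier
    return result


EPSILON = {""}                     # the language of the regular expression ε
a, b = {"a"}, {"b"}
```

With these helpers, concat(alt(a, b), a) gives {"aa", "ba"}, alt(concat(a, b), EPSILON) gives {"", "ab"}, and star(concat(alt(a, b), a), 4) gives the members of ((a | b) · a)* of length at most four.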
14Regular Expressions
- Using symbols, alternation, concatenation, epsilon, and Kleene closure, we can specify the sets of ASCII strings that form the tokens of a language. Examples:
  - (0 | 1)* · 0: binary numbers that are multiples of two
  - b*(abb*)*(a | ε): strings of a's and b's with no consecutive a's
  - (a | b)*aa(a | b)*: strings of a's and b's containing consecutive a's
- Notation
  - The concatenation symbol or the epsilon is sometimes omitted
  - Kleene closure binds tighter than concatenation
  - Concatenation binds tighter than alternation
  - Examples: ab | c means (a · b) | c, and (a |) means (a | ε)
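The three example expressions can be checked by brute force. The sketch below writes them in Python's re syntax (concatenation is implicit, and an empty alternative stands for ε) and compares each one against a direct test of the stated property on every string up to a small length. The helper names and the length bound are assumptions made for this illustration.

```python
import re
from itertools import product


def strings(alphabet, max_len):
    """Yield every string over alphabet with at most max_len symbols."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


EVEN_BINARY = re.compile(r"(0|1)*0")        # (0 | 1)* · 0
NO_AA = re.compile(r"b*(abb*)*(a|)")        # b*(abb*)*(a | ε)
HAS_AA = re.compile(r"(a|b)*aa(a|b)*")      # (a | b)*aa(a | b)*


def check(max_len=8):
    """Return True if each expression agrees with its description."""
    for s in strings("01", max_len):
        is_even = s != "" and int(s, 2) % 2 == 0
        if bool(EVEN_BINARY.fullmatch(s)) != is_even:
            return False
    for s in strings("ab", max_len):
        if bool(NO_AA.fullmatch(s)) != ("aa" not in s):
            return False
        if bool(HAS_AA.fullmatch(s)) != ("aa" in s):
            return False
    return True
```

fullmatch is used rather than match because membership in the language requires the whole string to match, not just a prefix.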
15Regular Expressions
- More Abbreviations
  - [abcd] means (a | b | c | d)
  - [b-g] means [bcdefg]
  - [b-gM-Qkr] means [bcdefgMNOPQkr]
  - M? means (M | ε)
  - M+ means (M · M*)
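Python's re module happens to support the same abbreviations, so each one can be compared with its expansion. This is a rough sketch: same_language is a hypothetical helper that only compares two patterns on a finite list of candidate strings, so it gives evidence of equivalence rather than a proof.

```python
import re
import string
from itertools import product


def same_language(p, q, candidates):
    """True if patterns p and q accept exactly the same candidate strings."""
    return all(
        (re.fullmatch(p, s) is None) == (re.fullmatch(q, s) is None)
        for s in candidates
    )


# Candidate inputs: single characters, and short strings over {a, b}.
single_chars = list(string.ascii_letters + string.digits)
ab_strings = ["".join(t) for n in range(7) for t in product("ab", repeat=n)]

results = {
    "[abcd]": same_language(r"[abcd]", r"a|b|c|d", single_chars),
    "[b-g]": same_language(r"[b-g]", r"[bcdefg]", single_chars),
    "[b-gM-Qkr]": same_language(r"[b-gM-Qkr]", r"[bcdefgMNOPQkr]", single_chars),
    "M?": same_language(r"(ab)?", r"(ab|)", ab_strings),      # M? = (M | ε)
    "M+": same_language(r"(ab)+", r"(ab)(ab)*", ab_strings),  # M+ = (M · M*)
}
```

Every entry of results should come out True for these candidates.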
16Regular Expression Notation
comments
17Disambiguation Rules
- Does if8 match as a single identifier or as the two tokens if and 8?
- Does the string if 89 begin with an identifier or a reserved word?
- Longest match: the longest initial substring of the input that can match any regular expression is taken as the next token
- Rule priority: for a particular longest initial substring, the first regular expression (in the order the rules are listed) that can match determines its token type.
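The two rules can be sketched directly: try every rule at the current position, keep the longest match, and let the earlier rule win a tie. The rule list below is a hypothetical one for illustration. Note that Python's re does not guarantee the longest match inside a single pattern in general; for these simple patterns the greedy match is the longest one.

```python
import re

# Hypothetical rule list; the order of the rules expresses their priority.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[a-z][a-z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
    ("WS", re.compile(r"[ \t\n]+")),
]


def next_token(text, pos):
    """Return (token type, lexeme, end position) for the token starting at pos."""
    best_kind, best_end = None, pos
    for kind, pattern in RULES:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best (longest match),
        # so on a tie the rule listed first is kept (rule priority).
        if m is not None and m.end() > best_end:
            best_kind, best_end = kind, m.end()
    if best_kind is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best_kind, text[pos:best_end], best_end


def tokenize(text):
    """Split text into (token type, lexeme) pairs, dropping white space."""
    tokens, pos = [], 0
    while pos < len(text):
        kind, lexeme, pos = next_token(text, pos)
        if kind != "WS":
            tokens.append((kind, lexeme))
    return tokens
```

Under these rules, if8 comes out as the single identifier if8 (longest match), while if 89 starts with the reserved word if (IF and ID both match if, and IF is listed first).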
18Finite Automata
- A formalism that can be implemented as a computer program.
- A finite automaton has
  - A finite set of states
  - Edges that lead from one state to another
  - A symbol labeling each edge
  - One state that is the start state
  - Some states that are final states (there may be more than one)
19Finite Automata
20Deterministic Finite Automata (DFA)
- No two (or more) edges leading from the same state are labeled with the same symbol
- A DFA accepts or rejects a string as follows
  - Starting at the start state, for each character in the input string the automaton follows exactly one edge to get to the next state
  - The edge must be labeled with the input character
  - After making n transitions for an n-character string, if the automaton is in a final state, it accepts the string
  - If it is not in a final state, or if at any point there is no appropriately labeled edge to follow, it rejects the string
- The language recognized by an automaton is the set of strings that it accepts.
21Deterministic Finite Automata (DFA)
Accepts
- Any string in the language recognized by automaton ID begins with a letter
- Any single letter leads to state 2, the final state, so it is accepted
- From state 2, any single letter/digit leads back to state 2
- A string consisting of a letter followed by any number of letters/digits is therefore also accepted
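The accept/reject procedure can be sketched for the ID automaton. State 2 as the final state comes from the slide; numbering the start state 1 and restricting letters to ASCII (without the underscore) are assumptions made for this sketch.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)

START = 1            # assumed number of the start state
FINAL_STATES = {2}   # state 2 is the final state of automaton ID


def step(state, ch):
    """Return the next state, or None if no edge from state is labeled ch."""
    if state == 1 and ch in LETTERS:
        return 2
    if state == 2 and (ch in LETTERS or ch in DIGITS):
        return 2
    return None


def accepts(text):
    """Run the DFA over text, following exactly one edge per input character."""
    state = START
    for ch in text:
        state = step(state, ch)
        if state is None:
            return False             # no appropriately labeled edge: reject
    return state in FINAL_STATES     # accept only if we end in a final state
```

accepts("x") and accepts("if8") hold, while accepts("8if") fails on the first character and accepts("") fails because the start state is not final.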
22Combined Finite Automata
- Each final state is labeled with the token type it accepts
- State 2 can lead to ID or IF; rule priority resolves this
- State 3 is labeled IF: this token must be a reserved word, not an identifier
23Transition Matrix
State 0 is the dead state; an entry of 0 in the matrix means there is no edge
24Recognizing the Longest Match
- To find the longest match (the longest initial substring of the input that is a valid token), the lexical analyzer must
  - Interpret transitions
  - Keep track of the longest match so far, and the position of that match
- Keeping track of the longest match
  - Remember the last time a final state was reached, using two variables (updated whenever a final state is reached)
    - Last-Final: the state number of the most recent final state
    - Input-Position-at-Last-Final
  - When a dead state (a nonfinal state with no output transitions) is reached, the variables give the token that was matched
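A sketch of this bookkeeping, assuming a small hand-built automaton for IF, ID, and NUM. The state numbers and the trans function below are hypothetical and do not reproduce the figure from the slides; only the two tracking variables and the use of 0 as the dead state follow the text.

```python
import string

LETTERS = set(string.ascii_lowercase)
DIGITS = set(string.digits)

DEAD, START = 0, 1
# Hypothetical combined automaton: each final state carries its token type.
FINAL = {2: "ID", 3: "IF", 4: "ID", 5: "NUM"}


def trans(state, ch):
    """Transition function; returns DEAD (0) when there is no edge."""
    alnum = ch in LETTERS or ch in DIGITS
    if state == 1:
        if ch == "i":
            return 2
        if ch in LETTERS:
            return 4
        if ch in DIGITS:
            return 5
    elif state == 2:
        if ch == "f":
            return 3
        if alnum:
            return 4
    elif state in (3, 4):
        if alnum:
            return 4
    elif state == 5:
        if ch in DIGITS:
            return 5
    return DEAD


def longest_match(text, start):
    """Return (token type, lexeme, end position) of the longest token at start."""
    state = START
    last_final = DEAD             # Last-Final: most recent final state seen
    pos_at_last_final = start     # Input-Position-at-Last-Final
    pos = start
    while pos < len(text):
        state = trans(state, text[pos])
        if state == DEAD:
            break                 # dead state reached: stop scanning
        pos += 1
        if state in FINAL:
            last_final, pos_at_last_final = state, pos
    if last_final == DEAD:
        raise ValueError(f"no token starts at position {start}")
    return FINAL[last_final], text[start:pos_at_last_final], pos_at_last_final
```

For the input if8 x the scanner passes through final states after i, if, and if8, and reports the last one when the blank leads to the dead state.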
25Finding Longest Match
26Finding Longest Match
27Non-deterministic Finite Automata (NFA)
- NFA: a state may have a choice of several outgoing edges labeled with the same symbol
Example
Another Example (same language)
- This NFA recognizes the set of all strings of a's whose length is a multiple of two or three
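An NFA accepts a string if some sequence of choices leads to a final state. The sketch below makes that literal by trying each choice recursively. The state numbering is hypothetical (the figure is not reproduced here), and the start state is marked final on the assumption that the empty string, of length zero, counts as a multiple of two and three.

```python
# Hypothetical NFA without ε-edges. From the start state there are two edges
# labeled "a": one into a loop of length two, one into a loop of length three.
EDGES = {
    (1, "a"): {2, 4},
    (2, "a"): {3}, (3, "a"): {2},                   # even lengths end in state 3
    (4, "a"): {5}, (5, "a"): {6}, (6, "a"): {4},    # multiples of three end in 6
}
START = 1
FINAL = {1, 3, 6}


def accepts(text, state=START):
    """True if some choice of edges leads from state to a final state."""
    if text == "":
        return state in FINAL
    # Try every edge labeled with the next character; one success is enough.
    return any(
        accepts(text[1:], nxt) for nxt in EDGES.get((state, text[0]), ())
    )
```

Here the only real choice is on the first character; the recursion explores both branches, which is the "guessing" a direct implementation has to emulate.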
28Converting a Regular Expression to an NFA
Each regular expression is converted to an NFA with a tail (start edge) and a head (ending state)
Regular Expression a
Regular Expression ab
Regular Expression M
29Translation of Regular Expressions to NFAs
30Four Regular Expressions Translated to NFA
31NFA to DFA
- Implementing an NFA as a computer program is harder than implementing a DFA
- Computers cannot guess between alternatives!
- To avoid guessing, we have to try every possibility!
- Without eating the first character of the input, the only states reachable from state 1 are 1, 4, 9, and 14: this is the ε-closure of state 1
32Example NFA on string in
ε-closure of 1 = {1, 4, 9, 14} (without eating the first char)
eat char i → {2, 5, 15}
ε-closure of 5 = {5, 6, 8}, giving {2, 5, 6, 8, 15}
eat char n → {7}
ε-closure of 7 = {6, 7, 8}
final state 8 → ID(in)
33Closure
edge(s, c): the set of all NFA states reachable by following a single edge from state s with label c
For a set of states S, closure(S) is the smallest set T such that
$T = S \cup \left( \bigcup_{s \in T} \mathrm{edge}(s, \epsilon) \right)$
that is, the set of states that can be reached from a state in S without consuming any of the input, i.e., by going only through ε-edges.
34Closure
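Calculate T by iteration: start with $T = S$, then repeatedly replace $T$ with $T \cup \bigcup_{s \in T} \mathrm{edge}(s, \epsilon)$ until $T$ stops changing.
T can only grow in each iteration (the final T includes S). The algorithm must terminate since there is only a finite number of distinct states in the NFA.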
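The iteration can be sketched as follows. The ε-edge table is an assumption: it was chosen to be consistent with the closures quoted in the example trace (closure of 1, of 5, and of 7) and may differ in detail from the figure on the slides.

```python
# Assumed ε-edges of the example NFA: EPS_EDGES[s] is edge(s, ε).
EPS_EDGES = {
    1: {4, 9, 14},
    5: {8}, 7: {8}, 8: {6},
    10: {13}, 12: {13}, 13: {11},
}


def closure(S):
    """Smallest set T containing S and closed under following ε-edges."""
    T = set(S)
    while True:
        T_prev = set(T)
        for s in T_prev:
            T |= EPS_EDGES.get(s, set())   # T can only grow
        if T == T_prev:                    # nothing new was added: fixed point
            return T
```

Each pass either adds at least one state or stops, so the loop runs at most as many times as there are NFA states.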
35DFAedge(d, c)
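By starting in a set of NFA states $d = \{s_i, s_k, s_l\}$ and eating the input symbol $c$, we reach a new set of NFA states:
$\mathrm{DFAedge}(d, c) = \mathrm{closure}\left( \bigcup_{s \in d} \mathrm{edge}(s, c) \right)$
- $s_1$: start state
- input string: $c_1, \ldots, c_k$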
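Using DFAedge, the NFA can be simulated by starting from the closure of the start state and applying DFAedge once per input character. The sketch below does this for the string in. The edge tables are assumptions reconstructed to agree with the trace given for that string; the figure itself is not reproduced here.

```python
import string

LETTERS = set(string.ascii_lowercase)
DIGITS = set(string.digits)

# Assumed ε-edges and labeled edges of the example NFA.
EPS_EDGES = {1: {4, 9, 14}, 5: {8}, 7: {8}, 8: {6},
             10: {13}, 12: {13}, 13: {11}}


def edge(s, c):
    """NFA states reachable from s by a single edge labeled c."""
    out = set()
    if s == 1 and c == "i":
        out.add(2)
    if s == 2 and c == "f":
        out.add(3)
    if s == 4 and c in LETTERS:
        out.add(5)
    if s == 6 and (c in LETTERS or c in DIGITS):
        out.add(7)
    if s in (9, 11) and c in DIGITS:
        out.add(s + 1)
    if s == 14:
        out.add(15)               # assumed edge that matches any character
    return out


def closure(S):
    """States reachable from S through ε-edges only."""
    T, work = set(S), list(S)
    while work:
        for t in EPS_EDGES.get(work.pop(), ()):
            if t not in T:
                T.add(t)
                work.append(t)
    return T


def dfa_edge(d, c):
    """DFAedge(d, c): closure of everything reachable from d on symbol c."""
    reached = set()
    for s in d:
        reached |= edge(s, c)
    return closure(reached)


def simulate(text, start=1):
    """Return the successive sets of NFA states while reading text."""
    d = closure({start})
    history = [d]
    for c in text:
        d = dfa_edge(d, c)
        history.append(d)
    return history
```

simulate("in") passes through {1, 4, 9, 14}, {2, 5, 6, 8, 15}, and {6, 7, 8}; the last set contains state 8, so the string is recognized as an ID.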
37DFA Construction
if
DFA state 1: a set of NFA states
- Each set of NFA states corresponds to one DFA state
- The DFA has at most $2^n$ states, since the NFA has a finite number n of states
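A sketch of the construction, in which each DFA state is a set of NFA states and new sets are discovered with a worklist. To keep it short it uses a hypothetical six-state NFA over the alphabet {a} (strings of a's whose length is a multiple of two or three), not the NFA from the slides; the names states, trans, and final are choices made for this sketch.

```python
EPS = None   # label used for ε-edges in this sketch

# Hypothetical NFA: ε-edges from the start state into two loops.
NFA_EDGES = {
    (0, EPS): {1, 3},
    (1, "a"): {2}, (2, "a"): {1},
    (3, "a"): {4}, (4, "a"): {5}, (5, "a"): {3},
}
NFA_START, NFA_FINAL, ALPHABET = 0, {1, 3}, ("a",)


def closure(S):
    T, work = set(S), list(S)
    while work:
        for t in NFA_EDGES.get((work.pop(), EPS), ()):
            if t not in T:
                T.add(t)
                work.append(t)
    return frozenset(T)


def dfa_edge(d, c):
    reached = set()
    for s in d:
        reached |= NFA_EDGES.get((s, c), set())
    return closure(reached)


def build_dfa():
    """Return (states, trans, final); states[d] is the NFA-state set of DFA state d."""
    start = closure({NFA_START})
    states, index, trans = [start], {start: 0}, {}
    j = 0
    while j < len(states):              # states may grow while we iterate
        for c in ALPHABET:
            e = dfa_edge(states[j], c)
            if not e:
                continue                # empty set: no edge (dead state)
            if e not in index:          # a set not seen before is a new DFA state
                index[e] = len(states)
                states.append(e)
            trans[(j, c)] = index[e]
        j += 1
    # A DFA state is final if any of its NFA states is final.
    final = {d for d, S in enumerate(states) if S & NFA_FINAL}
    return states, trans, final


def dfa_accepts(text, trans, final):
    d = 0
    for c in text:
        if (d, c) not in trans:
            return False
        d = trans[(d, c)]
    return d in final
```

For this NFA the construction reaches only 7 sets, far fewer than the bound of 2^6 = 64, because only sets actually reachable from the start state are created.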
38State Tree
39NFA converted to DFA
- A state d is final in the DFA if any NFA state in states[d] is final in the NFA.
- When several NFA states in states[d] are final, label d according to rule priority.
40Equivalent States
Two states s1 and s2 are equivalent when the machine starting in s1 accepts a string σ if and only if starting in s2 it accepts σ
Example
{5,6,8,15} and {6,7,8}; {10,11,13,15} and {11,12,13}
How to find equivalent states?
s1 and s2 are equivalent if they are both final or both nonfinal and, for any symbol c, trans[s1, c] = trans[s2, c].
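This test is easy to state in code. The sketch below applies it to a small hypothetical DFA for identifiers, with the input symbols grouped into two classes and 0 standing for the dead state; none of the state numbers come from the slides. The test is a sufficient condition only, so a pair it rejects may still be equivalent.

```python
# Hypothetical DFA: state 1 is the start state, states 2 and 3 are final.
ALPHABET = ("letter", "digit")
TRANS = {
    (1, "letter"): 2,
    (2, "letter"): 3, (2, "digit"): 3,
    (3, "letter"): 3, (3, "digit"): 3,
}
FINAL = {2, 3}
DEAD = 0


def equivalent(s1, s2):
    """Sufficient test: same finality and identical transitions on every symbol."""
    if (s1 in FINAL) != (s2 in FINAL):
        return False
    return all(
        TRANS.get((s1, c), DEAD) == TRANS.get((s2, c), DEAD) for c in ALPHABET
    )


def merge(keep, drop):
    """Return a transition table in which state drop is replaced by keep."""
    return {
        (s, c): (keep if t == drop else t)
        for (s, c), t in TRANS.items()
        if s != drop
    }
```

Here equivalent(2, 3) holds, so merge(2, 3) yields a two-state automaton that accepts the same strings, while equivalent(1, 2) fails because state 1 is not final.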
41Equivalent States
trans[2, a] ≠ trans[4, a]
Are states 2 and 4 equivalent?
42JavaCC