Title: Chapter 2 Lexical Analysis
1Chapter 2 Lexical Analysis
2Lexical Analysis
- Lexical analysis recognizes the vocabulary of the programming language and transforms a string of characters into a string of words or tokens
- Lexical analysis discards white space and comments between the tokens
- A lexical analyzer (or scanner) is the program that performs lexical analysis
3Outline
- Scanners
- Tokens
- Regular expressions
- Finite automata
- Automatic conversion from regular expressions to finite automata
- Flex - a scanner generator
4Scanners
[Figure: the Scanner reads characters and returns a token to the Parser on each "next token" request; Symbol Table]
5Tokens
- A token is a sequence of characters that can be treated as a unit in the grammar of a programming language
- A programming language classifies tokens into a finite set of token types

    Type     Examples
    ID       foo  i  n
    NUM      73  13
    IF       if
    COMMA    ,
6Semantic Values of Tokens
- Semantic values are used to distinguish different tokens in a token type
  - <ID, foo>, <ID, i>, <ID, n>
  - <NUM, 73>, <NUM, 13>
  - <IF, >
  - <COMMA, >
- Token types affect syntax analysis and semantic values affect semantic analysis
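A minimal sketch of the pairing of token type and semantic value, assuming a Python representation; the class name `Token` and its field names are illustrative and not part of the slides.

```python
from typing import NamedTuple, Optional

class Token(NamedTuple):
    """A token as a <type, semantic value> pair (names are hypothetical)."""
    type: str
    value: Optional[object] = None   # IF and COMMA carry no semantic value

tokens = [Token("ID", "foo"), Token("NUM", 73), Token("IF"), Token("COMMA")]

# A parser would look only at the types; later phases would use the values.
types_only = [t.type for t in tokens]
```

Two tokens of the same type, such as `Token("ID", "foo")` and `Token("ID", "i")`, compare unequal only because their semantic values differ.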
7Scanner Generators
[Figure: scanner definition in metalanguage -> Scanner Generator -> Scanner; program in programming language -> Scanner -> token types and semantic values]
8Languages
- A language is a set of strings
- A string is a finite sequence of symbols taken from a finite alphabet
- The C language is the (infinite) set of all strings that constitute legal C programs
- The language of C reserved words is the (finite) set of all alphabetic strings that cannot be used as identifiers in C programs
- Each token type is a language
9Regular Expressions (RE)
- A language allows us to use a finite description to specify a (possibly infinite) set
- RE is the metalanguage used to define the token types of a programming language
10Regular Expressions
- ε is a RE denoting L = {ε}
- If a ∈ alphabet, then a is a RE denoting L = {a}
- Suppose r and s are RE denoting L(r) and L(s)
  - alternation: (r) | (s) is a RE denoting L(r) ∪ L(s)
  - concatenation: (r)(s) is a RE denoting L(r)L(s)
  - repetition: (r)* is a RE denoting (L(r))*
  - (r) is a RE denoting L(r)
11Examples
- a | b denotes {a, b}
- (a | b)(a | b) denotes {aa, ab, ba, bb}
- a* denotes {ε, a, aa, aaa, ...}
- (a | b)* denotes the set of all strings of a's and b's
- a | a*b denotes the set containing the string a and all strings consisting of zero or more a's followed by a b
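The example languages above can be checked by brute force. The sketch below assumes Python's `re` module as the matcher and a hypothetical helper `language` that enumerates all strings over {a, b} up to a small length and keeps those the expression matches in full.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=3):
    """Strings over `alphabet` of length <= max_len that match `pattern` entirely."""
    rx = re.compile(pattern)
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if rx.fullmatch(s):
                found.append(s)
    return found

alt_ab = language("a|b")             # alternation
pairs = language("(a|b)(a|b)")       # concatenation of two alternations
stars = language("a*")               # repetition; "" plays the role of ε
mixed = language("a|a*b")            # the string a, or zero or more a's then b
```

Because the language of a* is infinite, the enumeration is cut off at `max_len`; the bound is a choice of this sketch only.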
12Regular Definitions
- Names for regular expressions

    d1 → r1
    d2 → r2
    ...
    dn → rn

  where each ri is a RE over alphabet ∪ {d1, d2, ..., di-1}
- Examples

    letter → A | B | ... | Z | a | b | ... | z
    digit → 0 | 1 | ... | 9
    identifier → letter ( letter | digit )*
13Notational Abbreviations
- One or more instances: (r)+ denoting (L(r))+
    r* = r+ | ε
    r+ = r r*
- Zero or one instance: r? = r | ε
- Character classes
    [abc] = a | b | c
    [a-z] = a | b | ... | z
    [^abc] = any character except a | b | c
- Any character except newline: .
14Examples
- if                                     {return IF;}
- [a-z][a-z0-9]*                         {return ID;}
- [0-9]+                                 {return NUM;}
- ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)    {return REAL;}
- ("--"[a-z]*"\n")|(" "|"\n"|"\t")+      {/* do nothing for white spaces and comments */}
- .                                      {error();}
15Completeness of REs
- A lexical specification should be complete; namely, it always matches some initial substring of the input

    .    /* match any */
16Disambiguity of REs (1)
- Longest-match disambiguation rule: the longest initial substring of the input that can match any regular expression is taken as the next token

    ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)    /* REAL */

    input: 0.9
17Disambiguity of REs (2)
- Rule-priority disambiguation rule: for a particular longest initial substring, the first regular expression that can match determines its token type

    if                /* IF */
    [a-z][a-z0-9]*    /* ID */

    input: if
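The two disambiguation rules can be sketched together. The code below assumes Python's `re` module stands in for the generated automaton, and the rule list, `next_token`, and `tokenize` are hypothetical names: every rule is tried at the current position, a strictly longer match replaces the current best, and a tie leaves the earlier rule in place.

```python
import re

# Rules in priority order (earlier rules win ties), mirroring the slides.
RULES = [
    ("IF",   r"if"),
    ("ID",   r"[a-z][a-z0-9]*"),
    ("NUM",  r"[0-9]+"),
    ("REAL", r"([0-9]+\.[0-9]*)|([0-9]*\.[0-9]+)"),
    ("SKIP", r"[ \n\t]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]

def next_token(text, pos):
    """Return (type, lexeme, new position) for the longest match at pos."""
    best_name, best_end = None, pos
    for name, rx in COMPILED:
        m = rx.match(text, pos)
        # Strictly longer wins (longest match); ties keep the earlier rule.
        if m and m.end() > best_end:
            best_name, best_end = name, m.end()
    if best_name is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best_name, text[pos:best_end], best_end

def tokenize(text):
    pos, out = 0, []
    while pos < len(text):
        name, lexeme, pos = next_token(text, pos)
        if name != "SKIP":          # white space is discarded
            out.append((name, lexeme))
    return out
```

With these rules, `if` is an IF token because IF is listed before ID, while `iffy` is an ID because the ID match is longer.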
18Finite Automata
- A finite automaton is a finite-state transition diagram that can be used to model the recognition of a token type specified by a regular expression
- A finite automaton can be a nondeterministic finite automaton or a deterministic finite automaton
19Nondeterministic Finite Automata (NFA)
- An NFA consists of
- A finite set of states
- A finite set of input symbols
- A transition function that maps (state, symbol) pairs to sets of states
- A state distinguished as start state
- A set of states distinguished as final states
20An Example
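- RE: (a | b)*abb
- States: {1, 2, 3, 4}
- Input symbols: {a, b}
- Transition function: (1,a) = {1,2}, (1,b) = {1}, (2,b) = {3}, (3,b) = {4}
- Start state: 1
- Final state: {4}

[Figure: transition diagram of this NFA; state 1 loops on a,b, then edges a, b, b lead through states 2 and 3 to state 4]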
21Acceptance of NFA
- An NFA accepts an input string s iff there is some path in the finite-state transition diagram from the start state to some final state such that the edge labels along this path spell out s
- The language recognized by an NFA is the set of strings it accepts
22An Example
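RE: (a | b)*abb, input: aabb

[Figure: NFA transition diagram with states 1 to 4; the path 1 -a-> 1 -a-> 2 -b-> 3 -b-> 4 spells out aabb and ends in final state 4, so the input is accepted]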
23An Example
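RE: (a | b)*abb, input: aaba

[Figure: NFA transition diagram with states 1 to 4; no path spelling out aaba ends in final state 4, so the input is rejected]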
24Another Example
- RE: aa* | bb*
- States: {1, 2, 3, 4, 5}
- Input symbols: {a, b}
- Transition function: (1, ε) = {2, 4}, (2, a) = {3}, (3, a) = {3}, (4, b) = {5}, (5, b) = {5}
- Start state: 1
- Final states: {3, 5}
25Finite-State Transition Diagram
RE: aa* | bb*, input: aaa

[Figure: transition diagram of the NFA with states 1 to 5]
26Operations on NFA states
- ε-closure(s): set of states reachable from a state s on ε-transitions alone
- ε-closure(S): set of states reachable from some state s in S on ε-transitions alone
- move(s, c): set of states to which there is a transition on input symbol c from a state s
- move(S, c): set of states to which there is a transition on input symbol c from some state s in S
27An Example
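RE: aa* | bb*, input: aaa

    S0 = {1}
    S1 = ε-closure({1}) = {1,2,4}
    S2 = move({1,2,4}, a) = {3}
    S3 = ε-closure({3}) = {3}
    S4 = move({3}, a) = {3}
    S5 = ε-closure({3}) = {3}
    S6 = move({3}, a) = {3}
    S7 = ε-closure({3}) = {3}
    3 is in {3, 5} ⇒ accept

[Figure: transition diagram of the NFA with states 1 to 5]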
28Simulating an NFA
Input: An input string ended with eof and an NFA with start state s0 and final states F.
Output: The answer "yes" if the NFA accepts the string, "no" otherwise.

    begin
        S := ε-closure({s0});
        c := nextchar;
        while c <> eof do
        begin
            S := ε-closure(move(S, c));
            c := nextchar
        end;
        if S ∩ F <> ∅ then return "yes"
        else return "no"
    end.
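The simulation above can be sketched in Python on the aa* | bb* automaton from the earlier example. The dictionary encoding, the empty string as the label for ε, and the function names are assumptions of this sketch, and end of input is simply the end of the Python string rather than an eof marker.

```python
EPS = ""   # label used for ε-transitions in this sketch

# NFA for aa* | bb*: (state, symbol) -> set of next states
DELTA = {
    (1, EPS): {2, 4},
    (2, "a"): {3},
    (3, "a"): {3},
    (4, "b"): {5},
    (5, "b"): {5},
}
START, FINALS = 1, {3, 5}

def eps_closure(states):
    """All states reachable from `states` on ε-transitions alone."""
    closure, stack = set(states), list(states)
    while stack:
        t = stack.pop()
        for u in DELTA.get((t, EPS), set()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

def move(states, c):
    """All states reachable from some state in `states` on symbol c."""
    out = set()
    for s in states:
        out |= DELTA.get((s, c), set())
    return out

def accepts(text):
    S = eps_closure({START})
    for c in text:
        S = eps_closure(move(S, c))
    return bool(S & FINALS)       # accept iff S ∩ F is not empty
```

Running `accepts("aaa")` follows the same sequence of state sets as the trace for aaa: {1,2,4}, then {3} after each a.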
29Computation of ?-closure
RE: (a | b)*abb

[Figure: NFA for (a | b)*abb with states 1 to 11]

    ε-closure(1) = {1,2,3,5,8}
    ε-closure(4) = {2,3,4,5,7,8}
30Computation of ?-closure
Input: An NFA and a set of NFA states S.
Output: T = ε-closure(S).

    begin
        push all states in S onto stack;
        T := S;
        while stack is not empty do
        begin
            pop t, the top element, off of stack;
            for each state u with an edge from t to u labeled ε do
                if u is not in T then
                begin
                    add u to T;
                    push u onto stack
                end
        end;
        return T
    end.
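A direct transcription of the stack-based algorithm, as a sketch. The ε-edges in `EPS_EDGES` are an assumption: they are the edges of an 11-state NFA for (a | b)*abb inferred from the closures listed in the slides, since the diagram itself is not reproduced here.

```python
# ε-edges only: state -> list of states reachable by one ε-transition.
# Assumed layout, inferred from ε-closure(1) and ε-closure(4) in the slides.
EPS_EDGES = {1: [2, 8], 2: [3, 5], 4: [7], 6: [7], 7: [2, 8]}

def eps_closure(states, eps_edges):
    """Return T = ε-closure(states) using an explicit stack."""
    closure = set(states)          # T := S
    stack = list(states)           # push all states in S onto stack
    while stack:
        t = stack.pop()
        for u in eps_edges.get(t, []):
            if u not in closure:   # each state is pushed at most once
                closure.add(u)
                stack.append(u)
    return closure

closure_1 = eps_closure({1}, EPS_EDGES)
closure_4 = eps_closure({4}, EPS_EDGES)
```

Because a state is pushed only when it is first added to the closure, the loop terminates even though the ε-edges contain a cycle (7 back to 2).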
31Deterministic Finite Automata (DFA)
- A DFA is a special case of an NFA in which
- no state has an ε-transition
- for each state s and input symbol a, there is at
most one edge labeled a leaving s
32An Example
- RE: (a | b)*abb
- States: {1, 2, 3, 4}
- Input symbols: {a, b}
- Transition function: (1,a) = 2, (2,a) = 2, (3,a) = 2, (4,a) = 2, (1,b) = 1, (2,b) = 3, (3,b) = 4, (4,b) = 1
- Start state: 1
- Final state: {4}
33Finite-State Transition Diagram
34Acceptance of DFA
- A DFA accepts an input string s iff there is one path in the finite-state transition diagram from the start state to some final state such that the edge labels along this path spell out s
- The language recognized by a DFA is the set of strings it accepts
35An Example
RE: (a | b)*abb, input: aabb
36An Example
RE: (a | b)*abb, input: aaba

[Figure: DFA transition diagram with states 1 to 4]
37An Example
Input: bbababb

    s = 1
    s = move(1, b) = 1
    s = move(1, b) = 1
    s = move(1, a) = 2
    s = move(2, b) = 3
    s = move(3, a) = 2
    s = move(2, b) = 3
    s = move(3, b) = 4
    4 is in {4} ⇒ accept
38Simulating a DFA
Input: An input string ended with eof and a DFA with start state s0 and final states F.
Output: The answer "yes" if the DFA accepts the string, "no" otherwise.

    begin
        s := s0;
        c := nextchar;
        while c <> eof do
        begin
            s := move(s, c);
            c := nextchar
        end;
        if s is in F then return "yes"
        else return "no"
    end.
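A sketch of the DFA simulation in Python, using the transition function of the (a | b)*abb example. The table encoding and the helper names `run` and `accepts` are assumptions of this sketch; the end of the Python string plays the role of eof.

```python
# DFA for (a | b)*abb: (state, symbol) -> next state
DTRAN = {
    (1, "a"): 2, (2, "a"): 2, (3, "a"): 2, (4, "a"): 2,
    (1, "b"): 1, (2, "b"): 3, (3, "b"): 4, (4, "b"): 1,
}
START, FINALS = 1, {4}

def run(text):
    """Return the sequence of states visited while reading text."""
    s = START
    trace = [s]
    for c in text:
        s = DTRAN[(s, c)]     # exactly one successor per (state, symbol)
        trace.append(s)
    return trace

def accepts(text):
    return run(text)[-1] in FINALS
```

Unlike the NFA simulation, only a single current state is tracked. A symbol outside {a, b} raises `KeyError` here; a full scanner would treat a missing entry as a dead state.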
39Combined Finite Automata
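[Figure: three separate automata. IF: start -> 1 -i-> 2 -f-> 3 for the RE if. ID: start -> 1 -[a-z]-> 2 with a loop on [a-z0-9] for the RE [a-z][a-z0-9]*. REAL: states 1 to 5 with edges on [0-9] and "." for the RE ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)]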
40Combined Finite Automata
[Figure: combined NFA; a new start state 1 has ε-transitions to the IF automaton (states 2 to 4), the ID automaton (states 5 and 6), and the REAL automaton (states 7 to 11)]
41Combined Finite Automata
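[Figure: combined DFA with states 1 to 8; from state 1, i leads to state 2 (ID), then f leads to state 3 (IF); other letters and digits lead to state 4 (ID); edges on [0-9] and "." lead through states 5 to 8, which recognize REAL]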
42Recognizing the Longest Match
- The automaton must keep track of the longest match seen so far and the position of that match until a dead state is reached
- Use two variables, Last-Final (the state number of the most recent final state encountered) and Input-Position-at-Last-Final, to remember the last time the automaton was in a final state
43An Example
Input: iffail

S = current state, C = next input character, L = Last-Final, P = Input-Position-at-Last-Final

    S    C    L    P
    1    i    0    0
    2    f    2    1
    3    f    3    2
    4    a    4    3
    4    i    4    4
    4    l    4    5
    4    ?    4    6

[Figure: combined DFA for IF, ID, and REAL, with states 1 to 8]
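The bookkeeping with Last-Final and Input-Position-at-Last-Final can be sketched as below. The transition table covers only the IF and ID part of the combined DFA (states 1 to 4, with 2 and 4 final for ID and 3 final for IF); leaving out the REAL states, and treating a missing table entry as the dead state, are assumptions of this sketch.

```python
import string

LETTERS = string.ascii_lowercase
ALNUM = LETTERS + string.digits

DTRAN = {}

def add(src, chars, dst):
    for ch in chars:
        DTRAN[(src, ch)] = dst

add(1, "i", 2)
add(1, [c for c in LETTERS if c != "i"], 4)   # a-h, j-z
add(2, "f", 3)
add(2, [c for c in ALNUM if c != "f"], 4)     # a-e, g-z, 0-9
add(3, ALNUM, 4)
add(4, ALNUM, 4)
FINAL_TYPE = {2: "ID", 3: "IF", 4: "ID"}

def longest_match(text, start):
    """Return (type, lexeme, end) for the longest token at start, or None."""
    state = 1
    last_final = 0                  # 0: no final state seen yet
    pos_at_last_final = start
    pos = start
    # A missing transition (or end of input) stands for the dead state.
    while pos < len(text) and (state, text[pos]) in DTRAN:
        state = DTRAN[(state, text[pos])]
        pos += 1
        if state in FINAL_TYPE:
            last_final, pos_at_last_final = state, pos
    if last_final == 0:
        return None
    return FINAL_TYPE[last_final], text[start:pos_at_last_final], pos_at_last_final
```

On `iffail` the variables evolve as in the table: Last-Final moves 2, 3, 4 and then stays at 4 while the position advances to 6. The next token would be scanned starting at the returned end position.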
44Scanner Generators
45Flex A Scanner Generator
A language for specifying scanners
[Figure: lang.l -> Flex compiler -> lex.yy.c; lex.yy.c -> C compiler (-lfl) -> a.out; source code -> a.out -> tokens]
46Flex Programs
    %{
    auxiliary declarations
    %}
    regular definitions
    %%
    translation rules
    %%
    auxiliary procedures
47Translation Rules
    P1    action1
    P2    action2
    ...
    Pn    actionn

where Pi are regular expressions and actioni are C program segments
48Example 1
    %%
    username    printf( "%s", getlogin() );

By default, any text not matched by a flex scanner is copied to the output. This scanner copies its input file to its output, with each occurrence of "username" replaced by the user's login name.
49Example 2
    %{
    int lines = 0, chars = 0;
    %}
    %%
    \n    { ++lines; ++chars; }
    .     { ++chars; }    /* all characters except \n */
    %%
    main()
    {
        yylex();
        printf("lines = %d, chars = %d\n", lines, chars);
    }
50Example 3
    %{
    #define EOF 0
    #define LE 25
    ...
    %}
    delim     [ \t\n]
    ws        {delim}+
    letter    [A-Za-z]
    digit     [0-9]
    id        {letter}({letter}|{digit})*
    number    {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
    %%
51Example 3
    {ws}        { /* no action and no return */ }
    if          { return (IF); }
    else        { return (ELSE); }
    {id}        { yylval = install_id(); return (ID); }
    {number}    { yylval = install_num(); return (NUMBER); }
    "<="        { yylval = LE; return (RELOP); }
    "=="        { yylval = EQ; return (RELOP); }
    ...
    <<EOF>>     { return (EOF); }
    %%
    install_id() { ... }
    install_num() { ... }
52Functions and Variables
    yylex()    a function implementing the lexical analyzer and returning the token matched
    yytext     a global pointer variable pointing to the lexeme matched
    yyleng     a global variable giving the length of the lexeme matched
    yylval     an external global variable storing the attribute of the token
53NFA from Flex Programs
P1 | P2 | ... | Pn
54Rules
- Look for the longest lexeme
  - number
- Look for the first-listed pattern that matches the longest lexeme
  - keywords and identifiers
- List frequently occurring patterns first
  - white space
55Rules
- View keywords as exceptions to the rule of identifiers
  - construct a keyword table
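A sketch of the keyword-table approach: the scanner has a single identifier pattern, and each matched lexeme is looked up in a table to decide whether it is a keyword. The table contents, the identifier pattern, and the name `classify` are hypothetical.

```python
import re

KEYWORDS = {"if": "IF", "else": "ELSE"}        # hypothetical keyword table
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")    # the only pattern the scanner needs

def classify(lexeme):
    """Token type of an identifier-shaped lexeme: a keyword type or ID."""
    if not IDENT.fullmatch(lexeme):
        raise ValueError(f"not an identifier-shaped lexeme: {lexeme!r}")
    return KEYWORDS.get(lexeme, "ID")
```

The trade-off is a smaller automaton (no separate states per keyword) at the cost of one table lookup per identifier.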
56Rules
- Start condition: <s>r matches r only in start condition s
- Start conditions are declared in the first section using either %s or %x

    %s str

- A start condition is activated using the BEGIN action

    \"            BEGIN(str);
    <str>[^"]*    { /* eat up string body */ }

- The default start condition is INITIAL

    <str>\"       BEGIN(INITIAL);
57Lexical Error Recovery
- Error: none of the patterns matches a prefix of the remaining input
- Panic mode error recovery
  - delete successive characters from the remaining input until pattern matching can continue
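Panic mode recovery can be sketched as follows, assuming a small hypothetical token set (ID, NUM, white space) matched with Python's `re` module. When nothing matches at the current position, one character is deleted and recorded, and scanning resumes at the next position.

```python
import re

# Hypothetical token set; the alternatives start with disjoint characters.
TOKEN = re.compile(r"(?P<ID>[a-z][a-z0-9]*)|(?P<NUM>[0-9]+)|(?P<WS>[ \t\n]+)")

def scan(text):
    """Return (tokens, errors); errors are (position, character) pairs."""
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m:
            if m.lastgroup != "WS":
                tokens.append((m.lastgroup, m.group()))
            pos = m.end()
        else:
            # Panic mode: drop one character and try again.
            errors.append((pos, text[pos]))
            pos += 1
    return tokens, errors

tokens, errors = scan("ab $# 12")
```

Collecting the skipped characters, rather than stopping at the first one, lets a single run report every lexical error in the input.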
58Maintaining Line Number
- Flex can maintain the number of the current line in the global variable yylineno when the following option is given in the first section

    %option yylineno
59From a RE to an NFA
- Thompson's construction algorithm
  - For ε, construct
    [Figure: start -> i -ε-> f]
  - For a in alphabet, construct
    [Figure: start -> i -a-> f]
60From a RE to an NFA
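- Suppose N(s) and N(t) are NFA for RE s and t
  - for s | t, construct
    [Figure: new start state i and new final state f around N(s) (from is to fs) and N(t) (from it to ft)]
  - for s t, construct
    [Figure: start -> i, N(s) followed by N(t), joined at fs and it]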
61From a RE to an NFA
- for s*, construct
  [Figure: start -> i, with N(s) (from is to fs) inside]
- for (s), use N(s)
62An Example
RE: (a | b)*abb
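The construction rules can be sketched as small Python functions that build NFA fragments, each with one start state i and one final state f. The class `Frag`, the function names, and the state numbering are assumptions of this sketch, so the numbers will not match the slide diagrams. For concatenation, the textbook construction usually merges the final state of N(s) with the start state of N(t); this sketch links them with an ε-edge instead, which accepts the same language.

```python
from itertools import count

EPS = ""            # label used for ε-edges in this sketch
_ids = count(1)     # fresh state numbers

class Frag:
    """NFA fragment with start state i, final state f, and labeled edges."""
    def __init__(self, i=None, f=None):
        self.i = next(_ids) if i is None else i
        self.f = next(_ids) if f is None else f
        self.edges = []                     # (src, label, dst)

def symbol(a):                              # symbol(EPS) gives the ε case
    n = Frag()
    n.edges = [(n.i, a, n.f)]
    return n

def alt(s, t):                              # s | t
    n = Frag()
    n.edges = s.edges + t.edges + [
        (n.i, EPS, s.i), (n.i, EPS, t.i), (s.f, EPS, n.f), (t.f, EPS, n.f)]
    return n

def concat(s, t):                           # s t (ε-linked variant)
    n = Frag(s.i, t.f)
    n.edges = s.edges + t.edges + [(s.f, EPS, t.i)]
    return n

def star(s):                                # s*
    n = Frag()
    n.edges = s.edges + [
        (n.i, EPS, s.i), (s.f, EPS, n.f), (s.f, EPS, s.i), (n.i, EPS, n.f)]
    return n

def accepts(nfa, text):
    """Simulate the fragment as an NFA on text."""
    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            t = stack.pop()
            for src, lab, dst in nfa.edges:
                if src == t and lab == EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen
    S = closure({nfa.i})
    for c in text:
        S = closure({dst for src, lab, dst in nfa.edges if src in S and lab == c})
    return nfa.f in S

# (a | b)*abb
nfa = concat(concat(concat(star(alt(symbol("a"), symbol("b"))),
                           symbol("a")), symbol("b")), symbol("b"))
```

Each operator adds at most two states, so the NFA grows linearly with the size of the regular expression.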
63From an NFA to a DFA
Subset construction algorithm.
Input: An NFA N.
Output: A DFA D with states Dstates and transition table Dtran.

    begin
        add ε-closure(s0) as an unmarked state to Dstates;
        while there is an unmarked state T in Dstates do
        begin
            mark T;
            for each input symbol a do
            begin
                U := ε-closure(move(T, a));
                if U is not in Dstates then
                    add U as an unmarked state to Dstates;
                Dtran[T, a] := U
            end
        end
    end.
64An Example
RE: (a | b)*abb

[Figure: NFA for (a | b)*abb with states 1 to 11]
65An Example
    ε-closure({1}) = {1,2,3,5,8} = A
    ε-closure(move(A, a)) = ε-closure({4,9}) = {2,3,4,5,7,8,9} = B
    ε-closure(move(A, b)) = ε-closure({6}) = {2,3,5,6,7,8} = C
    ε-closure(move(B, a)) = ε-closure({4,9}) = B
    ε-closure(move(B, b)) = ε-closure({6,10}) = {2,3,5,6,7,8,10} = D
    ε-closure(move(C, a)) = ε-closure({4,9}) = B
    ε-closure(move(C, b)) = ε-closure({6}) = C
    ε-closure(move(D, a)) = ε-closure({4,9}) = B
    ε-closure(move(D, b)) = ε-closure({6,11}) = {2,3,5,6,7,8,11} = E
    ε-closure(move(E, a)) = ε-closure({4,9}) = B
    ε-closure(move(E, b)) = ε-closure({6}) = C
66An Example
                                Input Symbol
    State                       a       b
    A = {1,2,3,5,8}             B       C
    B = {2,3,4,5,7,8,9}         B       D
    C = {2,3,5,6,7,8}           B       C
    D = {2,3,5,6,7,8,10}        B       E
    E = {2,3,5,6,7,8,11}        B       C
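The subset construction and the worked example above can be sketched in Python. The NFA edges are an assumption: they are inferred from the closures and moves listed in the example, since the diagram is not reproduced here. The list `dstates` keeps discovery order, and the index `i` separates marked states (before it) from unmarked ones.

```python
EPS = ""   # label used for ε-transitions in this sketch

# Assumed 11-state NFA for (a | b)*abb: (state, symbol) -> set of states
NFA = {
    (1, EPS): {2, 8}, (2, EPS): {3, 5}, (3, "a"): {4}, (5, "b"): {6},
    (4, EPS): {7}, (6, EPS): {7}, (7, EPS): {2, 8},
    (8, "a"): {9}, (9, "b"): {10}, (10, "b"): {11},
}
START, ALPHABET = 1, "ab"

def eps_closure(states):
    closure, stack = set(states), list(states)
    while stack:
        t = stack.pop()
        for u in NFA.get((t, EPS), ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return frozenset(closure)      # hashable, so it can be a table key

def move(states, a):
    out = set()
    for s in states:
        out |= NFA.get((s, a), set())
    return out

def subset_construction():
    dstates = [eps_closure({START})]
    dtran = {}
    i = 0
    while i < len(dstates):        # states from index i onward are unmarked
        T = dstates[i]
        i += 1                     # mark T
        for a in ALPHABET:
            U = eps_closure(move(T, a))
            if U not in dstates:
                dstates.append(U)
            dtran[(T, a)] = U
    return dstates, dtran

dstates, dtran = subset_construction()
names = {S: n for S, n in zip(dstates, "ABCDE")}
table = {(names[T], a): names[U] for (T, a), U in dtran.items()}
```

For an NFA in which some move is empty, the empty set would appear as an extra dead state; that case does not arise in this example.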
67An Example
[Figure: transition diagram of the resulting DFA]