Title: Lexical Analysis
1Lexical Analysis
2Outline
- Scanners
- Tokens
- Regular expressions
- Finite automata
- Automatic conversion from regular expressions to
finite automata
3Lexical Analysis
- Lexical analysis recognizes the vocabulary of the
programming language and transforms a string of
characters into a string of words or tokens
- Lexical analysis discards white spaces and
comments between the tokens
- Lexical analyzer (or scanner) is the program that
performs lexical analysis
4The Reason Why Lexical Analysis is a Separate
Phase
- Simplifies the design of the compiler
- Provides efficient implementation
- Systematic techniques to implement lexical
analyzers by hand or automatically from
specifications
- Stream buffering methods to scan input
- Improves portability
- Non-standard symbols and alternate character
encodings can be normalized
5Interaction of the Lexical Analyzer with the
Parser
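[Figure: the Parser sends "get next token" requests to the Lexical Analyzer, which reads the Source Program and returns (token, tokenval) pairs; both components report errors and access the Symbol Table]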
6Lexical Function
- 1. Read the input stream (sequence of
characters), group the characters into primitives
(tokens).
- Returns token as <type, value>.
- 2. Throw out certain sequences of characters
(blanks, comments, etc.).
- 3. Build the symbol table, string table, constant
table, etc.
- 4. Generate error messages.
- 5. Convert, for example, string → integer.
- Tokens are described using regular expressions
7Scanner (Lexical Analyzer)
- Main task
- - read the characters of the source language (a
stream of characters)
- - and break it up into tokens, the smallest
meaningful units of the source language.
- Other tasks
- - remove the comments,
- - perform conversions (if needed),
- - remove the white space,
- - interpret the compiler directives,
- - prepare the listing of the source program
8Tokens, Patterns, and Lexemes
- A token is a classification of lexical units
- For example: id and num
- Lexemes are the specific character strings that
make up a token
- For example: abc and 123
- Patterns are rules describing the set of lexemes
belonging to a token
- For example: "letter followed by letters and
digits" and "non-empty sequence of digits"
9Lexical Analysis Process
Preprocessed source code, read char by char:
if (b == 0) a = b;
Lexical Analysis or Scanner
Token stream: if  (  b  ==  0  )  a  =  b  ;
Lexical analysis
- Transform multi-character input stream to token stream
- Reduce length of program representation (remove spaces)
10Lexemes
- Lexemes are the lowest level syntactic units.
- Example
- val = (int)(xdot + y*0.3);
- In the above statement, the lexemes are
- val, =, (, int, ), (, xdot, +, y, *, 0.3, ), ;
11Tokens
- Tokens are the atomic units of a language, and
are usually specific strings or instances of
classes of strings
- The categories of lexemes are tokens.
- Identifiers: Names chosen by the programmer.
- val, xdot, y
- Keywords: Names chosen by the language designer
to help syntax and structure.
- int, return, void.
- (Keywords that cannot be used as identifiers are
known as RESERVED WORDS)
12Tokens
- A token is a sequence of characters that can be
treated as a unit in the grammar of a programming
language
- A programming language classifies tokens into a
finite set of token types

Type    Examples
ID      foo  i  n
NUM     73  13
IF      if
COMMA   ,
13Tokens (Contd.)
- Integers 2 1000 -20
- Floating-point 2.0 -0.010 .02
- Symbols: @ << >>
- Strings: "x" "He said, \"I love G52CMP\""
- Comments: /* Hi and Bye */
14Tokens (Contd.)
- Operators Identify actions.
- , , !
- Literals: Denote values directly.
- 3.14, -10, 'a', true, null
- Punctuation Symbols: Support syntactic
structure.
- (, ), , ,
15Tokens (Contd.)
Token types are usually divided into the
following classes:
(1) Variable-length tokens,
e.g., identifiers, numerical or string constants,
keywords
(2) Fixed-length tokens
(2a) simple tokens, e.g., +, -, ...
(2b) compound tokens, e.g., <=, !=, ...
16Identifiers Vs. Keywords
Programming languages use fixed strings to
identify particular keywords, e.g., if, then,
else, etc. Since keywords have the same form as
identifiers, the Lexical Analyzer must distinguish
between these two possibilities. If keywords are
reserved (not used as identifiers), we can
initialize the symbol table with all the keywords
and mark them as such. Then, a string is
recognized as an identifier only if it is not
already in the symbol table as a keyword.
17Token Structure (Example)
18Tokens
- Identifiers x y11 elsex
- Keywords if else while for break
- Integers 2 1000 -20
- Floating-point 2.0 -0.0010 .02 1e5
- Symbols: << < <=
- Strings: "x" "He said, \"I love EECS 483\""
- Comments: /* bla bla bla */
19Semantic Values of Tokens
- Semantic values are used to distinguish different
tokens in a token type
- <ID, foo>, <ID, i>, <ID, n>
- <NUM, 73>, <NUM, 13>
- <IF, >
- <COMMA, >
- Token types affect syntax analysis and semantic
values affect semantic analysis
20Attributes for Tokens
When a Token can be generated by different
Lexemes, the Lexical Analyzer must transmit the
Lexeme to the subsequent phases of the
compiler. Such information is specified as an
Attribute associated to the Token. Usually, the
attribute of a Token is a pointer to the symbol
table entry that keeps information about the
Token. The Token influences parsing decisions:
the Parser relies on the token distinctions, e.g.,
identifiers are treated differently than keywords.
The Attribute influences the translation phase.
21Attributes for Tokens An Example
- Example. Let us consider the following assignment
statement
- E = M * C ** 2
- Then the following pairs (token, attribute) are
passed to the Parser
- (id, pointer to symbol-table entry for E)
- (assign-op,)
- (id, pointer to symbol-table entry for M)
- (mult-op,)
- (id, pointer to symbol-table entry for C)
- (exp-op,)
- (num, integer value 2).
- Some Tokens have a null attribute: the Token is
sufficient to identify the Lexeme.
- From an implementation point of view, each
token is encoded as an integer number.
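The (token, attribute) pairs above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expressions in TOKEN_SPEC, the function name tokenize, and the use of a list index as the "pointer" into the symbol table are all hypothetical choices, not part of the slides.

```python
import re

# Hypothetical token patterns; the token names mirror the slide
# (id, assign-op, mult-op, exp-op, num).
TOKEN_SPEC = [
    ("num",       re.compile(r"[0-9]+")),
    ("id",        re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("exp-op",    re.compile(r"\*\*")),  # tried before mult-op so ** is one token
    ("mult-op",   re.compile(r"\*")),
    ("assign-op", re.compile(r"=")),
    ("skip",      re.compile(r"\s+")),   # white space is discarded
]

def tokenize(text):
    """Return (pairs, symbol_table); an id's attribute is its table index."""
    symbol_table = []   # one entry per distinct identifier
    pairs = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = pattern.match(text, pos)
            if m:
                break
        else:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        lexeme = m.group()
        pos = m.end()
        if name == "skip":
            continue
        if name == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            pairs.append((name, symbol_table.index(lexeme)))
        elif name == "num":
            pairs.append((name, int(lexeme)))   # attribute is the integer value
        else:
            pairs.append((name, None))          # null attribute
    return pairs, symbol_table

pairs, table = tokenize("E = M * C ** 2")
for pair in pairs:
    print(pair)
```

Operators carry a null attribute (None here) because the token alone identifies the lexeme, as the slide notes.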
22Semantic Values of Tokens
- Example: In a line of Java language
- if (mark > 80) grade = 'A';
- tokens are
- if ( mark > 80 ) grade = 'A' ;
- The scanner is concerned with putting the tokens
together.
- It does not check whether the tokens are in a
correct order
23How to Describe Tokens
- Use regular expressions to describe programming
language tokens!
- A regular expression (RE) is defined inductively
- a: ordinary character stands for itself
- ε: empty string
- R|S: either R or S (alternation), where R, S are REs
- RS: R followed by S (concatenation)
- R*: concatenation of R 0 or more times (Kleene
closure)
24Lexical Analysis, How?
- First, write down the lexical specification (how
is each token defined?)
- using regular expressions to specify the lexical
structure
- identifier = letter (letter | digit | underscore)*
- letter = a | ... | z | A | ... | Z
- digit = 0 | 1 | ... | 9
- Second, based on the above lexical specification,
build the lexical analyzer (to recognize tokens)
by hand,
- Regular Expression Spec => NFA => DFA
=> Transition Table => Lexical Analyzer
- Or just by using lex, the lexical analyzer
generator
- Regular Expression Spec (in lex format) => feed
to lex => Lexical Analyzer
25Scanner Generators
[Figure: a scanner definition in a metalanguage is fed to the Scanner Generator, which produces a Scanner; the Scanner reads a program in the programming language and outputs token types and semantic values]
26Languages
- An alphabet is a finite set of symbols.
- Example: Σb = {0, 1}, the binary alphabet
- A language is a set of strings
- L1 = {00, 01, 10, 11}, all strings of length 2
- A string is a finite sequence of symbols taken
from a finite alphabet
- 10011 from Σb
27Specification of Patterns for Tokens Regular
Expressions
- Basis symbols
- ε is a regular expression denoting language {ε}
- a ∈ Σ is a regular expression denoting {a}
- If r and s are regular expressions denoting
languages L(r) and M(s) respectively, then
- r|s is a regular expression denoting L(r) ∪ M(s)
- rs is a regular expression denoting L(r)M(s)
- r* is a regular expression denoting L(r)*
- (r) is a regular expression denoting L(r)
- A language defined by a regular expression is
called a regular set
28Specification of Patterns for Tokens Notational
Shorthand
- The following shorthands are often used
r+ = rr*
r? = r|ε
[a-z] = a|b|c|...|z
- Examples
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
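As a quick check of the num definition, here is a minimal sketch using Python's re module. The names NUM and is_num are hypothetical, and the translation into Python regex syntax is an assumption about how the slide's notation maps onto that library.

```python
import re

# digit -> [0-9]
# num   -> digit+ (. digit+)? ( E (+|-)? digit+ )?
NUM = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")

def is_num(s):
    """True if the whole string s is a lexeme of the token num."""
    return NUM.fullmatch(s) is not None

for lexeme in ["5230", "3.14", "6.45E4", "1.84E-4", ".02", "1."]:
    print(lexeme, is_num(lexeme))
```

Under this definition a fraction needs digits on both sides of the point, so ".02" and "1." are rejected.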
29Languages
- The C language is the (infinite) set of all
strings that constitute legal C programs
- The language of C reserved words is the (finite)
set of all alphabetic strings that cannot be used
as identifiers in C programs
- Each token type is a language
30Languages
- Σ* denotes the set of all words over the alphabet
Σ.
- |s| denotes the length of string s
- ε denotes the word of length 0, the empty word.
- ∅ denotes the empty set, different from {ε}
- Note 1: In language theory the terms sentence and
word are often used as synonyms for the term
string
- Note 2: A language (over alphabet Σ) is a set of
strings (over alphabet Σ). For example, if Σ = {a},
one possible language is L = {ε, a, aa, aaa}.
31Examples
- Length of a string
- Example
- |10011| = 5
- |WHILE| = 5
- |WHILE| = 1 (WHILE treated as a single symbol)
- Empty string
- |ε| = 0
- Concatenation
- Two strings x and y are joined together: x·y = xy
- Example
- x = AB, y = CDE produce xy = ABCDE
- |xy| = |x| + |y|
- xy ≠ yx (not commutative)
- εx = xε = x
- String exponentiation
- x^0 = ε
- x^1 = x
- x^2 = xx
- x^n = x x^(n-1), n > 1
32Terms for parts of a string
- TERM: DEFINITION
- prefix of s: A string obtained by removing zero
or more trailing symbols of string s; ban
is a prefix of banana.
- suffix of s: A string formed by deleting zero or
more of the leading symbols of s; nana is
a suffix of banana.
- substring of s: A string obtained by deleting a
prefix and a suffix from s; nan is a
substring of banana. Every prefix and every
suffix of s is a substring of s, but not every
substring of s is a prefix or a suffix of s. For
every string s, both s and ε are prefixes,
suffixes, and substrings of s.
- proper prefix, suffix, or substring of s: Any
nonempty string x that is, respectively, a prefix,
suffix, or substring of s such that s ≠ x.
- subsequence of s: Any string formed by deleting
zero or more not necessarily contiguous
symbols from s; baaa is a subsequence of
banana.
33Terms for parts of a string (examples)
- Let us take this string: banana
- prefix
- ε, b, ba, ban, ..., banana
- suffix
- ε, a, na, ana, ..., banana
- substring: ε, b, a, n, ba, an, na, ..., banana
- subsequence: ε, b, a, n, ba, bn, an, aa, na,
nn, ..., banana
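The four terms can be checked mechanically. The following Python sketch is illustrative only; the helper names prefixes, suffixes, substrings, and is_subsequence are hypothetical.

```python
def prefixes(s):
    """All prefixes of s, shortest first (the empty string is included)."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All suffixes of s, shortest first (the empty string is included)."""
    return [s[i:] for i in range(len(s), -1, -1)]

def substrings(s):
    """Every string obtained by deleting a prefix and a suffix of s."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def is_subsequence(x, s):
    """True if x can be formed by deleting zero or more symbols of s."""
    remaining = iter(s)
    # 'ch in remaining' consumes the iterator up to the first match,
    # so the symbols of x must appear in s in the same order.
    return all(ch in remaining for ch in x)

print(prefixes("banana"))
print(suffixes("banana"))
print("nan" in substrings("banana"))
print(is_subsequence("baaa", "banana"))
```

Note that bn is a subsequence of banana but not a substring, because its symbols are not contiguous in banana.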
34Example
- L is the set {A, B, . . ., Z, a, b, . . . , z}
and D the set {0, 1, . . . , 9}. Since a symbol
can be regarded as a string of length one, the
sets L and D are each finite languages. The
following are some examples of new languages
created from L and D
- 1. L ∪ D is the set of letters and digits.
- 2. LD is the set of strings consisting of a
letter followed by a digit.
- 3. L^4 is the set of all four-letter strings.
- 4. L* is the set of all strings of letters,
including ε, the empty string.
- 5. L(L ∪ D)* is the set of all strings of letters
and digits beginning with a letter.
- 6. D+ is the set of all strings of one or more
digits.
35How to Break up Text
Input: elsex = 0
Option 1: else | x | = | 0
Option 2: elsex | = | 0
- REs alone are not enough; we need a rule for choosing
when there are multiple matches
- Longest matching token wins
- Ties in length resolved by priorities
- Token specification order often defines priority
- REs + priorities + longest matching token rule =
definition of a lexer
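The longest-match rule with priority tie-breaking can be sketched as follows. This is a minimal illustration, not a production lexer: the token names in RULES and the function lex are hypothetical, and rule order stands in for priority.

```python
import re

# Rules in priority order: an earlier rule wins a tie in match length.
RULES = [
    ("ELSE",   re.compile(r"else")),
    ("ID",     re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("NUM",    re.compile(r"[0-9]+")),
    ("ASSIGN", re.compile(r"=")),
    ("WS",     re.compile(r"[ \t\n]+")),
]

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier (higher-priority) rule on a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":                 # white space is discarded
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(lex("elsex = 0"))
print(lex("else x = 0"))
```

On elsex the ID rule matches five characters and beats the four-character ELSE match; on else followed by a space both rules match four characters, so the earlier rule ELSE wins.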
36Regular Expressions (RE)
- A language allows us to use a finite description
to specify a (possibly infinite) set
- RE is the metalanguage used to define the token
types of a programming language
37Regular Expressions
- ε is a RE denoting L = {ε}
- If a ∈ alphabet, then a is a RE denoting L = {a}
- Suppose r and s are RE denoting L(r) and L(s)
- alternation: (r) | (s) is a RE denoting L(r) ∪
L(s)
- concatenation: (r)(s) is a RE denoting
L(r)L(s)
- repetition: (r)* is a RE denoting (L(r))*
- (r) is a RE denoting L(r)
38Examples
- a | b = {a, b}
- (a | b)(a | b) = {aa, ab, ba, bb}
- a* = {ε, a, aa, aaa, ...}
- (a | b)* = the set of all strings of a's and b's
- a | a*b = the set containing the string a and
all strings consisting of zero or more a's
followed by a b
39Specification of Patterns for Tokens Regular
Definitions
- Regular definitions introduce a naming
convention
d1 → r1
d2 → r2
...
dn → rn
where each ri is a regular expression over
Σ ∪ {d1, d2, ..., di-1}
- Any dj in ri can be textually substituted in ri
to obtain an equivalent set of definitions
40Specification of Patterns for Tokens Regular
Definitions
- Example
letter → A|B|...|Z|a|b|...|z
digit → 0|1|...|9
id → letter ( letter|digit )*
- Regular definitions are not recursive
digits → digit digits | digit    wrong!
41Regular Definitions Examples
Example 1. Identifiers are usually strings of
letters and digits beginning with a letter
letter → A | B | . . . | Z | a | b | . . . | z
digit → 0 | 1 | . . . | 9
id → letter (letter | digit)*
42Regular Definitions Examples (Cont.)
Example 2. Numbers are usually strings such as
5230, 3.14, 6.45E4, 1.84E-4.
digit → 0 | 1 | . . . | 9
digits → digit digit*
optional-fraction → (. digits) | ε
optional-exponent → (E (+ | - | ε) digits) | ε
num → digits optional-fraction optional-exponent
43Regular Definitions and Grammars
Grammar
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num
Regular definitions
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
44Finite Automata
- A finite automaton is a finite-state transition
diagram that can be used to model the recognition
of a token type specified by a regular expression
- A finite automaton can be a nondeterministic
finite automaton or a deterministic finite
automaton
45Nondeterministic Finite Automata (NFA)
- An NFA consists of
- A finite set of states
- A finite set of input symbols
- A transition function that maps (state, symbol)
pairs to sets of states
- A state distinguished as start state
- A set of states distinguished as final states
46Nondeterministic Finite Automata
- An NFA is a 5-tuple (S, Σ, δ, s0, F) where
S is a finite set of states
Σ is a finite set of symbols, the alphabet
δ is a mapping from S × Σ to a set of states
s0 ∈ S is the start state
F ⊆ S is the set of accepting (or final) states
47Transition Graph
- An NFA can be diagrammatically represented by a
labeled directed graph called a transition graph
S = {0,1,2,3}
Σ = {a,b}
s0 = 0
F = {3}
[Figure: transition graph with start state 0, edges labeled a and b, and accepting state 3]
48Transition Table
- The mapping δ of an NFA can be represented in a
transition table
δ(0,a) = {0,1}
δ(0,b) = {0}
δ(1,b) = {2}
δ(2,b) = {3}
49The Language Defined by an NFA
- An NFA accepts an input string x if and only if
there is some path with edges labeled with
symbols from x in sequence from the start state
to some accepting state in the transition graph
- A state transition from one state to another on
the path is called a move
- The language defined by an NFA is the set of
input strings it accepts, such as (a|b)*abb for
the example NFA
50An Example
- RE: (a | b)*abb
- States: {1, 2, 3, 4}
- Input symbols: {a, b}
- Transition function:
(1,a) = {1,2}, (1,b) = {1},
(2,b) = {3}, (3,b) = {4}
- Start state: 1
- Final state: {4}
[Figure: transition graph of this NFA]
51Another Example
- RE: aa* | bb*
- States: {1, 2, 3, 4, 5}
- Input symbols: {a, b}
- Transition function:
(1, ε) = {2, 4}, (2, a) = {3}, (3, a) = {3},
(4, b) = {5}, (5, b) = {5}
- Start state: 1
- Final states: {3, 5}
52Acceptance of NFA
- An NFA accepts an input string s iff there is
some path in the finite-state transition diagram
from the start state to some final state such
that the edge labels along this path spell out s
- The language recognized by an NFA is the set of
strings it accepts
53An Example
RE: (a | b)*abb
Input: aabb
[Figure: transition graph of the NFA with states 1-4 and start state 1, tracing the input aabb]
54Finite-State Transition Diagram
RE: aa* | bb*
Input: aaa
[Figure: transition graph of the NFA with states 1-5 and start state 1, tracing the input aaa]
55Operations on NFA states
- ε-closure(S): set of states reachable from some
state s in S on ε-transitions alone
- move(S, c): set of states to which there is a
transition on input symbol c from some state s in
S
56An Example
RE: aa* | bb*
Input: aaa
S0 = {1}
S1 = ε-closure({1}) = {1,2,4}
S2 = move({1,2,4}, a) = {3}
S3 = ε-closure({3}) = {3}
S4 = move({3}, a) = {3}
S5 = ε-closure({3}) = {3}
S6 = move({3}, a) = {3}
S7 = ε-closure({3}) = {3}
3 is in {3, 5} → accept
57Simulating an NFA
Input: An input string ended with eof and an NFA
with start state s0 and final states F.
Output: The answer "yes" if accepts, "no"
otherwise.
begin
  S := ε-closure({s0})
  c := nextchar
  while c <> eof do
  begin
    S := ε-closure(move(S, c))
    c := nextchar
  end
  if S ∩ F <> ∅ then return "yes"
  else return "no"
end.
58Computation of ?-closure
RE: (a | b)*abb
[Figure: NFA for (a | b)*abb with states 1-11 and start state 1]
ε-closure({1}) = {1,2,3,5,8}
ε-closure({4}) = {2,3,4,5,7,8}
59Computation of ?-closure
Input: An NFA and a set of NFA states S.
Output: T = ε-closure(S).
begin
  push all states in S onto stack
  T := S
  while stack is not empty do
  begin
    pop t, the top element, off of stack
    for each state u with an edge from t to u labeled ε do
      if u is not in T then
      begin
        add u to T
        push u onto stack
      end
  end
  return T
end.
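The stack-based ε-closure above, together with the NFA simulation loop, can be sketched in Python. This is a minimal sketch under stated assumptions: the dictionary encoding DELTA, the empty string as the ε marker, and the function names are all hypothetical. The automaton is the aa* | bb* NFA from the earlier example.

```python
EPS = ""  # marker for epsilon edges (an assumption of this sketch)

# NFA for aa* | bb*: (state, symbol) -> set of next states
DELTA = {
    (1, EPS): {2, 4},
    (2, "a"): {3},
    (3, "a"): {3},
    (4, "b"): {5},
    (5, "b"): {5},
}
START, FINAL = 1, {3, 5}

def eps_closure(states, delta):
    """States reachable from 'states' on epsilon transitions alone."""
    stack = list(states)
    closure = set(states)
    while stack:
        t = stack.pop()
        for u in delta.get((t, EPS), ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

def move(states, c, delta):
    """States reachable from some state in 'states' on input symbol c."""
    return {u for s in states for u in delta.get((s, c), ())}

def nfa_accepts(text, delta=DELTA, start=START, final=FINAL):
    current = eps_closure({start}, delta)
    for c in text:
        current = eps_closure(move(current, c, delta), delta)
    return bool(current & final)   # accept iff S and F intersect

print(eps_closure({1}, DELTA))
print(nfa_accepts("aaa"), nfa_accepts("ab"))
```

Tracking a set of current states is what lets the simulation follow all nondeterministic choices at once.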
60Deterministic Finite Automata (DFA)
- A DFA is a special case of an NFA in which
- no state has an ε-transition
- for each state s and input symbol a, there is at
most one edge labeled a leaving s
61An Example
- RE: (a | b)*abb
- States: {1, 2, 3, 4}
- Input symbols: {a, b}
- Transition function:
(1,a) = 2, (2,a) = 2, (3,a) = 2, (4,a) = 2,
(1,b) = 1, (2,b) = 3, (3,b) = 4, (4,b) = 1
- Start state: 1
- Final state: {4}
62Deterministic Finite Automata
- A deterministic finite automaton is a special
case of an NFA
- No state has an ε-transition
- For each state s and input symbol a there is at
most one edge labeled a leaving s
- Each entry in the transition table is a single
state
- At most one path exists to accept a string
- Simulation algorithm is simple
63NFA and DFA
- NFA (Nondeterministic Finite Automaton)
- empty moves (reading ε) with state change are
possible
- ambiguous state transitions are possible
- NFA accepts input string if there exists a
computation (i.e., a sequence of state
transitions) that leads to accept and halt
- DFA (Deterministic Finite Automaton)
- No ε-transitions, no ambiguous transitions (δ is
a function)
- Special case of an NFA
- The main difference is a space vs. time tradeoff
- DFAs are faster to simulate than NFAs
- DFAs can be bigger than NFAs (exponentially larger
in the worst case).
64Example DFA
A DFA that accepts (a|b)*abb
[Figure: transition graph of the DFA with states 0-3, start state 0, and edges labeled a and b]
65Acceptance of DFA
- A DFA accepts an input string s iff there is one
path in the finite-state transition diagram from
the start state to some final state such that the
edge labels along this path spell out s
- The language recognized by a DFA is the set of
strings it accepts
66An Example
RE: (a | b)*abb
Input: aabb
67An Example
Input: bbababb
s = 1
s = move(1, b) = 1
s = move(1, b) = 1
s = move(1, a) = 2
s = move(2, b) = 3
s = move(3, a) = 2
s = move(2, b) = 3
s = move(3, b) = 4
4 is in {4} → accept
68Finite-State Transition Diagram
69Simulating a DFA
Input: An input string ended with eof and a DFA
with start state s0 and final states F.
Output: The answer "yes" if accepts, "no"
otherwise.
begin
  s := s0
  c := nextchar
  while c <> eof do
  begin
    s := move(s, c)
    c := nextchar
  end
  if s is in F then return "yes"
  else return "no"
end.
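The DFA simulation loop translates almost line for line into Python. The sketch below uses the transition function of the four-state DFA for (a | b)*abb given earlier; the dictionary encoding DTRAN and the function name dfa_accepts are hypothetical, and rejecting on a missing table entry is an assumption for symbols outside the alphabet.

```python
# DFA for (a | b)*abb: (state, symbol) -> next state
DTRAN = {
    (1, "a"): 2, (2, "a"): 2, (3, "a"): 2, (4, "a"): 2,
    (1, "b"): 1, (2, "b"): 3, (3, "b"): 4, (4, "b"): 1,
}

def dfa_accepts(text, dtran=DTRAN, start=1, final=frozenset({4})):
    s = start
    for c in text:
        if (s, c) not in dtran:
            return False      # no move defined for this symbol: reject
        s = dtran[(s, c)]     # exactly one successor, so no sets are needed
    return s in final

print(dfa_accepts("bbababb"))
print(dfa_accepts("abba"))
```

Each input character costs one table lookup, which is why DFA simulation is faster than tracking a set of NFA states.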
70Scanner Generators
71From a RE to an NFA
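- Thompson's construction algorithm
- For ε, construct an NFA with an ε-transition from
start state i to final state f
- For a in alphabet, construct an NFA with a
transition on a from start state i to final state f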
72From a RE to an NFA
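- Suppose N(s) and N(t) are NFA for RE s and t
- for s | t, construct a new start state i with
ε-transitions to the start states of N(s) and N(t),
and ε-transitions from the final states of N(s)
and N(t) to a new final state f
- for st, construct N(s) followed by N(t), where the
final state of N(s) is merged with the start state
of N(t)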
73From a RE to an NFA
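- for s*, construct a new start state i and a new
final state f, with ε-transitions from i to the
start state of N(s) and to f, and from the final
state of N(s) back to its start state and to f
- for (s), use N(s)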
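The construction rules can be sketched as small Python combinators. This is an illustrative sketch, not the slides' exact construction: the NFA class, the function names, and the edge-list encoding are hypothetical, and concatenation links the two machines with an ε edge instead of merging states, a common variant that accepts the same language.

```python
from itertools import count

EPS = ""          # epsilon marker (an assumption of this sketch)
_ids = count()    # source of fresh state numbers

class NFA:
    def __init__(self, start, final, edges):
        self.start, self.final = start, final
        self.edges = edges            # list of (source, label, target)

def symbol(a):                        # a single symbol, or EPS for epsilon
    i, f = next(_ids), next(_ids)
    return NFA(i, f, [(i, a, f)])

def alt(s, t):                        # s | t
    i, f = next(_ids), next(_ids)
    extra = [(i, EPS, s.start), (i, EPS, t.start),
             (s.final, EPS, f), (t.final, EPS, f)]
    return NFA(i, f, s.edges + t.edges + extra)

def cat(s, t):                        # st (epsilon link instead of a merge)
    return NFA(s.start, t.final, s.edges + t.edges + [(s.final, EPS, t.start)])

def star(s):                          # s*
    i, f = next(_ids), next(_ids)
    extra = [(i, EPS, s.start), (i, EPS, f),
             (s.final, EPS, s.start), (s.final, EPS, f)]
    return NFA(i, f, s.edges + extra)

def accepts(nfa, text):
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            t = stack.pop()
            for src, label, dst in nfa.edges:
                if src == t and label == EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen
    current = closure({nfa.start})
    for c in text:
        current = closure({d for s, l, d in nfa.edges if s in current and l == c})
    return nfa.final in current

# (a | b)*abb
n = cat(star(alt(symbol("a"), symbol("b"))),
        cat(symbol("a"), cat(symbol("b"), symbol("b"))))
print(accepts(n, "aabb"), accepts(n, "ab"))
```

Every combinator returns a machine with exactly one start and one final state, which is what makes the pieces composable.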
74An Example
RE: (a | b)*abb
75From an NFA to a DFA
Subset construction algorithm.
Input: An NFA N.
Output: A DFA D with states Dstates and
transition table Dtran.
begin
  add ε-closure({s0}) as an unmarked state to Dstates
  while there is an unmarked state T in Dstates do
  begin
    mark T
    for each input symbol a do
    begin
      U := ε-closure(move(T, a))
      if U is not in Dstates then
        add U as an unmarked state to Dstates
      Dtran[T, a] := U
    end
  end
end.
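A compact Python sketch of the subset construction follows. The edge set in NFA_DELTA is an assumption: it encodes an 11-state NFA for (a | b)*abb numbered 1 to 11, chosen to be consistent with the ε-closures ε-closure({1}) = {1,2,3,5,8} and ε-closure({4}) = {2,3,4,5,7,8} given earlier. All function and variable names are hypothetical.

```python
EPS = ""  # epsilon marker (an assumption of this sketch)

# Assumed NFA for (a | b)*abb, states 1..11, start state 1, final state 11.
NFA_DELTA = {
    (1, EPS): {2, 8}, (2, EPS): {3, 5}, (3, "a"): {4}, (5, "b"): {6},
    (4, EPS): {7}, (6, EPS): {7}, (7, EPS): {2, 8},
    (8, "a"): {9}, (9, "b"): {10}, (10, "b"): {11},
}

def eps_closure(states, delta):
    stack, closure = list(states), set(states)
    while stack:
        t = stack.pop()
        for u in delta.get((t, EPS), ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return frozenset(closure)         # hashable, so it can be a DFA state

def move(states, c, delta):
    return {u for s in states for u in delta.get((s, c), ())}

def subset_construction(delta, start, alphabet):
    start_set = eps_closure({start}, delta)
    dstates = [start_set]             # DFA states in discovery order
    unmarked = [start_set]
    dtran = {}
    while unmarked:
        T = unmarked.pop()            # "mark" T by removing it from the worklist
        for a in alphabet:
            U = eps_closure(move(T, a, delta), delta)
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    return dstates, dtran

dstates, dtran = subset_construction(NFA_DELTA, 1, "ab")
names = {state: "ABCDE"[i] for i, state in enumerate(dstates)}
for state in dstates:
    print(names[state], sorted(state),
          names[dtran[(state, "a")]], names[dtran[(state, "b")]])
```

Each DFA state is a frozenset of NFA states; a DFA state is accepting when it contains an NFA final state, here state 11.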
76An Example
RE: (a | b)*abb
[Figure: NFA for (a | b)*abb with states 1-11 and start state 1]
77An Example
ε-closure({1}) = {1,2,3,5,8} = A
ε-closure(move(A, a)) = ε-closure({4,9}) = {2,3,4,5,7,8,9} = B
ε-closure(move(A, b)) = ε-closure({6}) = {2,3,5,6,7,8} = C
ε-closure(move(B, a)) = ε-closure({4,9}) = B
ε-closure(move(B, b)) = ε-closure({6,10}) = {2,3,5,6,7,8,10} = D
ε-closure(move(C, a)) = ε-closure({4,9}) = B
ε-closure(move(C, b)) = ε-closure({6}) = C
ε-closure(move(D, a)) = ε-closure({4,9}) = B
ε-closure(move(D, b)) = ε-closure({6,11}) = {2,3,5,6,7,8,11} = E
ε-closure(move(E, a)) = ε-closure({4,9}) = B
ε-closure(move(E, b)) = ε-closure({6}) = C
78An Example
State                     Input Symbol
                          a    b
A = {1,2,3,5,8}           B    C
B = {2,3,4,5,7,8,9}       B    D
C = {2,3,5,6,7,8}         B    C
D = {2,3,5,6,7,8,10}      B    E
E = {2,3,5,6,7,8,11}      B    C
79An Example
start
80Subset Construction Example 2
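[Figure: NFA with start state 0 and states 0-8 whose accepting states are labeled with the actions a1, a2, a3, and the DFA with states A-F obtained from it by subset construction]
Dstates
A = {0,1,3,7}
B = {2,4,7}
C = {8}
D = {7}
E = {5,8}
F = {6,8}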
81Combined Finite Automata
[Figure: three separate finite automata, one per token type]
IF: if
ID: [a-z][a-z0-9]*
REAL: ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)
82Combined Finite Automata
[Figure: NFA with states 1-11 that combines the automata for IF, ID, and REAL; a new start state 1 has ε-transitions to the start state of each automaton]
83Combined Finite Automata
[Figure: DFA with states 1-8 equivalent to the combined NFA; its accepting states are labeled IF, ID, or REAL]
84Combining the NFAs of a Set of Regular Expressions
a    { action1 }
abb  { action2 }
a*b+ { action3 }
[Figure: the three separate NFAs (states 1-2, 3-6, and 7-8) and the combined NFA, in which a new start state 0 has ε-transitions to states 1, 3, and 7]
85Flex A Scanner Generator
lang.l (a language for specifying scanners) → Flex compiler → lex.yy.c
lex.yy.c → C compiler (-lfl) → a.out
source code → a.out → tokens
86Coding Regular Definitions in Transition Diagrams
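relop → < | <= | <> | > | >= | =
Transition diagram for relop (start state 0):
0 on < goes to 1
1 on = goes to 2: return(relop, LE)
1 on > goes to 3: return(relop, NE)
1 on other goes to 4: return(relop, LT)
0 on = goes to 5: return(relop, EQ)
0 on > goes to 6
6 on = goes to 7: return(relop, GE)
6 on other goes to 8: return(relop, GT)
id → letter ( letter | digit )*
Transition diagram for id (start state 9):
9 on letter goes to 10
10 on letter or digit stays in 10
10 on other goes to 11: return(gettoken(), install_id())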
87Flex Programs
%{ auxiliary declarations %}
regular definitions
%%
translation rules
%%
auxiliary procedures
88Translation Rules
P1 { action1 }
P2 { action2 }
...
Pn { actionn }
where Pi are regular expressions and actioni are
C program segments
89The Lex and Flex Scanner Generators
- Lex and its newer cousin flex are scanner
generators
- Systematically translate regular definitions into
C source code for efficient scanning
- Generated code is easy to integrate in C
applications