Lexical Analysis

Slides: 90

Provided by: CSIT4

1
Lexical Analysis
2
Outline

Scanners
Tokens
Regular expressions
Finite automata
Automatic conversion from regular expressions to
finite automata

3
Lexical Analysis

Lexical analysis recognizes the vocabulary of the
programming language and transforms a string of
characters into a string of words or tokens
Lexical analysis discards white spaces and
comments between the tokens
Lexical analyzer (or scanner) is the program that
performs lexical analysis

4
The Reason Why Lexical Analysis is a Separate
Phase

Simplifies the design of the compiler
Provides efficient implementation
Systematic techniques to implement lexical
analyzers by hand or automatically from
specifications
Stream buffering methods to scan input
Improves portability
Non-standard symbols and alternate character
encodings can be normalized

5
Interaction of the Lexical Analyzer with the
Parser
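[Diagram: the source program is read by the lexical analyzer. The parser issues "get next token" requests and receives (token, tokenval) pairs in return. Both components report errors and both access the symbol table.]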
6
Lexical Function

1. Read the input stream (sequence of
characters), group the characters into primitives
(tokens).
Returns each token as <type, value>.
2. Throw out certain sequences of characters
(blanks, comments, etc.).
3. Build the symbol table, string table, constant
table, etc.
4. Generate error messages.
5. Convert lexemes to values, for example string → integer.
Tokens are described using regular expressions

7
Scanner (Lexical Analyzer): Main task

read the characters of the source language (a
stream of characters)
and break it up into tokens, the smallest
meaningful units of the source language.
Other tasks
- remove the comments,
- perform conversions (if needed),
- remove the white space,
- interpret the compiler directives,
- prepare the listing of the source program

8
Tokens, Patterns, and Lexemes

A token is a classification of lexical units
For example id and num
Lexemes are the specific character strings that
make up a token
For example abc and 123
Patterns are rules describing the set of lexemes
belonging to a token
For example "letter followed by letters and
digits" and "non-empty sequence of digits"

9
Lexical Analysis Process
Preprocessed source code, read char by char
if (b == 0) a = b;
Lexical Analysis or Scanner
if | ( | b | == | 0 | ) | a | = | b | ;

Lexical analysis
- Transforms the multi-character input stream into a token stream
- Reduces the length of the program representation (removes spaces)
10
Lexemes

Lexemes are the lowest level syntactic units.
Example
val = (int)(xdot + y*0.3);
In the above statement, the lexemes are
val, =, (, int, ), (, xdot, +, y, *, 0.3, ), ;

11
Tokens

Tokens are the atomic units of a language, and
are usually specific strings or instances of
classes of strings
Tokens are the categories of lexemes.
Identifiers: names chosen by the programmer,
e.g. val, xdot, y
Keywords: names chosen by the language designer
to help syntax and structure,
e.g. int, return, void
(Keywords that cannot be used as identifiers are
known as RESERVED WORDS)

12
Tokens

A token is a sequence of characters that can be
treated as a unit in the grammar of a programming
language
A programming language classifies tokens into a
finite set of token types
Type     Examples
ID       foo  i  n
NUM      73  13
IF       if
COMMA    ,

13
Tokens (Contd.)

Integers: 2 1000 -20
Floating-point: 2.0 -0.010 .02
Symbols: @ << >>
Strings: "x" "He said, I love G52CMP"
Comments: /* Hi and Bye */

14
Tokens (Contd.)

Operators Identify actions.
, , !
Literals Denote values directly.
3.14, -10, a, true, null
Punctuation Symbols Supports syntactic
structure.
(, ), , ,

15
Tokens (Contd.)
Token types are usually divided into the
following classes (1) Variable-length tokens,
e.g. identifiers, numerical or string constants,
keywords (2) Fixed-length tokens (2a) simple
tokens, e.g., ,-, , , . . . (2b) compound
tokens e.g., lt, ! , , 46
16
Identifiers Vs. Keywords
Programming languages use fixed strings to
identify particular keywords, e.g., if, then,
else, etc. Since keywords look just like identifiers,
the Lexical Analyzer must distinguish between
these two possibilities. If keywords are
reserved (not used as identifiers), we can
initialize the symbol table with all the keywords
and mark them as such. Then a string is
recognized as an identifier only if it is not
already in the symbol table as a keyword.
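
The reserved-word scheme described above can be sketched in Python as follows. This is only an illustration: the names KEYWORDS, make_symbol_table and classify are hypothetical, and a plain dictionary stands in for the symbol table.

```python
# Hypothetical sketch: the symbol table is a dict from lexeme to token type.
KEYWORDS = ("if", "then", "else")

def make_symbol_table():
    # Pre-install every keyword and mark it as such.
    return {kw: "KEYWORD" for kw in KEYWORDS}

def classify(lexeme, symtab):
    """Return the token type of a name, installing new identifiers."""
    if lexeme not in symtab:
        symtab[lexeme] = "ID"   # first occurrence of an identifier
    return symtab[lexeme]

symtab = make_symbol_table()
print(classify("else", symtab))   # a reserved word
print(classify("elsex", symtab))  # an ordinary identifier
```

Because the keywords are installed before scanning starts, a single table lookup decides between the two token types.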
17
Token Structure (Example)
18
Tokens

Identifiers: x y11 elsex
Keywords: if else while for break
Integers: 2 1000 -20
Floating-point: 2.0 -0.0010 .02 1e5
Symbols: << < <=
Strings: "x" "He said, \"I love EECS 483\""
Comments: /* bla bla bla */

19
Semantic Values of Tokens

Semantic values are used to distinguish different
tokens in a token type
<ID, foo>, <ID, i>, <ID, n>
<NUM, 73>, <NUM, 13>
<IF, >
<COMMA, >
Token types affect syntax analysis and semantic
values affect semantic analysis

20
Attributes for Tokens
When a Token can be generated by different
Lexemes, the Lexical Analyzer must transmit the
Lexeme to the subsequent phases of the compiler.
Such information is specified as an Attribute
associated with the Token.
Usually, the attribute of a Token is a pointer to
the symbol table entry that keeps information
about the Token.
The Token influences parsing decisions: the
Parser relies on the token distinctions, e.g.,
identifiers are treated differently than keywords.
The Attribute influences the translation phase.
21
Attributes for Tokens An Example

Example. Let us consider the following assignment
statement
E = M * C ** 2
Then the following pairs (token, attribute) are
passed to the Parser
(id, pointer to symbol-table entry for E)
(assign-op,)
(id, pointer to symbol-table entry for M)
(mult-op,)
(id, pointer to symbol-table entry for C)
(exp-op,)
(num, integer value 2).
Some Tokens have a null attribute: the Token alone is
sufficient to identify the Lexeme.
From an implementation point of view, each
token is encoded as an integer number.

22
Semantic Values of Tokens

Example: in a line of Java
if (mark > 80) grade = A;
the tokens are
if  (  mark  >  80  )  grade  =  A  ;
The scanner is concerned with putting the tokens
together.
It does not check whether the tokens are in a
correct order

23
How to Describe Tokens

Use regular expressions to describe programming
language tokens!
A regular expression (RE) is defined inductively
a     an ordinary character stands for itself
ε     the empty string
R|S   either R or S (alternation), where R and S are REs
RS    R followed by S (concatenation)
R*    concatenation of R zero or more times (Kleene
closure)

24
Lexical Analysis, How?

First, write down the lexical specification (how
is each token defined?),
using regular expressions to specify the lexical
structure
identifier = letter (letter | digit | underscore)*
letter = a | ... | z | A | ... | Z
digit = 0 | 1 | ... | 9
Second, based on the above lexical specification,
build the lexical analyzer (to recognize tokens)
by hand:
Regular Expression Spec => NFA => DFA
=> Transition Table => Lexical Analyzer
Or just use lex, the lexical analyzer
generator:
Regular Expression Spec (in lex format) => feed
to lex => Lexical Analyzer

25
Scanner Generators
[Diagram: a scanner definition in a metalanguage is fed to the scanner generator, which produces a scanner. The scanner reads a program in the programming language and outputs token types and semantic values.]
26
Languages

An alphabet is a finite set of symbols.
Example: Σb = {0, 1}, the binary alphabet
A language is a set of strings
L1 = {00, 01, 10, 11}, all strings of length 2 over Σb
A string is a finite sequence of symbols taken
from a finite alphabet
10011 is a string over Σb

27
Specification of Patterns for Tokens Regular
Expressions

Basis symbols
ε is a regular expression denoting the language {ε}
a ∈ Σ is a regular expression denoting {a}
If r and s are regular expressions denoting
languages L(r) and M(s) respectively, then
r|s is a regular expression denoting L(r) ∪ M(s)
rs is a regular expression denoting L(r)M(s)
r* is a regular expression denoting L(r)*
(r) is a regular expression denoting L(r)
A language defined by a regular expression is
called a regular set

28
Specification of Patterns for Tokens Notational
Shorthand

The following shorthands are often used
r+ = rr*
r? = r|ε
[a-z] = a|b|c|...|z
Examples
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?

29
Languages

The C language is the (infinite) set of all
strings that constitute legal C programs
The language of C reserved words is the (finite)
set of all alphabetic strings that cannot be used
as identifiers in the C programs
Each token type is a language

30
Languages

Σ* denotes the set of all words over the alphabet
Σ.
|s| denotes the length of string s
ε denotes the word of length 0, the empty word.
∅ denotes the empty set, different from ε
Note 1: In language theory the terms sentence and
word are often used as synonyms for the term
string
Note 2: A language (over alphabet Σ) is a set of
strings (over alphabet Σ). For example, for Σ = {a},
one possible language is L = {ε, a, aa, aaa}.

31
Examples

Length of a string
Example
|10011| = 5
|WHILE| = 5
|WHILE| = 1 (when WHILE is taken as a single symbol)
Empty string
|ε| = 0
Concatenation
Two strings x and y are joined together: x·y = xy
Example
x = AB, y = CDE produce xy = ABCDE
|xy| = |x| + |y|
xy ≠ yx (not commutative)
εx = xε = x
String exponentiation
x^0 = ε
x^1 = x
x^2 = xx
x^n = x x^(n-1), n > 1

32
Terms for parts of a string

TERM: DEFINITION
prefix of s: A string obtained by removing zero
or more trailing symbols of string s; ban
is a prefix of banana.
suffix of s: A string formed by deleting zero or
more of the leading symbols of s; nana is
a suffix of banana.
substring of s: A string obtained by deleting a
prefix and a suffix from s; nan is a
substring of banana. Every prefix and every
suffix of s is a substring of s, but not every
substring of s is a prefix or a suffix of s. For
every string s, both s and ε are prefixes,
suffixes, and substrings of s.
proper prefix, suffix, or substring of s: Any
nonempty string x that is, respectively, a
prefix, suffix, or substring of s such that s ≠ x.
subsequence of s: Any string formed by deleting
zero or more not necessarily contiguous
symbols from s; baaa is a subsequence of
banana.

33
Terms for parts of a string (examples)

Let us take the string banana
prefix: ε, b, ba, ban, ..., banana
suffix: ε, a, na, ana, ..., banana
substring: ε, b, a, n, ba, an, na, ..., banana
subsequence: ε, b, a, n, ba, bn, an, aa, na,
nn, ..., banana
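
The definitions above can be checked mechanically. The following Python sketch is illustrative only; the helper names prefixes, suffixes, substrings and is_subsequence are made up for this example and the empty Python string plays the role of ε.

```python
def prefixes(s):
    # Remove zero or more trailing symbols.
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    # Remove zero or more leading symbols.
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    # Delete a prefix and a suffix: every slice s[i:j].
    n = len(s)
    return {s[i:j] for i in range(n + 1) for j in range(i, n + 1)}

def is_subsequence(x, s):
    # Symbols of x must appear in s in order, not necessarily contiguously.
    it = iter(s)
    return all(ch in it for ch in x)

print("ban" in prefixes("banana"))       # prefix
print("nana" in suffixes("banana"))      # suffix
print("nan" in substrings("banana"))     # substring
print(is_subsequence("baaa", "banana"))  # subsequence but not a substring
```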

34
Example

L is the set {A, B, ..., Z, a, b, ..., z}
and D the set {0, 1, ..., 9}. Since a symbol
can be regarded as a string of length one, the
sets L and D are each finite languages. The
following are some examples of new languages
created from L and D
1. L ∪ D is the set of letters and digits.
2. LD is the set of strings consisting of a
letter followed by a digit.
3. L^4 is the set of all four-letter strings.
4. L* is the set of all strings of letters,
including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters
and digits beginning with a letter.
6. D+ is the set of all strings of one or more
digits.

35
How to Break up Text
Input: elsex = 0;

Option 1: else | x | = | 0 | ;
Option 2: elsex | = | 0 | ;

REs alone are not enough; we need a rule for choosing
when there are multiple matches
Longest matching token wins
Ties in length are resolved by priorities
Token specification order often defines priority
REs + priorities + longest matching token rule
= definition of a lexer
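
A minimal sketch of the longest-match rule with priorities, written in Python with the standard re module. The rule list RULES and the token names are assumptions chosen for this example and are not taken from the slides. Every rule is tried at the current position, the longest match wins, and the earlier rule wins a tie.

```python
import re

# Rule order encodes priority: earlier rules win ties in length.
RULES = [
    ("ELSE", r"else"),
    ("ID", r"[a-z][a-z0-9]*"),
    ("NUM", r"[0-9]+"),
    ("ASSIGN", r"="),
    ("SEMI", r";"),
    ("WS", r"[ \t\n]+"),
]
COMPILED = [(name, re.compile(pat)) for name, pat in RULES]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, rx in COMPILED:
            m = rx.match(text, pos)
            # Strict '>' keeps the earlier (higher-priority) rule on ties.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"illegal character at position {pos}")
        if best_name != "WS":            # white space is discarded
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("elsex = 0;"))   # elsex is one ID: the longest match wins
print(tokenize("else x = 0;"))  # else is ELSE: a tie resolved by priority
```

Trying every rule at every position is simple but slow; a generated scanner would get the same behavior from a single combined DFA.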

36
Regular Expressions (RE)

A language allows us to use a finite description
to specify a (possibly infinite) set
RE is the metalanguage used to define the token
types of a programming language

37
Regular Expressions

ε is a RE denoting L = {ε}
If a ∈ alphabet, then a is a RE denoting L = {a}
Suppose r and s are REs denoting L(r) and L(s)
alternation: (r) | (s) is a RE denoting L(r) ∪
L(s)
concatenation: (r) (s) is a RE denoting
L(r)L(s)
repetition: (r)* is a RE denoting (L(r))*
(r) is a RE denoting L(r)

38
Examples

a | b denotes {a, b}
(a | b)(a | b) denotes {aa, ab, ba, bb}
a* denotes {ε, a, aa, aaa, ...}
(a | b)* denotes the set of all strings of a's and b's
a | a*b denotes the set containing the string a and
all strings consisting of zero or more a's
followed by a b

39
Specification of Patterns for Tokens Regular
Definitions

Regular definitions introduce a naming
convention
d1 → r1
d2 → r2
...
dn → rn
where each ri is a regular expression over
Σ ∪ {d1, d2, ..., di-1}
Any dj in ri can be textually substituted in ri
to obtain an equivalent set of definitions

40
Specification of Patterns for Tokens Regular
Definitions

Example
letter → A|B|...|Z|a|b|...|z
digit → 0|1|...|9
id → letter ( letter|digit )*
Regular definitions are not recursive
digits → digit digits|digit   wrong!

41
Regular Definitions Examples
Example 1. Identifiers are usually strings of
letters and digits beginning with a letter
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter (letter | digit)*
42
Regular Definitions Examples (Cont.)
Example 2. Numbers are usually strings such as
5230, 3.14, 6.45E4, 1.84E-4.
digit → 0 | 1 | ... | 9
digits → digit+
optional-fraction → (. digits)?
optional-exponent → (E (+ | -)? digits)?
num → digits optional-fraction optional-exponent
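
As a rough check of Example 2, the regular definitions can be transcribed into Python's re syntax by textual substitution. The variable names and the helper is_num are assumptions made for this sketch.

```python
import re

# Each definition is substituted textually into the ones that follow it.
DIGITS = r"[0-9]+"                           # digits
OPTIONAL_FRACTION = rf"(?:\.{DIGITS})?"      # optional-fraction
OPTIONAL_EXPONENT = rf"(?:E[+-]?{DIGITS})?"  # optional-exponent
NUM = re.compile(DIGITS + OPTIONAL_FRACTION + OPTIONAL_EXPONENT)

def is_num(s):
    # fullmatch requires the whole string to be one num lexeme.
    return NUM.fullmatch(s) is not None

for lexeme in ["5230", "3.14", "6.45E4", "1.84E-4", "3.", ".5"]:
    print(lexeme, is_num(lexeme))
```

Under this definition a number needs at least one digit on each side of the decimal point, so 3. and .5 are rejected.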
43
Regular Definitions and Grammars
Grammar
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num
Regular definitions
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
44
Finite Automata

A finite automaton is a finite-state transition
diagram that can be used to model the recognition
of a token type specified by a regular expression
A finite automaton can be a nondeterministic
finite automaton or a deterministic finite
automaton

45
Nondeterministic Finite Automata (NFA)

An NFA consists of
A finite set of states
A finite set of input symbols
A transition function that maps (state, symbol)
pairs to sets of states
A state distinguished as start state
A set of states distinguished as final states

46
Nondeterministic Finite Automata

An NFA is a 5-tuple (S, Σ, δ, s0, F) where
S is a finite set of states
Σ is a finite set of symbols, the alphabet
δ is a mapping from S × Σ to a set of states
s0 ∈ S is the start state
F ⊆ S is the set of accepting (or final) states

47
Transition Graph

An NFA can be diagrammatically represented by a
labeled directed graph called a transition graph

S = {0, 1, 2, 3}, Σ = {a, b}, s0 = 0, F = {3}
[Transition graph: state 0 loops on a and b; 0 goes to 1 on a, 1 to 2 on b, and 2 to 3 on b.]
48
Transition Table

The mapping ? of an NFA can be represented in a
transition table

δ(0,a) = {0,1}
δ(0,b) = {0}
δ(1,b) = {2}
δ(2,b) = {3}
49
The Language Defined by an NFA

An NFA accepts an input string x if and only if
there is some path with edges labeled with
symbols from x in sequence from the start state
to some accepting state in the transition graph
A state transition from one state to another on
the path is called a move
The language defined by an NFA is the set of
input strings it accepts, such as (a|b)*abb for
the example NFA

50
An Example
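RE: (a | b)*abb
States: {1, 2, 3, 4}
Input symbols: {a, b}
Transition function: (1,a) = {1,2}, (1,b) = {1},
(2,b) = {3}, (3,b) = {4}
Start state: 1
Final state: {4}
[Transition diagram: state 1 loops on a and b; 1 goes to 2 on a, 2 to 3 on b, and 3 to 4 on b.]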
51
Another Example

RE: aa* | bb*
States: {1, 2, 3, 4, 5}
Input symbols: {a, b}
Transition function: (1, ε) = {2, 4}, (2, a) = {3},
(3, a) = {3}, (4, b) = {5}, (5, b) = {5}
Start state: 1
Final states: {3, 5}

52
Acceptance of NFA

An NFA accepts an input string s iff there is
some path in the finite-state transition diagram
from the start state to some final state such
that the edge labels along this path spell out s
The language recognized by an NFA is the set of
strings it accepts

53
An Example
[Transition diagram: the NFA for (a | b)*abb on input aabb. One accepting path is 1 to 1 on a, 1 to 2 on a, 2 to 3 on b, 3 to 4 on b.]
54
Finite-State Transition Diagram
[Transition diagram: the NFA for aa* | bb* (states 1 to 5) on input aaa.]
55
Operations on NFA states

ε-closure(S): the set of states reachable from some
state s in S on ε-transitions alone
move(S, c): the set of states to which there is a
transition on input symbol c from some state s in
S

56
An Example
NFA: aa* | bb*, input: aaa
S0 = {1}
S1 = ε-closure({1}) = {1,2,4}
S2 = move({1,2,4}, a) = {3}
S3 = ε-closure({3}) = {3}
S4 = move({3}, a) = {3}
S5 = ε-closure({3}) = {3}
S6 = move({3}, a) = {3}
S7 = ε-closure({3}) = {3}
3 is in {3, 5} → accept
57
Simulating an NFA
Input: An input string ended with eof and an NFA
with start state s0 and final states F.
Output: The answer "yes" if the NFA accepts, "no" otherwise.
begin
  S := ε-closure({s0})
  c := nextchar
  while c <> eof do begin
    S := ε-closure(move(S, c))
    c := nextchar
  end
  if S ∩ F <> ∅ then return "yes"
  else return "no"
end.
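
The simulation above can be sketched in Python using the aa* | bb* NFA from the earlier example. The dictionary encoding DELTA, the use of the empty string as the ε label, and the function names are choices made for this illustration.

```python
EPS = ""  # label used here for ε-transitions

# NFA for aa* | bb*: (state, symbol) -> set of next states
DELTA = {
    (1, EPS): {2, 4},
    (2, "a"): {3}, (3, "a"): {3},
    (4, "b"): {5}, (5, "b"): {5},
}
START, FINAL = 1, {3, 5}

def eps_closure(states):
    # States reachable from `states` on ε-transitions alone.
    stack, closure = list(states), set(states)
    while stack:
        t = stack.pop()
        for u in DELTA.get((t, EPS), ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

def move(states, c):
    # States reachable from some state in `states` on symbol c.
    result = set()
    for s in states:
        result |= DELTA.get((s, c), set())
    return result

def accepts(text):
    S = eps_closure({START})
    for c in text:                 # end of string plays the role of eof
        S = eps_closure(move(S, c))
    return bool(S & FINAL)         # S ∩ F non-empty means "yes"

print(accepts("aaa"), accepts("bb"), accepts("ab"))
```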
58
Computation of ?-closure
[Transition diagram: the NFA for (a | b)*abb with states 1 to 11.]
ε-closure({1}) = {1,2,3,5,8}
ε-closure({4}) = {2,3,4,5,7,8}
59
Computation of ?-closure
Input: An NFA and a set of NFA states S.
Output: T = ε-closure(S).
begin
  push all states in S onto stack
  T := S
  while stack is not empty do begin
    pop t, the top element, off of stack
    for each state u with an edge from t to u labeled ε do
      if u is not in T then begin
        add u to T
        push u onto stack
      end
  end
  return T
end.
60
Deterministic Finite Automata (DFA)

A DFA is a special case of an NFA in which
no state has an ε-transition
for each state s and input symbol a, there is at
most one edge labeled a leaving s

61
An Example

RE: (a | b)*abb
States: {1, 2, 3, 4}
Input symbols: {a, b}
Transition function: (1,a) = 2, (2,a) = 2, (3,a) = 2, (4,a) = 2,
(1,b) = 1, (2,b) = 3, (3,b) = 4, (4,b) = 1
Start state: 1
Final state: {4}

62
Deterministic Finite Automata

A deterministic finite automaton is a special
case of an NFA
No state has an ε-transition
For each state s and input symbol a there is at
most one edge labeled a leaving s
Each entry in the transition table is a single
state
At most one path exists to accept a string
Simulation algorithm is simple

63
NFA and DFA

NFA (Nondeterministic Finite Automaton)
empty moves (reading ε) with state change are
possible
ambiguous state transitions are possible
an NFA accepts an input string if there exists a
computation (i.e., a sequence of state
transitions) that leads to accept and halt
DFA (Deterministic Finite Automaton)
no ε-transitions, no ambiguous transitions (δ is
a function)
a special case of an NFA
The main difference is a space vs. time tradeoff
DFAs are faster than NFAs
DFAs can be bigger than NFAs (exponentially larger in the worst case)

64
Example DFA
A DFA that accepts (a|b)*abb
[Transition diagram: a four-state DFA with states 0 to 3, start state 0, and edges labeled a and b.]
65
Acceptance of DFA

A DFA accepts an input string s iff there is one
path in the finite-state transition diagram from
the start state to some final state such that the
edge labels along this path spell out s
The language recognized by a DFA is the set of
strings it accepts

66
An Example
[Transition diagram: the DFA for (a | b)*abb on input aabb.]
67
An Example
Input: bbababb
s = 1
s = move(1, b) = 1
s = move(1, b) = 1
s = move(1, a) = 2
s = move(2, b) = 3
s = move(3, a) = 2
s = move(2, b) = 3
s = move(3, b) = 4
4 is in {4} → accept
68
Finite-State Transition Diagram
69
Simulating a DFA
Input: An input string ended with eof and a DFA
with start state s0 and final states F.
Output: The answer "yes" if the DFA accepts, "no" otherwise.
begin
  s := s0
  c := nextchar
  while c <> eof do begin
    s := move(s, c)
    c := nextchar
  end
  if s is in F then return "yes"
  else return "no"
end.
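
A table-driven version of this loop might look as follows in Python, using the four-state DFA for (a | b)*abb given earlier. The names DTRAN and dfa_accepts are illustrative, and rejecting on a missing table entry is an assumption of this sketch (the pseudocode does not say what happens when no edge exists).

```python
# Transition table of the DFA for (a | b)*abb: (state, symbol) -> state
DTRAN = {
    (1, "a"): 2, (2, "a"): 2, (3, "a"): 2, (4, "a"): 2,
    (1, "b"): 1, (2, "b"): 3, (3, "b"): 4, (4, "b"): 1,
}
START, FINAL = 1, {4}

def dfa_accepts(text):
    s = START
    for c in text:                  # end of string plays the role of eof
        if (s, c) not in DTRAN:     # no edge: reject (assumed behavior)
            return False
        s = DTRAN[(s, c)]           # exactly one next state per step
    return s in FINAL

print(dfa_accepts("bbababb"))  # the trace from the example slide
print(dfa_accepts("abba"))
```

Each input character costs one table lookup, which is why DFA simulation is faster than tracking a set of NFA states.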
70
Scanner Generators
71
From a RE to an NFA

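Thompson's construction algorithm
For ε, construct an NFA with start state i, final state f, and an ε-edge from i to f
For a in the alphabet, construct an NFA with start state i, final state f, and an edge labeled a from i to f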
72
From a RE to an NFA

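Suppose N(s) and N(t) are NFAs for REs s and t
for s | t, construct a new start state i with ε-edges to the start states of N(s) and N(t), and ε-edges from their final states to a new final state f
for s t, construct N(s) followed by N(t): the start state of N(s) is the start state, the final state of N(s) is merged with the start state of N(t), and the final state of N(t) is the final state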
73
From a RE to an NFA

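for s*, construct a new start state i and a new final state f, with ε-edges from i to the start state of N(s) and to f, and from the final state of N(s) back to its start state and to f
for (s), use N(s)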
74
An Example
(a | b)*abb
75
From an NFA to a DFA
Subset construction algorithm.
Input: An NFA N.
Output: A DFA D with states Dstates and
transition table Dtran.
begin
  add ε-closure({s0}) as an unmarked state to Dstates
  while there is an unmarked state T in Dstates do begin
    mark T
    for each input symbol a do begin
      U := ε-closure(move(T, a))
      if U is not in Dstates then
        add U as an unmarked state to Dstates
      Dtran[T, a] := U
    end
  end
end.
76
An Example
[Transition diagram: the NFA for (a | b)*abb with states 1 to 11.]
77
An Example
ε-closure({1}) = {1,2,3,5,8} = A
ε-closure(move(A, a)) = ε-closure({4,9}) = {2,3,4,5,7,8,9} = B
ε-closure(move(A, b)) = ε-closure({6}) = {2,3,5,6,7,8} = C
ε-closure(move(B, a)) = ε-closure({4,9}) = B
ε-closure(move(B, b)) = ε-closure({6,10}) = {2,3,5,6,7,8,10} = D
ε-closure(move(C, a)) = ε-closure({4,9}) = B
ε-closure(move(C, b)) = ε-closure({6}) = C
ε-closure(move(D, a)) = ε-closure({4,9}) = B
ε-closure(move(D, b)) = ε-closure({6,11}) = {2,3,5,6,7,8,11} = E
ε-closure(move(E, a)) = ε-closure({4,9}) = B
ε-closure(move(E, b)) = ε-closure({6}) = C
78
An Example
State                   Input symbol
                        a    b
A = {1,2,3,5,8}         B    C
B = {2,3,4,5,7,8,9}     B    D
C = {2,3,5,6,7,8}       B    C
D = {2,3,5,6,7,8,10}    B    E
E = {2,3,5,6,7,8,11}    B    C
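
The table can be reproduced with a short Python sketch of the subset construction. The edge list NFA below is an assumption: the diagram did not survive in this transcript, so the edges were reconstructed to agree with the ε-closures and moves listed in the worked example. The function names are likewise illustrative.

```python
EPS = ""  # label used here for ε-transitions

# Assumed edges of the 11-state NFA for (a | b)*abb (reconstructed).
NFA = {
    (1, EPS): {2, 8}, (2, EPS): {3, 5},
    (3, "a"): {4}, (5, "b"): {6},
    (4, EPS): {7}, (6, EPS): {7}, (7, EPS): {2, 8},
    (8, "a"): {9}, (9, "b"): {10}, (10, "b"): {11},
}
START, ALPHABET = 1, "ab"

def eps_closure(states):
    stack, closure = list(states), set(states)
    while stack:
        t = stack.pop()
        for u in NFA.get((t, EPS), ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return frozenset(closure)      # hashable, so usable as a DFA state

def move(states, c):
    result = set()
    for s in states:
        result |= NFA.get((s, c), set())
    return result

def subset_construction():
    start = eps_closure({START})
    dstates, unmarked, dtran = [start], [start], {}
    while unmarked:
        T = unmarked.pop()         # "mark" T by removing it from the list
        for a in ALPHABET:
            U = eps_closure(move(T, a))
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    return dstates, dtran

dstates, dtran = subset_construction()
names = {S: chr(ord("A") + i) for i, S in enumerate(dstates)}
for S in dstates:
    print(names[S], sorted(S), names[dtran[(S, "a")]], names[dtran[(S, "b")]])
```

In this NFA every reachable state set has a transition on both symbols, so no empty dead state appears; a more general version would need to handle an empty U.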
79
An Example
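[Transition diagram: the resulting DFA with states A to E; A is the start state and E is the accepting state.]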
80
Subset Construction Example 2
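[Transition diagram: combined NFA with states 0 to 8. Start state 0 has ε-edges to states 1, 3, and 7. Accepting states are 2 (a1), 6 (a2), and 8 (a3).]
Dstates
A = {0,1,3,7}
B = {2,4,7}
C = {8}
D = {7}
E = {5,8}
F = {6,8}
[Transition diagram: the resulting DFA with start state A. Accepting states are B (a1), C (a3), E (a3), and F (a2 a3).]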
81
Combined Finite Automata
[Transition diagrams: one automaton per token type.]
IF: if (states 1 to 3, edges labeled i and f)
ID: [a-z][a-z0-9]* (states 1 and 2)
REAL: ([0-9]+.[0-9]*) | ([0-9]*.[0-9]+) (states 1 to 5)
82
Combined Finite Automata
[Transition diagram: the combined NFA. A new start state 1 has ε-edges to the start states of the IF, ID, and REAL automata (states 2 to 11).]
83
Combined Finite Automata
[Transition diagram: the equivalent DFA with states 1 to 8; accepting states are labeled IF, ID, or REAL.]
84
Combining the NFAs of a Set of Regular Expressions
a     { action1 }
abb   { action2 }
a*b+  { action3 }
[Transition diagrams: one NFA per pattern (states 1-2, 3-6, and 7-8), followed by the combined NFA in which a new start state 0 has ε-edges to states 1, 3, and 7.]
85
Flex A Scanner Generator
A language for specifying scanners
lang.l → Flex compiler → lex.yy.c
lex.yy.c → C compiler (-lfl) → a.out
source code → a.out → tokens
86
Coding Regular Definitions in Transition Diagrams
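relop → < | <= | <> | > | >= | =
[Transition diagram for relop: from start state 0, < leads to state 1. From 1, = leads to state 2, return(relop, LE); > leads to state 3, return(relop, NE); other leads to state 4, return(relop, LT). From 0, = leads to state 5, return(relop, EQ). From 0, > leads to state 6. From 6, = leads to state 7, return(relop, GE); other leads to state 8, return(relop, GT).]
id → letter ( letter | digit )*
[Transition diagram for id: from start state 9, letter leads to state 10, which loops on letter or digit; other leads to state 11, return(gettoken(), install_id()).]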
87
Flex Programs
%{
auxiliary declarations
%}
regular definitions
%%
translation rules
%%
auxiliary procedures
88
Translation Rules
P1   { action1 }
P2   { action2 }
...
Pn   { actionn }
where Pi are regular expressions and actioni are
C program segments
89
The Lex and Flex Scanner Generators

Lex and its newer cousin flex are scanner
generators
Systematically translate regular definitions into
C source code for efficient scanning
Generated code is easy to integrate in C
applications
