Title: CS-338 Compiler Design
- Dr. Syed Noman Hasany
- Assistant Professor
- College of Computer, Qassim University
Chapter 3: Lexical Analyzer
- The Role of the Lexical Analyzer
- It is the first phase of the compiler.
- It reads the input characters and produces as output a sequence of tokens that the parser uses for syntax analysis.
- It strips comments and white space (blanks, tabs, and newline characters) from the source program.
- It also correlates error messages from the compiler with the source program, because it keeps track of line numbers.
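
The stripping and line-counting duties above can be sketched in C. This is a minimal illustration, not the course's implementation: the function name skip_blanks_and_comments and the choice of C-style block comments are assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical helper: starting at src[pos], skip blanks, tabs, newlines
   and C-style block comments. *line is incremented for every newline
   skipped, so later error messages can report a line number.
   Returns the index of the first character that can start a token. */
size_t skip_blanks_and_comments(const char *src, size_t pos, int *line)
{
    for (;;) {
        char c = src[pos];
        if (c == ' ' || c == '\t') {
            pos++;
        } else if (c == '\n') {
            (*line)++;
            pos++;
        } else if (c == '/' && src[pos + 1] == '*') {
            pos += 2;                       /* skip the comment opener */
            while (src[pos] != '\0' &&
                   !(src[pos] == '*' && src[pos + 1] == '/')) {
                if (src[pos] == '\n')
                    (*line)++;              /* comments may span lines */
                pos++;
            }
            if (src[pos] != '\0')
                pos += 2;                   /* skip the comment closer */
        } else {
            return pos;                     /* start of the next lexeme, or end of input */
        }
    }
}
```

Calling it with line set to 1 before scanning keeps the counter in step with the source text; an unterminated comment simply stops at the end of the input.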
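Interaction of the Lexical Analyzer with the Parser
[Figure: the source program feeds the lexical analyzer; the parser issues "get next token" requests and the lexical analyzer answers with (token, tokenval) pairs; both components report errors and both access the symbol table.]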
The Reason Why Lexical Analysis is a Separate Phase
- Simplifies the design of the compiler
  - Without it, LL(1) or LR(1) parsing with 1 token of lookahead would not be possible (multiple characters/tokens to match)
- Provides efficient implementation
  - Systematic techniques to implement lexical analyzers by hand or automatically from specifications
  - Stream buffering methods to scan input
- Improves portability
  - Non-standard symbols and alternate character encodings can be normalized (e.g. trigraphs)
Attributes of Tokens
Input to the lexical analyzer: y := 31 + 28*x
Output to the parser: <id, "y"> <assign, > <num, 31> <+, > <num, 28> <*, > <id, "x">
In each pair, the first component is the token and the second is tokenval (the token attribute).
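
One way to picture a token together with its attribute is a small C record. The sketch below assumes the slide's example statement is y := 31 + 28*x; the type and constant names (Token, TokenKind, TOK_ID, and so on) are hypothetical, and a real lexer would usually store a symbol-table entry rather than the raw name.

```c
#include <stddef.h>

/* Hypothetical token classes for the example statement. */
typedef enum { TOK_ID, TOK_ASSIGN, TOK_NUM, TOK_PLUS, TOK_TIMES } TokenKind;

typedef struct {
    TokenKind token;        /* the token (classification) */
    union {
        const char *name;   /* attribute of an id: here simply its lexeme */
        int value;          /* attribute of a num: its numeric value */
    } tokenval;             /* unused (0) for operators */
} Token;

/* The stream a lexer might hand to the parser for: y := 31 + 28*x */
static const Token example_stream[] = {
    { TOK_ID,     { .name  = "y" } },
    { TOK_ASSIGN, { .value = 0   } },
    { TOK_NUM,    { .value = 31  } },
    { TOK_PLUS,   { .value = 0   } },
    { TOK_NUM,    { .value = 28  } },
    { TOK_TIMES,  { .value = 0   } },
    { TOK_ID,     { .name  = "x" } },
};

static const size_t example_stream_len =
    sizeof example_stream / sizeof example_stream[0];
```

The parser only needs the token field to check syntax; tokenval travels along for later phases.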
Tokens, Patterns, and Lexemes
- A token is a classification of lexical units
  - For example: id and num
- Lexemes are the specific character strings that make up a token
  - For example: abc and 123
- Patterns are rules describing the set of lexemes belonging to a token
  - For example: "letter followed by letters and digits" and "non-empty sequence of digits"
Tokens, Patterns, and Lexemes (continued)
- A lexeme is a sequence of characters from the source program that is matched by a pattern for a token.
[Table with columns: Token, Lexeme, Pattern]
3.2 Input Buffering
- Examining ways of speeding up the reading of the source program
- With a single-buffer technique, the lexeme currently being processed is overwritten when the buffer is reloaded.
- A two-buffer scheme handles large lookahead safely.
3.2.1 Buffer Pairs
- Two buffers of the same size, say 4096, are alternately reloaded.
- Two pointers to the input are maintained:
  - Pointer lexeme_Begin marks the beginning of the current lexeme.
  - Pointer forward scans ahead until a pattern match is found.
if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;
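
The advance logic above can be tried out in C. The following is a sketch under stated assumptions: the "file" is an in-memory string, each half holds only 4 characters so the reloads are easy to observe, and the names Scanner, reload, scanner_init and next_char are invented for the example. The lexeme_Begin pointer is left out to keep the focus on forward.

```c
#include <stddef.h>
#include <string.h>

#define HALF 4   /* tiny half size; the slides suggest something like 4096 */

typedef struct {
    const char *src;       /* hypothetical in-memory source text */
    size_t src_len;
    size_t src_pos;        /* next unread position in src */
    char buf[2 * HALF];    /* the buffer pair */
    size_t len[2];         /* number of valid characters in each half */
    size_t forward;        /* index in buf of the next character to read */
} Scanner;

/* Fill one half (0 or 1) with up to HALF characters from the source. */
static void reload(Scanner *s, int half)
{
    size_t n = s->src_len - s->src_pos;
    if (n > HALF)
        n = HALF;
    memcpy(s->buf + half * HALF, s->src + s->src_pos, n);
    s->src_pos += n;
    s->len[half] = n;
}

void scanner_init(Scanner *s, const char *text)
{
    s->src = text;
    s->src_len = strlen(text);
    s->src_pos = 0;
    s->forward = 0;
    s->len[1] = 0;
    reload(s, 0);
}

/* Return the next character, or -1 at end of input. */
int next_char(Scanner *s)
{
    int half = (int)(s->forward / HALF);
    size_t offset = s->forward % HALF;
    if (offset >= s->len[half])
        return -1;                          /* this half ran out: end of input */
    int c = (unsigned char)s->buf[s->forward];
    if (s->forward == HALF - 1) {           /* at end of first half */
        reload(s, 1);
        s->forward++;
    } else if (s->forward == 2 * HALF - 1) { /* at end of second half */
        reload(s, 0);
        s->forward = 0;                     /* wrap to beginning of first half */
    } else {
        s->forward++;
    }
    return c;
}
```

Note that every call performs the end-of-half tests in addition to reading the character, which is the cost that sentinels are meant to reduce.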
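3.2.2 Sentinels
- Each buffer half is extended with a sentinel eof character, so that the end-of-buffer test is combined with the test for the current character.
[Figure: buffer pair holding "E = M *" followed by eof in the first half, and "C * * 2" followed by eof (end of input) in the second half, which itself ends with an eof sentinel.]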
forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer marks the end of the input */
        terminate lexical analysis
end
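
A C sketch of the sentinel variant follows. The assumptions are loud ones: the NUL character stands in for eof (so it must never occur in the source text), the source is an in-memory string, each half holds only 4 characters, and the names Scanner, reload, scanner_init and next_char are hypothetical. Unlike the pseudocode, this version reads the character first and advances afterwards, which is equivalent for the purpose of the illustration.

```c
#include <stddef.h>
#include <string.h>

#define HALF 4
#define SENTINEL '\0'   /* assumption: NUL plays the role of eof */

typedef struct {
    const char *src;            /* hypothetical in-memory source text */
    size_t src_len;
    size_t src_pos;
    char buf[2 * (HALF + 1)];   /* each half has one extra slot for the sentinel */
    size_t forward;             /* index in buf of the next character to read */
} Scanner;

/* Fill one half and terminate it with the sentinel. If the input runs
   out, the sentinel lands before the end of the half. */
static void reload(Scanner *s, int half)
{
    char *base = s->buf + half * (HALF + 1);
    size_t n = s->src_len - s->src_pos;
    if (n > HALF)
        n = HALF;
    memcpy(base, s->src + s->src_pos, n);
    s->src_pos += n;
    base[n] = SENTINEL;
}

void scanner_init(Scanner *s, const char *text)
{
    s->src = text;
    s->src_len = strlen(text);
    s->src_pos = 0;
    s->forward = 0;
    reload(s, 0);
}

/* Return the next character, or -1 at end of input. */
int next_char(Scanner *s)
{
    for (;;) {
        char c = s->buf[s->forward];
        if (c != SENTINEL) {                 /* common case: a single test */
            s->forward++;
            return (unsigned char)c;
        }
        if (s->forward == HALF) {            /* sentinel at end of first half */
            reload(s, 1);
            s->forward = HALF + 1;
        } else if (s->forward == 2 * HALF + 1) { /* sentinel at end of second half */
            reload(s, 0);
            s->forward = 0;
        } else {
            return -1;                       /* sentinel inside a half: end of input */
        }
    }
}
```

For ordinary characters only one comparison is made per call; the position checks run only when a sentinel is actually seen.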
Specification of Patterns for Tokens: Definitions
- An alphabet Σ is a finite set of symbols (characters)
- A string s is a finite sequence of symbols from Σ
  - |s| denotes the length of string s
  - ε denotes the empty string, thus |ε| = 0
- A language is a specific set of strings over some fixed alphabet Σ
Specification of Patterns for Tokens: String Operations
- The concatenation of two strings x and y is denoted by xy
- The exponentiation of a string s is defined by
  s^0 = ε (the empty string: a string of length zero)
  s^i = s^(i-1) s for i > 0
  Note that sε = εs = s
Specification of Patterns for Tokens: Language Operations
- Union: L ∪ M = {s | s ∈ L or s ∈ M}
- Concatenation: LM = {xy | x ∈ L and y ∈ M}
- Exponentiation: L^0 = {ε}; L^i = L^(i-1) L
- Kleene closure: L* = ∪_{i=0,…,∞} L^i
- Positive closure: L+ = ∪_{i=1,…,∞} L^i
Language Operations: Examples
L = {A, B, C, D}    D = {1, 2, 3}
L ∪ D = {A, B, C, D, 1, 2, 3}
LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3}
L^2 = {AA, AB, AC, AD, BA, BB, BC, BD, CA, …, DD}
L^4 = L^2 L^2 = ??
L* = {all possible strings of L plus ε}
L+ = L* − {ε}
L (L ∪ D) = ??
L (L ∪ D)* = ??
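
Concatenation of two finite languages can be enumerated mechanically. The sketch below assumes the example sets are L = {A, B, C, D} and D = {1, 2, 3}; the function name concat_languages and the fixed MAX_LEN are choices made for this illustration. Duplicates are not removed, which does not matter for these particular sets.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LEN 16   /* assumed upper bound on the length of xy, plus terminator */

/* Write every string xy with x in L and y in M into out.
   out must have room for nL * nM entries. Returns the number written. */
size_t concat_languages(const char *L[], size_t nL,
                        const char *M[], size_t nM,
                        char out[][MAX_LEN])
{
    size_t k = 0;
    for (size_t i = 0; i < nL; i++) {
        for (size_t j = 0; j < nM; j++) {
            snprintf(out[k], MAX_LEN, "%s%s", L[i], M[j]);
            k++;
        }
    }
    return k;
}
```

With the two example sets the function produces 4 × 3 = 12 strings, from A1 through D3, in the same order as the listing of LD above.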
Specification of Patterns for Tokens: Regular Expressions
- Basis symbols:
  - ε is a regular expression denoting language {ε}
  - a ∈ Σ is a regular expression denoting {a}
- If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
  - r|s is a regular expression denoting L(r) ∪ M(s)
  - rs is a regular expression denoting L(r)M(s)
  - r* is a regular expression denoting L(r)*
  - (r) is a regular expression denoting L(r)
- A language defined by a regular expression is called a regular set
Examples
- Let Σ = {a, b}
  - a | b
  - (a | b)(a | b)
  - a*
  - (a | b)*
  - a | a*b
- We assume that * has the highest precedence and is left associative, concatenation has the second-highest precedence and is left associative, and | has the lowest precedence and is left associative
  - (a) | ((b)*(c)) = a | b*c
Algebraic Properties of Regular Expressions
Finite Automaton
- Given an input string, we need a machine that has a regular expression hard-coded in it and can tell whether the input string matches the pattern described by the regular expression or not.
- A machine that determines whether a given string belongs to a language is called a finite automaton.
Deterministic Finite Automaton
- Definition: Deterministic Finite Automaton
  - a five-tuple (Σ, S, δ, s0, F) where
    - Σ is the alphabet
    - S is the set of states
    - δ is the transition function (S × Σ → S)
    - s0 is the starting state
    - F is the set of final states (F ⊆ S)
- Notation:
  - Use a transition diagram to describe a DFA
  - States are nodes, transitions are directed, labeled edges, some states are marked as final, one state is marked as starting
- If the automaton stops at a final state on end of input, then the input string belongs to the language.
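
The five-tuple maps naturally onto a transition table. The C sketch below is an illustration under assumptions: it encodes an automaton for a(a|b) with states 1 to 3, start state 1 and final state 3, and it uses an extra state 0 to mean "no transition defined", since the example diagrams leave some transitions out. The names delta, is_final and dfa_accepts are hypothetical.

```c
#include <stdbool.h>

#define NUM_STATES  4   /* states 1..3, plus 0 for "no transition" */
#define NUM_SYMBOLS 2   /* column 0 is 'a', column 1 is 'b' */

/* Transition function delta: delta[state][symbol] gives the next state. */
static const int delta[NUM_STATES][NUM_SYMBOLS] = {
    /* state 0 */ {0, 0},
    /* state 1 */ {2, 0},   /* delta(1, a) = 2 */
    /* state 2 */ {3, 3},   /* delta(2, a) = 3, delta(2, b) = 3 */
    /* state 3 */ {0, 0},
};

/* F = {3} */
static const bool is_final[NUM_STATES] = {false, false, false, true};

/* Returns true if the automaton stops in a final state at end of input. */
bool dfa_accepts(const char *input)
{
    int state = 1;                           /* s0 = 1 */
    for (const char *p = input; *p != '\0'; p++) {
        int symbol;
        if (*p == 'a')
            symbol = 0;
        else if (*p == 'b')
            symbol = 1;
        else
            return false;                    /* character outside the alphabet */
        state = delta[state][symbol];
        if (state == 0)
            return false;                    /* no transition: reject */
    }
    return is_final[state];
}
```

Changing the language only requires changing the two tables; the loop stays the same.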
r = a
- Σ = {a}
- L = {a}
- S = {1, 2}
- δ(1, a) = 2
- s0 = 1
- F = {2}
r = a | b
- Σ = {a, b}
- L = {a, b}
- S = {1, 2}
- δ(1, a) = 2, δ(1, b) = 2
- s0 = 1
- F = {2}
r = a(a | b)
- Σ = {a, b}
- L = {aa, ab}
- S = {1, 2, 3}
- δ(1, a) = 2, δ(2, a) = 3, δ(2, b) = 3
- s0 = 1
- F = {3}
r = a*
- Σ = {a}
- L = {ε, a, aa, aaa, aaaa, …}
- S = {1}
- δ(1, ε) = 1, δ(1, a) = 1
- s0 = 1
- F = {1}
r = a+
- Σ = {a}
- L = {a, aa, aaa, aaaa, …}
- S = {1, 2}
- δ(1, a) = 2, δ(2, a) = 2
- s0 = 1
- F = {2}
- Note: a+ = aa*
r = (a | b)(a | b)b
- Σ = {a, b}
- L = {aab, abb, bab, bbb}
- S = {1, 2, 3, 4}
- δ(1, a) = 2, δ(1, b) = 2, δ(2, a) = 3, δ(2, b) = 3, δ(3, b) = 4
- s0 = 1
- F = {4}
r = (a | b)*
- Σ = {a, b}
- L = {ε, a, b, aa, bb, ba, ab, aaa, …, bbb, …, abab, …, baba, bbba, …}
- S = {1}
- δ(1, a) = 1, δ(1, b) = 1
- s0 = 1
- F = {1}
r = (a | b)+
- Σ = {a, b}
- L = {a, b, aa, ab, ba, bb, aaa, …, bbb, …}
- S = {1, 2}
- δ(1, a) = 2, δ(1, b) = 2, δ(2, a) = 2, δ(2, b) = 2
- s0 = 1
- F = {2}
- Note: (a | b)+ = (a | b)(a | b)*
r = a+ | b+
- Σ = {a, b}
- L = {a, aa, aaa, …, b, bb, bbb, …}
- S = {1, 2, 3}
- δ(1, a) = 2, δ(2, a) = 2, δ(1, b) = 3, δ(3, b) = 3
- s0 = 1
- F = {2, 3}
r = a(a | b)*
- Σ = {a, b}
- L = {a, aa, ab, …, aba, …, abb, …, abbb, …} (every string in L starts with a)
- S = {1, 2}
- δ(1, a) = 2, δ(2, a) = 2, δ(2, b) = 2
- s0 = 1
- F = {2}
r = a(b | a)b+
- Σ = {a, b}
- L = {aab, abb, aabb, …, abbb, abbbb, …}
- S = {1, 2, 3, 4}
- δ(1, a) = 2, δ(2, a) = 3, δ(2, b) = 3, δ(3, b) = 4, δ(4, b) = 4
- s0 = 1
- F = {4}
r = ab*a(a+ | b+)
- Σ = {a, b}
- L = {aaa, aab, abaa, abbaa, …, abbab, abbabbb, …}
- S = {1, 2, 3, 4, 5}
- δ(1, a) = 2, δ(2, b) = 2, δ(2, a) = 3, δ(3, a) = 4, δ(4, a) = 4, δ(3, b) = 5, δ(5, b) = 5
- s0 = 1
- F = {4, 5}
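
A DFA can also be coded directly as a switch on the current state, one case per state. The sketch below does this for the last automaton, assuming its regular expression is ab*a(a+ | b+) with states 1 to 5 and final states 4 and 5. The function name matches_example is invented for the illustration.

```c
#include <stdbool.h>

/* Hand-coded DFA: each case corresponds to one state of the diagram.
   A character with no transition from the current state rejects. */
bool matches_example(const char *s)
{
    int state = 1;                                   /* s0 = 1 */
    for (; *s != '\0'; s++) {
        char c = *s;
        switch (state) {
        case 1:
            if (c == 'a') state = 2; else return false;
            break;
        case 2:
            if (c == 'b') state = 2;
            else if (c == 'a') state = 3;
            else return false;
            break;
        case 3:
            if (c == 'a') state = 4;
            else if (c == 'b') state = 5;
            else return false;
            break;
        case 4:
            if (c == 'a') state = 4; else return false;
            break;
        case 5:
            if (c == 'b') state = 5; else return false;
            break;
        }
    }
    return state == 4 || state == 5;                 /* F = {4, 5} */
}
```

Compared with a transition table, this style is longer but makes it easy to attach actions to individual states, which is how hand-written lexers are often organized.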
Specification of Patterns for Tokens: Regular Definitions
- Regular definitions introduce a naming convention:
  d1 → r1
  d2 → r2
  …
  dn → rn
  where each ri is a regular expression over Σ ∪ {d1, d2, …, di-1}
- Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions
Specification of Patterns for Tokens: Regular Definitions
- Example:
  letter → A | B | … | Z | a | b | … | z
  digit → 0 | 1 | … | 9
  id → letter ( letter | digit )*
- Regular definitions are not recursive:
  digits → digit digits | digit    wrong!
Specification of Patterns for Tokens: Notational Shorthand
- The following shorthands are often used:
  r+ = rr*
  r? = r | ε
  [a-z] = a | b | c | … | z
- Examples:
  digit → [0-9]
  num → digit+ (. digit+)? ( E (+|-)? digit+ )?
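
The num pattern can be followed part by part in a hand-written matcher. The C sketch below assumes the definition num → digit+ (. digit+)? ( E (+|-)? digit+ )? and returns the length of the longest prefix that matches; the name match_num is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the length of the longest prefix of s that matches num,
   or 0 if s does not start with a digit. */
size_t match_num(const char *s)
{
    size_t i = 0;

    if (!isdigit((unsigned char)s[i]))
        return 0;
    while (isdigit((unsigned char)s[i]))          /* digit+ */
        i++;

    /* (. digit+)? : taken only if a digit follows the dot */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;
        while (isdigit((unsigned char)s[i]))
            i++;
    }

    /* ( E (+|-)? digit+ )? : taken only if it can be completed */
    if (s[i] == 'E') {
        size_t j = i + 1;
        if (s[j] == '+' || s[j] == '-')
            j++;
        if (isdigit((unsigned char)s[j])) {
            while (isdigit((unsigned char)s[j]))
                j++;
            i = j;
        }
    }
    return i;
}
```

Each optional part is committed only when it can be matched in full, so an input such as 7E+ yields the lexeme 7 and leaves E+ for the next token.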
Regular Definitions and Grammars
Grammar:
  stmt → if expr then stmt
       | if expr then stmt else stmt
       | ε
  expr → term relop term
       | term
  term → id
       | num
Regular definitions:
  if → if
  then → then
  else → else
  relop → < | <= | <> | > | >= | =
  id → letter ( letter | digit )*
  num → digit+ (. digit+)? ( E (+|-)? digit+ )?
Constructing Transition Diagrams for Tokens
- Transition Diagrams (TD) are used to represent the tokens; these are automata!
- As characters are read, the relevant TDs are used to attempt to match a lexeme to a pattern
- Each TD has:
  - States: represented by circles
  - Actions: represented by arrows between states
  - Start state: beginning of a pattern (arrowhead)
  - Final state(s): end of pattern (concentric circles)
- Each TD is deterministic: no need to choose between 2 different actions!
Example: All RELOPs
[Figure: transition diagram for relop]
Example TDs: id and delim
[Figure: transition diagrams for "keyword or id" and for delim]
Combine TD for KW and IDs
- install_id() decides the attribute:
  - It checks the accepted lexeme against the list of keywords; if it matches, zero is returned.
  - Otherwise it looks up the lexeme in the symbol table; if it is found, the address is returned.
  - If the lexeme is not found in the symbol table, install_id() first installs the ID in the symbol table and returns the address of the newly created entry.
- gettoken() decides the token:
  - If zero was returned by install_id(), the same word (or its numeric form) is returned as the token.
  - Otherwise the token ID is returned.
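
The division of labour between install_id() and gettoken() can be sketched in C. Several things here are assumptions made for the illustration: both functions take the lexeme as a parameter (on the slides they work on the current lexeme), the "address" is a 1-based index into a small array, and the token codes and table sizes are invented.

```c
#include <string.h>

#define MAX_SYMBOLS 64
#define MAX_LEXEME  32

/* Hypothetical token codes. */
enum { TOK_ID = 256, TOK_IF, TOK_THEN, TOK_ELSE };

static const struct { const char *word; int token; } keywords[] = {
    {"if", TOK_IF}, {"then", TOK_THEN}, {"else", TOK_ELSE},
};

static char symtab[MAX_SYMBOLS][MAX_LEXEME];   /* toy symbol table */
static int symcount = 0;

/* Returns the keyword's token code, or 0 if the lexeme is not a keyword. */
static int keyword_token(const char *lexeme)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(keywords[i].word, lexeme) == 0)
            return keywords[i].token;
    return 0;
}

/* Returns 0 for a keyword; otherwise the 1-based index ("address") of the
   lexeme in the symbol table, installing it first if it is new.
   Returns -1 if the table is full. Long lexemes are truncated in this sketch. */
int install_id(const char *lexeme)
{
    if (keyword_token(lexeme) != 0)
        return 0;
    for (int i = 0; i < symcount; i++)
        if (strcmp(symtab[i], lexeme) == 0)
            return i + 1;
    if (symcount == MAX_SYMBOLS)
        return -1;
    strncpy(symtab[symcount], lexeme, MAX_LEXEME - 1);
    symtab[symcount][MAX_LEXEME - 1] = '\0';
    return ++symcount;
}

/* Keyword: return its own token code. Anything else: the token ID. */
int gettoken(const char *lexeme, int attribute)
{
    return attribute == 0 ? keyword_token(lexeme) : TOK_ID;
}
```

A linear search is enough to show the idea; a production lexer would more likely use a hash table for both lookups.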
Example TDs: Unsigned Numbers
[Figure: transition diagrams for unsigned numbers]
Questions: Is ordering important for unsigned numbers? Why are there no TDs for then, else, if?
Keywords Recognition
All keywords / reserved words are matched as ids
- After the match, the symbol table or a special keyword table is consulted
- The keyword table contains string versions of all keywords and associated token values
- If a match is not found, then it is assumed that an id has been discovered
Transition Diagrams and Lexical Analyzers
int state = 0;

token nexttoken()
{
    while (1) {
        switch (state) {
        case 0:
            c = nextchar();   /* c is lookahead character */
            if (c == blank || c == tab || c == newline) {
                state = 0;
                lexeme_beginning++;   /* advance beginning of lexeme */
            }
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = fail();
            break;
        /* cases 1-8 here */
        case 9:
            c = nextchar();
            if (isletter(c)) state = 10;
            else state = fail();
            break;
        case 10:
            c = nextchar();
            if (isletter(c)) state = 10;
            else if (isdigit(c)) state = 10;
            else state = 11;
            break;
        case 11:
            retract(1); install_id();
            return ( gettoken() );
        /* cases 12-24 here */
        case 25:
            c = nextchar();
            if (isdigit(c)) state = 26;
            else state = fail();
            break;
        case 26:
            c = nextchar();
            if (isdigit(c)) state = 26;
            else state = 27;
            break;
        case 27:
            retract(1); install_num();
            return ( NUM );
        }
    }
}
Case numbers correspond to transition diagram states!
When Failures Occur
int state = 0, start = 0;
int lexical_value;   /* to return second component of token */

int fail()
{
    forward = token_beginning;
    switch (start) {
    case 0:  start = 9;  break;
    case 9:  start = 12; break;
    case 12: start = 20; break;
    case 20: start = 25; break;
    case 25: recover();  break;
    default: /* compiler error */ break;
    }
    return start;
}
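
To see nexttoken() and fail() working together, here is a trimmed, self-contained C sketch. It is not the full scanner from the slides: only the id diagram (states 9 to 11) and an integer-only num diagram (states 25 to 27) are kept, white space is skipped before the switch instead of in state 0, the input is an in-memory string, and the token codes and lexer_init are assumptions made for the example.

```c
#include <ctype.h>
#include <stddef.h>

enum { TOK_EOF = 0, TOK_ID, TOK_NUM, TOK_ERROR };   /* hypothetical codes */

static const char *input;        /* text being scanned */
static size_t token_beginning;   /* where the current lexeme starts */
static size_t forward;           /* next character to read */
static int state, start;

static int nextchar(void) { return (unsigned char)input[forward++]; }
static void retract(void) { forward--; }

/* Rewind to the start of the lexeme and move on to the next diagram:
   id (start state 9) is tried first, then num (start state 25). */
static int fail(void)
{
    forward = token_beginning;
    switch (start) {
    case 9:  start = 25; break;
    default: start = -1; break;   /* no diagram left to try */
    }
    return start;
}

void lexer_init(const char *text)
{
    input = text;
    token_beginning = forward = 0;
}

int nexttoken(void)
{
    int c;
    while (isspace((unsigned char)input[forward]))
        forward++;
    token_beginning = forward;
    if (input[forward] == '\0')
        return TOK_EOF;
    state = start = 9;
    for (;;) {
        switch (state) {
        case 9:  c = nextchar(); state = isalpha(c) ? 10 : fail(); break;
        case 10: c = nextchar(); state = isalnum(c) ? 10 : 11;     break;
        case 11: retract(); return TOK_ID;
        case 25: c = nextchar(); state = isdigit(c) ? 26 : fail(); break;
        case 26: c = nextchar(); state = isdigit(c) ? 26 : 27;     break;
        case 27: retract(); return TOK_NUM;
        default:                          /* every diagram failed */
            forward = token_beginning + 1; /* skip the offending character */
            return TOK_ERROR;
        }
    }
}
```

After each call, the lexeme is the text between token_beginning and forward. Skipping a single character on error is one simple recovery choice among several.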
Using a Lex Generator
- Lex source program lex.l → Lex compiler → lex.yy.c
- lex.yy.c → C compiler → a.out
- Input stream input.c → a.out → sequence of tokens