Title: Lexical Analysis
1Lexical Analysis
- Mooly Sagiv
- msagiv_at_post.tau.ac.il
- Schreiber 317
- 03-640-7606
- Wed 10:00-12:00
- http://www.math.tau.ac.il/msagiv/courses/wcc.html
- Textbook: Modern Compiler Implementation in C
- Chapter 2
2A motivating example
- Create a program that counts the number of lines
in a given input file
3A motivating example: solution
%option noyywrap
%{
#include <stdio.h>
int num_lines = 0;
%}
%%
\n      ++num_lines;
.       ;
%%
int main(void)
{
    yylex();
    printf("# of lines = %d\n", num_lines);
    return 0;
}
4Subjects
- Roles of lexical analysis
- The straightforward solution: a manual scanner for C
- Regular Expressions
- Finite automata
- From regular languages into finite automata
- Flex
5Basic Compiler Phases
Source program (string)
- lexical analysis (finite automata)
Tokens
- syntax analysis (pushdown automata)
Abstract syntax tree
- semantic analysis
- Translate (memory organization)
Intermediate representation
- Instruction selection (dynamic programming)
Assembly
- Register allocation (graph algorithms)
Fin. Assembly
6Example
Input:
a\b := 5 + 3 ;\nb := (print(a, a-1), 10 * a) ;\nprint(b)
Tokens:
id(a) assign num(5) + num(3) ; id(b) assign (print(id(a) , id(a) - num(1)), num(10) * id(a)) ; print(id(b))
7Lexical Analysis (Scanning)
- Functionality
  - input: program text (file)
  - output: sequence of tokens
- Read input file
- Identify language keywords and standard identifiers
- Handle include files and macros
- Count line numbers
- Remove whitespace
- Report illegal symbols
- Produce symbol table
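8A simplified scanner for C
Token nextToken()
{
    int c;                          /* int, so that EOF can be represented */
loop:
    c = getchar();
    switch (c) {
    case ' ': goto loop;            /* skip blanks */
    case ';': return SemiColumn;
    case '+':
        c = getchar();
        switch (c) {
        case '+': return PlusPlus;
        case '=': return PlusEqual;
        default:  ungetc(c, stdin); /* push the lookahead character back */
                  return Plus;
        }
    case '<': ...
    case 'w': ...
    }
}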
9Automatic Generation of Lexical Analysis
- The matching of input strings can be performed by a finite automaton
- Examples
- An automaton for while
- An automaton for C identifier
- An automaton for C comment
- The program for the automaton is automatically
generated from regular expressions
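As an illustration of the identifier automaton mentioned above, here is a minimal hand-coded sketch in C. The state names (`DEAD`, `START`, `IN_ID`) and the functions `step` and `is_identifier` are hypothetical names chosen for this example; a generated scanner would normally encode the same transitions in a table.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* States of a small DFA for C identifiers (names are illustrative). */
enum { DEAD = 0, START = 1, IN_ID = 2 };

/* Transition function: a letter or '_' starts an identifier;
   letters, digits and '_' continue it; anything else is dead. */
static int step(int state, unsigned char c)
{
    int letter = isalpha(c) || c == '_';
    if (state == START) return letter ? IN_ID : DEAD;
    if (state == IN_ID) return (letter || isdigit(c)) ? IN_ID : DEAD;
    return DEAD;
}

/* Returns 1 if the whole string s is accepted by the DFA, 0 otherwise. */
int is_identifier(const char *s)
{
    assert(s != NULL);
    int state = START;
    for (; *s != '\0' && state != DEAD; s++)
        state = step(state, (unsigned char)*s);
    return state == IN_ID;   /* IN_ID is the only accepting state */
}
```

Note that this automaton also accepts keywords such as while; telling keywords apart from identifiers is left to the rule ordering of the full scanner.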
10Flex
- Input
- regular expressions and actions (C code)
- Output
- A scanner program that reads the input and
applies actions when input regular expression is
matched
11Regular Expression Notations
a          An ordinary character stands for itself
M|N        M or N
MN         M followed by N
M*         Zero or more occurrences of M
M+         One or more occurrences of M
M?         Zero or one occurrence of M
[a-zA-Z]   Character set alternation (single character)
.          Any (single) character but newline
"a.+*"     Quotation
\          Convert an operator into text
12Ambiguity Resolving
- Find the longest matching token
- Between two tokens with the same length use the
one declared first
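The two rules above can be sketched in C as follows. This is only an illustration under stated assumptions: the token codes, the three hand-written matchers, and `pick_token` are hypothetical, and a real generated scanner applies the same policy while running a single automaton rather than trying each rule separately.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes, listed in declaration (priority) order. */
enum { T_NONE = -1, T_IF = 0, T_ID = 1, T_NUM = 2, N_RULES = 3 };

/* Each matcher returns the length of the longest prefix of s
   that its rule accepts (0 means no match). */
static size_t match_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_id(const char *s)
{
    size_t n = 0;
    if (!islower((unsigned char)s[0])) return 0;
    while (islower((unsigned char)s[n]) || isdigit((unsigned char)s[n])) n++;
    return n;
}

static size_t match_num(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Picks the rule with the longest match; on a tie the rule declared
   first wins. Stores the match length in *len. */
int pick_token(const char *s, size_t *len)
{
    size_t (*rules[N_RULES])(const char *) = { match_if, match_id, match_num };
    int best = T_NONE;
    assert(s != NULL && len != NULL);
    *len = 0;
    for (int i = 0; i < N_RULES; i++) {
        size_t n = rules[i](s);
        if (n > *len) {      /* strict '>' keeps the earlier rule on ties */
            *len = n;
            best = i;
        }
    }
    return best;
}
```

With these rules the input if is reported as the keyword (same length, declared first), while if8 is reported as an identifier (longer match).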
13A Flex specification of C Scanner
Letter  [a-zA-Z_]
Digit   [0-9]
%%
[ \t]                        { ; }
[\n]                         { line_count++; }
";"                          { return SemiColumn; }
"++"                         { return PlusPlus; }
"+="                         { return PlusEqual; }
"+"                          { return Plus; }
"while"                      { return While; }
{Letter}({Letter}|{Digit})*  { return Id; }
"<="                         { return LessOrEqual; }
"<"                          { return LessThen; }
14Running Example
if                                    { return IF; }
[a-z][a-z0-9]*                        { return ID; }
[0-9]+                                { return NUM; }
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)   { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)+           { /* do nothing */ }
.                                     { error(); }
15
int edges[][256] = {
/*              ..., 0, 1, 2, 3, ..., -, e, f, g, h, i, j, ... */
/* state 0 */   { 0, ..., 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 },
/* state 1 */   { 13, ..., 7, 7, 7, 7, ..., 9, 4, 4, 4, 4, 2, 4, ..., 13, 13 },
/* state 2 */   { 0, ..., 4, 4, 4, 4, ..., 0, 4, 3, 4, 4, 4, 4, ..., 0, 0 },
/* state 3 */   { 0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0 },
/* state 4 */   { 0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0 },
/* state 5 */   { 0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 },
/* state 6 */   { 0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 },
/* state 7 */   ...
/* state 13 */  { 0, ..., 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 }
};
16Pseudo Code for Scanner
Token nextToken()
{
    lastFinal = 0;
    currentState = 1;
    inputPositionAtLastFinal = input;
    currentPosition = input;
    while (not(isDead(currentState))) {
        nextState = edges[currentState][*currentPosition];
        advance currentPosition;
        if (isFinal(nextState)) {
            lastFinal = nextState;
            /* recorded position is just past the last matched character */
            inputPositionAtLastFinal = currentPosition;
        }
        currentState = nextState;
    }
    input = inputPositionAtLastFinal;
    return action[lastFinal];
}
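The pseudo code above can be made concrete with a small table-driven sketch in C. Everything here is an assumption made for illustration: the six-state automaton covers only IF, ID and NUM, the state and token names are hypothetical, the table is filled at run time instead of being generated, and the character ranges assume an ASCII-like encoding where 'a'..'z' and '0'..'9' are contiguous.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states and token codes for a tiny IF/ID/NUM scanner. */
enum { DEAD = 0, START = 1, S_I = 2, S_IF = 3, S_ID = 4, S_NUM = 5, N_STATES = 6 };
enum { TOK_NONE = 0, TOK_IF, TOK_ID, TOK_NUM };

/* Zero-initialised, so every unlisted transition leads to DEAD. */
static unsigned char edges[N_STATES][256];

/* action[s] is the token reported when s is the last final state seen. */
static const int action[N_STATES] =
    { TOK_NONE, TOK_NONE, TOK_ID, TOK_IF, TOK_ID, TOK_NUM };

static void set_range(int from, int lo, int hi, int to)
{
    for (int c = lo; c <= hi; c++)
        edges[from][c] = (unsigned char)to;
}

static void build_table(void)
{
    set_range(START, 'a', 'z', S_ID);   edges[START]['i'] = S_I;
    set_range(START, '0', '9', S_NUM);
    set_range(S_I,   'a', 'z', S_ID);   edges[S_I]['f'] = S_IF;
    set_range(S_I,   '0', '9', S_ID);
    set_range(S_IF,  'a', 'z', S_ID);   set_range(S_IF, '0', '9', S_ID);
    set_range(S_ID,  'a', 'z', S_ID);   set_range(S_ID, '0', '9', S_ID);
    set_range(S_NUM, '0', '9', S_NUM);
}

/* Scans the longest token starting at *input and advances *input past it.
   Returns TOK_NONE (and leaves *input unchanged) if nothing matches. */
int next_token(const char **input)
{
    assert(input != NULL && *input != NULL);
    static int built = 0;
    if (!built) { build_table(); built = 1; }

    int last_final = DEAD, state = START;
    const char *pos = *input, *pos_at_last_final = *input;
    while (state != DEAD) {
        state = edges[state][(unsigned char)*pos];
        pos++;                              /* now just past the character read */
        if (action[state] != TOK_NONE) {    /* state is final: remember it */
            last_final = state;
            pos_at_last_final = pos;
        }
    }
    *input = pos_at_last_final;             /* back up to the end of the match */
    return action[last_final];
}
```

The terminating '\0' has no outgoing edge, so the loop always ends in the dead state at the end of the string at the latest.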
17Example
Input: if --not-a-com
18Efficient Scanners
- Efficient state representation
- Input buffering
- Using switch and goto instead of tables
19Constructing Automaton from Specification
- Create a non-deterministic automaton (NDFA) from every regular expression
- Merge all the automata using epsilon moves (like the | construction)
- Construct a deterministic finite automaton (DFA)
- Minimize the automaton, starting with separate accepting states
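The step from NDFA to DFA is commonly done with the subset construction, where each DFA state is a set of NDFA states closed under epsilon moves. The following is a minimal sketch, assuming at most 32 NDFA states so that a set fits in one bit mask; the `Nfa` type and the functions `eps_closure` and `dfa_edge` are illustrative names, not part of any tool.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define MAX_STATES 32

/* A small NDFA. Sets of states are 32-bit masks: bit i stands for state i. */
typedef struct {
    int n_states;
    uint32_t eps[MAX_STATES];         /* eps[s]: states reached from s by one epsilon move */
    uint32_t delta[MAX_STATES][256];  /* delta[s][c]: states reached from s on character c */
} Nfa;

/* Smallest superset of `set` that is closed under epsilon moves. */
uint32_t eps_closure(const Nfa *nfa, uint32_t set)
{
    assert(nfa != NULL && nfa->n_states <= MAX_STATES);
    uint32_t closure = set, prev;
    do {                               /* iterate until no new state is added */
        prev = closure;
        for (int s = 0; s < nfa->n_states; s++)
            if (closure & (UINT32_C(1) << s))
                closure |= nfa->eps[s];
    } while (closure != prev);
    return closure;
}

/* One step of the subset construction: the DFA state reached from the
   DFA state `set` on input character c (0 is the dead state). */
uint32_t dfa_edge(const Nfa *nfa, uint32_t set, unsigned char c)
{
    uint32_t moved = 0;
    for (int s = 0; s < nfa->n_states; s++)
        if (set & (UINT32_C(1) << s))
            moved |= nfa->delta[s][c];
    return eps_closure(nfa, moved);
}
```

The DFA start state is `eps_closure` of the NDFA start state; repeatedly applying `dfa_edge` to every discovered set and every character yields the remaining DFA states.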
20NDFA Construction
if                                    { return IF; }
[a-z][a-z0-9]*                        { return ID; }
[0-9]+                                { return NUM; }
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)   { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)+           { /* do nothing */ }
.                                     { error(); }
21DFA Construction
22Minimization
23
%{
/* C declarations */
#include "tokens.h"    /* Mapping of tokens into integers */
#include "errormsg.h"  /* Shared by all the phases */
union { int ival; string sval; double fval; } yylval;
int charPos = 1;
#define ADJ (EM_tokPos = charPos, charPos += yyleng)
%}
/* Lex Definitions */
digits [0-9]+
%%
if                                           { ADJ; return IF; }
[a-z][a-z0-9]*                               { ADJ; yylval.sval = String(yytext); return ID; }
{digits}                                     { ADJ; yylval.ival = atoi(yytext); return NUM; }
({digits}\.{digits}?)|({digits}?\.{digits})  { ADJ; yylval.fval = atof(yytext); return REAL; }
(\-\-[a-z]*\n)|(\n|\t|" ")+                  { ADJ; }
.                                            { ADJ; EM_error("illegal character"); }
24Start States
- Regular expressions may be more complicated than automata
  - C comments
- Solutions
  - Conversion of automata into regular expressions
  - Start States
%start s1 s2
%%
<INITIAL>r1  { action0; BEGIN s1; }
<s1>r1       { action1; BEGIN s2; }
<s2>r2       { action2; BEGIN INITIAL; }
25Realistic Example
%start Comment
%%
<INITIAL>"/*"   { BEGIN Comment; }
<INITIAL>r1     { Usual actions }
<INITIAL>r2     { Usual actions }
...
<INITIAL>rk     { Usual actions }
<Comment>"*/"   { BEGIN INITIAL; }
<Comment>.|\n   ;
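To show what the Comment start state does operationally, here is a hand-written C sketch that mimics the two start states on a string. It is an approximation for illustration only: `strip_comments` and the `StartState` enum are hypothetical names, ordinary text is simply copied where the specification would run the usual actions, and comment openers inside string literals are not treated specially.

```c
#include <assert.h>
#include <stddef.h>

/* The two start states of the specification above. */
enum StartState { INITIAL, COMMENT };

/* Copies src to dst while dropping C comments.
   dst must have room for strlen(src) + 1 bytes.
   Returns the start state at end of input; COMMENT means that
   a comment was opened but never closed. */
int strip_comments(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    enum StartState state = INITIAL;
    while (*src != '\0') {
        if (state == INITIAL && src[0] == '/' && src[1] == '*') {
            state = COMMENT;        /* comment opener: BEGIN Comment */
            src += 2;
        } else if (state == COMMENT && src[0] == '*' && src[1] == '/') {
            state = INITIAL;        /* comment closer: BEGIN INITIAL */
            src += 2;
        } else if (state == COMMENT) {
            src++;                  /* any other character inside a comment is skipped */
        } else {
            *dst++ = *src++;        /* stands in for the usual actions */
        }
    }
    *dst = '\0';
    return state;
}
```

Checking the returned state at end of input is one simple way to report an unterminated comment.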
26Summary
- For most programming languages lexical analyzers can be easily constructed
- Exceptions
  - Fortran
  - PL/1
- Flex is a useful tool beyond compilers