Lexical Analysis - PowerPoint PPT Presentation

1
Lexical Analysis
  • Mooly Sagiv
  • msagiv_at_post.tau.ac.il
  • Schreiber 317
  • 03-640-7606
  • Wed 10:00-12:00
  • html//www.math.tau.ac.il/msagiv/courses/wcc.html
  • Textbook: Modern Compiler Implementation in C
  • Chapter 2

2
A motivating example
  • Create a program that counts the number of lines
    in a given input file

3
A motivating example: solution
        int num_lines = 0;
%%
\n      ++num_lines;
.       ;
%%
int main(void)
{ yylex(); printf("# of lines = %d\n", num_lines); return 0; }
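
For comparison, a hand-written C counterpart of the specification above might look like the sketch below. The name count_lines and its FILE * interface are assumptions made for illustration; they do not appear in the slides.

```c
#include <stdio.h>

/* Count newline characters in a stream: the hand-coded equivalent of the
   two rules "\n" (increment the counter) and "." (ignore the character). */
int count_lines(FILE *in)
{
    int num_lines = 0;
    int c;                      /* int, so that EOF is distinguishable */

    while ((c = getc(in)) != EOF)
        if (c == '\n')
            ++num_lines;
    return num_lines;
}
```

Calling count_lines(stdin) and printing the result would mirror the main() of the Flex version.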
4
Subjects
  • Roles of lexical analysis
  • The straightforward solution: a manual scanner
    for C
  • Regular Expressions
  • Finite automata
  • From regular languages into finite automata
  • Flex

5
Basic Compiler Phases
Source program (string)
  -> lexical analysis [finite automata]
Tokens
  -> syntax analysis [pushdown automata]
Abstract syntax tree
  -> semantic analysis
  -> Translate [memory organization]
Intermediate representation
  -> Instruction selection [dynamic programming]
Assembly
  -> Register Allocation [graph algorithms]
Fin. Assembly
6
Example
  • Input string
    a\b := 5 + 3 ;\nb := (print(a, a-1), 10 * a) ;\nprint(b)
  • Tokens
    id(a) assign num(5) + num(3) ; id(b) assign
    (print(id(a), id(a) - num(1)), num(10) * id(a)) ;
    print(id(b))
7
Lexical Analysis (Scanning)
  • Functionality
  • input: program text (file)
  • output: sequence of tokens
  • Read input file
  • Identify language keywords and standard
    identifiers
  • Handle include files and macros
  • Count line numbers
  • Remove whitespace
  • Report illegal symbols
  • Produce symbol table

8
A simplified scanner for C
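Token nextToken()
{ int c;                                /* int, so that EOF can be represented */
loop: c = getchar();
  switch (c) {
  case ' ': goto loop;                  /* skip blanks */
  case ';': return SemiColumn;
  case '+': c = getchar();
    switch (c) {
    case '+': return PlusPlus;
    case '=': return PlusEqual;
    default:  ungetc(c, stdin); return Plus;  /* push back the lookahead */
    }
  case '<': ...
  case 'w': ...
  }
}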
9
Automatic Generation of Lexical Analysis
  • The matching of input strings can be performed by
    a finite automaton
  • Examples
  • An automaton for while
  • An automaton for C identifier
  • An automaton for C comment
  • The program for the automaton is automatically
    generated from regular expressions
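
As an illustration of the second example, an automaton for a C identifier can be coded by hand as a two-state machine. This is only a sketch: the function name match_identifier and the choice to return the match length are assumptions, not taken from the slides.

```c
#include <ctype.h>
#include <stddef.h>

/* Two-state automaton for a C identifier: [A-Za-z_][A-Za-z0-9_]*.
   Returns the length of the longest identifier prefix of s (0 if none). */
size_t match_identifier(const char *s)
{
    enum { START, IN_ID } state = START;
    size_t i = 0;

    for (;;) {
        unsigned char c = (unsigned char)s[i];
        switch (state) {
        case START:                       /* first character: letter or '_' */
            if (isalpha(c) || c == '_') { state = IN_ID; i++; }
            else return 0;
            break;
        case IN_ID:                       /* later characters: letter, digit or '_' */
            if (isalnum(c) || c == '_') i++;
            else return i;                /* IN_ID is the accepting state */
            break;
        }
    }
}
```

A generated scanner encodes the same states in a table or in switch statements produced from the regular expression.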

10
Flex
  • Input
  • regular expressions and actions (C code)
  • Output
  • A scanner program that reads the input and
    applies the action whenever an input regular
    expression is matched

regular expressions and actions -> flex -> scanner program
11
Regular Expression Notations
a          An ordinary character stands for itself
M|N        M or N
MN         M followed by N
M*         Zero or more occurrences of M
M+         One or more occurrences of M
M?         Zero or one occurrence of M
[a-zA-Z]   Character set alternation (single character)
.          Any (single) character but newline
"a.+*"     Quotation
\          Convert an operator into text
12
Ambiguity Resolving
  • Find the longest matching token
  • Between two tokens of the same length, use the
    one declared first
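
The two rules can be sketched in C as follows. This is an illustration under assumptions: the matchers match_if, match_id and match_num and the function choose_rule are hypothetical names, and each rule is tried separately here, whereas a generated scanner runs a single automaton.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum { RULE_IF, RULE_ID, RULE_NUM, RULE_NONE };   /* declaration order */

/* Each matcher returns the length of the prefix of s it accepts, or 0. */
static size_t match_if(const char *s) { return strncmp(s, "if", 2) == 0 ? 2 : 0; }

static size_t match_id(const char *s)             /* [a-z][a-z0-9]* */
{
    size_t n = 0;
    if (!islower((unsigned char)s[0])) return 0;
    while (islower((unsigned char)s[n]) || isdigit((unsigned char)s[n])) n++;
    return n;
}

static size_t match_num(const char *s)            /* [0-9]+ */
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Returns the chosen rule and stores the match length in *len. */
int choose_rule(const char *s, size_t *len)
{
    size_t (*rules[])(const char *) = { match_if, match_id, match_num };
    int best = RULE_NONE;

    *len = 0;
    for (int r = 0; r < 3; r++) {
        size_t n = rules[r](s);
        if (n > *len) {       /* strictly longer wins; ties keep the earlier rule */
            *len = n;
            best = r;
        }
    }
    return best;
}
```

With this sketch, "if" is reported as RULE_IF (a tie, resolved by declaration order), while "iff" and "if8" are reported as RULE_ID (the longest match).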

13
A Flex specification of a C scanner
Letter  [a-zA-Z_]
Digit   [0-9]
%%
[ \t]                        { ; }
\n                           { line_count++; }
";"                          { return SemiColumn; }
"++"                         { return PlusPlus; }
"+="                         { return PlusEqual; }
"+"                          { return Plus; }
"while"                      { return While; }
{Letter}({Letter}|{Digit})*  { return Id; }
"<="                         { return LessOrEqual; }
"<"                          { return LessThen; }
14
Running Example
if                                   { return IF; }
[a-z][a-z0-9]*                       { return ID; }
[0-9]+                               { return NUM; }
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)  { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)+          { ; }
.                                    { error(); }
15
int edges[][256] = {
/*                ..., 0, 1, 2, 3, ..., -, e, f, g, h, i, j, ... */
/* state 0 */   { 0, ..., 0, 0, ..., 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0 },
/* state 1 */   { 13, ..., 7, 7, 7, 7, ..., 9, 4, 4, 4, 4, 2, 4, ..., 13, 13 },
/* state 2 */   { 0, ..., 4, 4, 4, 4, ..., 0, 4, 3, 4, 4, 4, 4, ..., 0, 0 },
/* state 3 */   { 0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0 },
/* state 4 */   { 0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0 },
/* state 5 */   { 0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 },
/* state 6 */   { 0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 },
/* state 7 */   ...
/* state 13 */  { 0, ..., 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 }
};
16
Pseudo Code for Scanner
Token nextToken()
{
  lastFinal = 0;
  currentState = 1;
  inputPositionAtLastFinal = input;
  currentPosition = input;
  while (not(isDead(currentState))) {
    nextState = edges[currentState][*currentPosition];
    if (isFinal(nextState)) {
      lastFinal = nextState;
      inputPositionAtLastFinal = currentPosition;
    }
    currentState = nextState;
    advance currentPosition;
  }
  input = inputPositionAtLastFinal;
  return action[lastFinal];
}
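
A runnable variant of this loop, for a much smaller automaton, might look like the sketch below. Several things are assumptions made for illustration: the token kinds T_IF, T_ID and T_NUM, the six-state table built by init_tables, the ASCII character set, and the decision to record the position after the character has been consumed, so that the next call resumes just past the longest match.

```c
enum { T_NONE, T_IF, T_ID, T_NUM };
enum { DEAD = 0, START = 1, NSTATES = 6 };

static int edges[NSTATES][256];        /* static storage: every entry starts as DEAD */
static const int action[NSTATES] = { T_NONE, T_NONE, T_ID, T_IF, T_ID, T_NUM };

static void set_range(int from, int lo, int hi, int to)
{
    for (int c = lo; c <= hi; c++) edges[from][c] = to;
}

/* States: 2 = "i" seen, 3 = "if" seen, 4 = identifier, 5 = number. */
void init_tables(void)
{
    set_range(START, 'a', 'z', 4);  edges[START]['i'] = 2;
    set_range(START, '0', '9', 5);
    set_range(2, 'a', 'z', 4);      edges[2]['f'] = 3;
    set_range(2, '0', '9', 4);
    for (int s = 3; s <= 4; s++) {  /* "if" followed by more text is an identifier */
        set_range(s, 'a', 'z', 4);
        set_range(s, '0', '9', 4);
    }
    set_range(5, '0', '9', 5);
}

/* Scans one token starting at *input. On success, advances *input past the
   longest match and returns its kind; otherwise returns T_NONE and leaves
   *input unchanged. */
int next_token(const char **input)
{
    int lastFinal = DEAD;              /* DEAD doubles as "no final state seen" */
    int currentState = START;
    const char *cur = *input;
    const char *posAtLastFinal = *input;

    for (;;) {
        currentState = edges[currentState][(unsigned char)*cur];
        if (currentState == DEAD)
            break;                     /* no transition (or end of string) */
        cur++;                         /* consume the character */
        if (action[currentState] != T_NONE) {
            lastFinal = currentState;
            posAtLastFinal = cur;
        }
    }
    *input = posAtLastFinal;
    return action[lastFinal];
}
```

After init_tables(), repeated calls to next_token on "123abc" yield T_NUM and then T_ID.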
17
Example
Input: if --not-a-com
18
Efficient Scanners
  • Efficient state representation
  • Input buffering
  • Using switch and goto instead of tables

19
Constructing Automaton from Specification
  • Create a non-deterministic automaton (NDFA) from
    every regular expression
  • Merge all the automata using epsilon moves (like
    the | construction)
  • Construct a deterministic finite automaton (DFA)
  • Minimize the automaton starting with separate
    accepting states
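
The merge and determinization steps can be sketched in C for two rules, if and [a-z][a-z0-9]*. This is an illustration under assumptions, not the construction used by Flex: the six-state NFA, the bit-set representation, and the names closure, build_dfa and run_dfa are all hypothetical, and minimization is not shown.

```c
#include <stdint.h>
#include <string.h>

#define NFA_STATES 6
#define MAX_DFA    16                      /* ample for this small example */
enum { NO_RULE = -1, RULE_IF = 0, RULE_ID = 1 };

static uint32_t eps[NFA_STATES];           /* epsilon successors, as bit sets */
static uint32_t delta[NFA_STATES][256];    /* labelled successors, as bit sets */
static const int nfa_rule[NFA_STATES] =
    { NO_RULE, NO_RULE, NO_RULE, RULE_IF, NO_RULE, RULE_ID };

static uint32_t dfa_set[MAX_DFA];          /* NFA-state set behind each DFA state */
static int dfa_edge[MAX_DFA][256];         /* -1 means no transition */
static int dfa_rule[MAX_DFA];
static int dfa_count;

static uint32_t closure(uint32_t set)      /* add everything reachable by epsilon moves */
{
    uint32_t prev;
    do {
        prev = set;
        for (int s = 0; s < NFA_STATES; s++)
            if (set & (1u << s)) set |= eps[s];
    } while (set != prev);
    return set;
}

static int dfa_state_for(uint32_t set)     /* find the DFA state for a set, or create it */
{
    for (int d = 0; d < dfa_count; d++)
        if (dfa_set[d] == set) return d;
    dfa_set[dfa_count] = set;
    dfa_rule[dfa_count] = NO_RULE;
    for (int s = 0; s < NFA_STATES; s++)   /* the earliest declared rule wins */
        if ((set & (1u << s)) && nfa_rule[s] != NO_RULE &&
            (dfa_rule[dfa_count] == NO_RULE || nfa_rule[s] < dfa_rule[dfa_count]))
            dfa_rule[dfa_count] = nfa_rule[s];
    return dfa_count++;
}

void build_dfa(void)
{
    /* Merged NFA: state 0 has epsilon moves to the start states 1 and 4. */
    memset(eps, 0, sizeof eps);
    memset(delta, 0, sizeof delta);
    eps[0] = (1u << 1) | (1u << 4);
    delta[1]['i'] = 1u << 2;               /* 1 -i-> 2 -f-> 3 accepts "if" */
    delta[2]['f'] = 1u << 3;
    for (int c = 'a'; c <= 'z'; c++) { delta[4][c] = 1u << 5; delta[5][c] = 1u << 5; }
    for (int c = '0'; c <= '9'; c++) delta[5][c] = 1u << 5;

    dfa_count = 0;
    dfa_state_for(closure(1u << 0));       /* DFA state 0 is the start state */
    for (int d = 0; d < dfa_count; d++)    /* dfa_count grows as states are discovered */
        for (int c = 0; c < 256; c++) {
            uint32_t target = 0;
            for (int s = 0; s < NFA_STATES; s++)
                if (dfa_set[d] & (1u << s)) target |= delta[s][c];
            dfa_edge[d][c] = target ? dfa_state_for(closure(target)) : -1;
        }
}

/* Rule accepted when the whole string s is read, or NO_RULE. */
int run_dfa(const char *s)
{
    int d = 0;
    for (; *s; s++) {
        d = dfa_edge[d][(unsigned char)*s];
        if (d < 0) return NO_RULE;
    }
    return dfa_rule[d];
}
```

A DFA state whose set contains accepting NFA states of both rules is labelled with the rule declared first, which is how the tie between if and an identifier is settled in this sketch.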

20
NDFA Construction
if                                   { return IF; }
[a-z][a-z0-9]*                       { return ID; }
[0-9]+                               { return NUM; }
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)  { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)+          { ; }
.                                    { error(); }
21
DFA Construction
22
Minimization
23
%{
/* C declarations */
#include "tokens.h"    /* Mapping of tokens into integers */
#include "errormsg.h"  /* Shared by all the phases */
union {int ival; string sval; double fval;} yylval;
int charPos = 1;
#define ADJ (EM_tokPos = charPos, charPos += yyleng)
%}
/* Lex Definitions */
digits [0-9]+
%%
if                                           {ADJ; return IF;}
[a-z][a-z0-9]*                               {ADJ; yylval.sval = String(yytext); return ID;}
{digits}                                     {ADJ; yylval.ival = atoi(yytext); return NUM;}
({digits}\.{digits}?)|({digits}?\.{digits})  {ADJ; yylval.fval = atof(yytext); return REAL;}
(\-\-[a-z]*\n)|(\n|\t|" ")+                  {ADJ;}
.                                            {ADJ; EM_error("illegal character");}
24
Start States
  • Regular expressions may be more complicated than
    automata
  • C comments
  • Solutions
  • Conversion of automata into regular expressions
  • Start States

%start s1 s2
<INITIAL>r1   { action0; BEGIN s1; }
<s1>r1        { action1; BEGIN s2; }
<s2>r2        { action2; BEGIN INITIAL; }
25
Realistic Example
%start Comment
<INITIAL>"/*"   { BEGIN Comment; }
<INITIAL>r1     { Usual actions }
<INITIAL>r2     { Usual actions }
...
<INITIAL>rk     { Usual actions }
<Comment>"*/"   { BEGIN INITIAL; }
<Comment>.|\n   { ; }
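
The effect of the two start states can be imitated by hand in C with an explicit state variable. The sketch below is illustrative only: strip_comments is a hypothetical helper, copying a character stands in for the usual actions, and string literals containing comment delimiters are not handled.

```c
/* Copies src to dst while dropping C comments. dst must have room for
   strlen(src) + 1 bytes. The two states mirror INITIAL and Comment. */
void strip_comments(const char *src, char *dst)
{
    enum { INITIAL, COMMENT } state = INITIAL;

    while (*src) {
        if (state == INITIAL && src[0] == '/' && src[1] == '*') {
            state = COMMENT;            /* like BEGIN Comment */
            src += 2;
        } else if (state == COMMENT && src[0] == '*' && src[1] == '/') {
            state = INITIAL;            /* like BEGIN INITIAL */
            src += 2;
        } else {
            if (state == INITIAL)
                *dst++ = *src;          /* stands in for the usual actions */
            src++;                      /* in COMMENT: skip, like the .|\n rule */
        }
    }
    *dst = '\0';
}
```

The rules that apply to a character depend on the current state, which is what a start-state prefix on a Flex rule expresses.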
26
Summary
  • For most programming languages lexical analyzers
    can be easily constructed
  • Exceptions
  • Fortran
  • PL/1
  • Flex is a useful tool beyond compilers