Lexical Analysis - PowerPoint PPT Presentation

Slides: 63
Provided by: Baoji1
Transcript and Presenter's Notes

1
Lexical Analysis
  • Compiler
  • Baojian Hua
  • bjhua_at_ustc.edu.cn

2
Compiler
source program -> compiler -> target program
3
Front and Back Ends
source program -> front end -> IR -> back end -> target program
4
Front End
source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> semantic analyzer -> IR
5
Lexical Analyzer
  • The lexical analyzer translates the source
    program into a stream of lexical tokens
  • Source program
      • a stream of characters
      • the character set varies from language to language
        (ASCII, Unicode, or ...)
  • Lexical token
      • a compiler-internal data structure that represents
        the occurrence of a terminal symbol
      • varies from compiler to compiler

6
Conceptually
character sequence -> lexical analyzer -> token sequence
7
Example
  • Recall the min-ML language in code3
  • prog -> decs
  • decs -> dec ; decs
  • dec  -> val id = exp
  •       | val _ = printInt exp
  • exp  -> id
  •       | num
  •       | exp + exp
  •       | true
  •       | false
  •       | if (exp) then exp else exp
  •       | (exp)

8
Example
val x = 3; val y = 4; val z = if (2) then
(x) else y; val _ = printInt z;

lexical analysis:
VAL IDENT(x) ASSIGN INT(3) SEMICOLON VAL
IDENT(y) ASSIGN INT(4) SEMICOLON VAL IDENT(z)
ASSIGN IF LPAREN INT(2) RPAREN THEN LPAREN
IDENT(x) RPAREN ELSE IDENT(y) SEMICOLON VAL
UNDERSCORE ASSIGN PRINTINT IDENT(z) SEMICOLON EOF
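
The token stream above can be reproduced with a few lines of code. The following is a minimal sketch in Python (not the course's SML), using the standard `re` module. The names `RULES`, `MASTER`, and `tokenize` are illustrative and not part of the course code, and only the tokens that appear in the stream above are covered.

```python
import re

# (token name, pattern) pairs. Python tries alternatives left to right and
# takes the first one that matches, so keywords are listed before IDENT and
# end with \b so they cannot match a prefix of a longer identifier.
RULES = [
    ("VAL", r"val\b"),
    ("IF", r"if\b"),
    ("THEN", r"then\b"),
    ("ELSE", r"else\b"),
    ("PRINTINT", r"printInt\b"),
    ("INT", r"[0-9]+"),
    ("IDENT", r"[a-zA-Z][_a-zA-Z0-9]*"),
    ("UNDERSCORE", r"_"),
    ("ASSIGN", r"="),
    ("SEMICOLON", r";"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in RULES))


def tokenize(src):
    """Turn a character sequence into a list of token strings."""
    tokens, pos = [], 0
    while pos < len(src):
        m = MASTER.match(src, pos)
        if m is None:
            raise ValueError(f"illegal character {src[pos]!r} at {pos}")
        kind = m.lastgroup
        if kind in ("IDENT", "INT"):
            tokens.append(f"{kind}({m.group()})")  # keep the lexeme
        elif kind != "SKIP":                       # whitespace yields no token
            tokens.append(kind)
        pos = m.end()
    tokens.append("EOF")
    return tokens


program = "val x = 3; val y = 4; val z = if (2) then (x) else y; val _ = printInt z;"
print(" ".join(tokenize(program)))
```

The `\b` anchors stand in for the longest-match rule that a generated lexer applies automatically.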
9
Lexer Implementation
  • Options
  • Write a lexer by hand from scratch
  • boring, error-prone, and too much work
  • see the Dragon Book, Sec. 3.4
  • Automatic lexer generator
  • Quick and easy

10
Lexer Implementation
declarative specification -> lexical analyzer
11
Regular Expressions
  • How to specify a lexer?
  • Develop another language
  • Regular expressions
  • What's a lexer generator?
  • Another compiler

12
Basic Definitions
  • Alphabet: the char set (say ASCII or Unicode)
  • String: a finite sequence of chars from the alphabet
  • Language: a set of strings
  • finite or infinite
  • say the C language

13
Regular Expression (RE)
  • Construction by induction
  • each c \in alphabet is an RE
  •     a = {a}
  • empty: \eps
  • for M and N, then M|N
  •     (a|b) = {a, b}
  • for M and N, then MN
  •     (a|b)(c|d) = {ac, ad, bc, bd}
  • for M, then M* (Kleene closure)
  •     (a|b)* = {\eps, a, aa, b, ab, abb, baa, ...}

14
Regular Expression
  • Or more formally

e -> \eps
   | c
   | e | e
   | e e
   | e*
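
To make the inductive definition concrete, here is a small sketch in Python that represents core regular expressions as a data type and enumerates the strings of bounded length that an expression denotes. The class names (`Eps`, `Chr`, `Alt`, `Cat`, `Star`) and the function `lang` are assumptions made for this illustration. The length bound exists only because the language of a closure is infinite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Eps:            # the empty string
    pass


@dataclass(frozen=True)
class Chr:            # a single character c from the alphabet
    c: str


@dataclass(frozen=True)
class Alt:            # M | N
    left: object
    right: object


@dataclass(frozen=True)
class Cat:            # M N
    left: object
    right: object


@dataclass(frozen=True)
class Star:           # M* (Kleene closure)
    body: object


def lang(e, n):
    """Return the set of strings of length <= n denoted by e."""
    if isinstance(e, Eps):
        return {""}
    if isinstance(e, Chr):
        return {e.c} if n >= 1 else set()
    if isinstance(e, Alt):
        return lang(e.left, n) | lang(e.right, n)
    if isinstance(e, Cat):
        rights = lang(e.right, n)
        return {x + y for x in lang(e.left, n) for y in rights
                if len(x) + len(y) <= n}
    if isinstance(e, Star):
        body = lang(e.body, n)
        result, frontier = {""}, {""}
        while frontier:  # keep appending one more copy of the body
            frontier = {x + y for x in frontier for y in body
                        if len(x) + len(y) <= n} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {e!r}")


a_or_b = Alt(Chr("a"), Chr("b"))
c_or_d = Alt(Chr("c"), Chr("d"))
print(sorted(lang(Cat(a_or_b, c_or_d), 2)))   # the four strings ac, ad, bc, bd
print(sorted(lang(Star(a_or_b), 2)))
```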
15
Example
  • C's identifier
  • starts with a letter ("_" counts as a letter)
  • followed by zero or more letters or digits
  • (...) (...)*
  • (_|a|b|...|z|A|B|...|Z) (...)*
  • (_|a|b|...|z|A|B|...|Z)(_|a|b|...|z|A|B|...|Z|0|...|9)*
  • It's really error-prone and tedious

16
Syntax Sugar
  • More syntax sugar
  • [a-z]     a|b|...|z
  • e+        one or more of e
  • e?        zero or one of e
  • "a*"      a* itself (a literal string, not a closure)
  • e{i, j}   at least i and at most j repetitions of e
  • .         any char except \n
  • All these can be translated into core RE
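
The claim that the sugar translates into core RE can be spot-checked mechanically. The sketch below uses Python's `re` module, whose syntax for these operators is close to, but not identical with, the lexer-generator syntax. The helper `same_language` is a hypothetical name, and it only compares the two patterns on all strings up to a small length, so it is a sanity check rather than a proof.

```python
import re
from itertools import product


def same_language(sugar, core, alphabet="ab", max_len=4):
    """True if both patterns accept exactly the same strings up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(sugar, s)) != bool(re.fullmatch(core, s)):
                return False
    return True


# sugared form            core form (only |, concatenation, *, empty)
PAIRS = [
    ("[ab]",              "(a|b)"),
    ("a+",                "aa*"),
    ("a?",                "(a|)"),
    ("a{1,3}",            "(a|aa|aaa)"),
]

for sugar, core in PAIRS:
    print(sugar, "==", core, ":", same_language(sugar, core))
```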

17
Example Revisited
  • C's identifier
  • starts with a letter ("_" counts as a letter)
  • followed by zero or more letters or digits
  • (...) (...)*
  • (_|a|b|...|z|A|B|...|Z) (...)*
  • (_|a|b|...|z|A|B|...|Z)(_|a|b|...|z|A|B|...|Z|0|...|9)*
  • [_a-zA-Z][_a-zA-Z0-9]*
  • What about the keyword if?

18
Ambiguous Rule
  • A single RE is not ambiguous
  • But in a language, there may be many REs:
  • [_a-zA-Z][_a-zA-Z0-9]*
  • if
  • So, for a given string, which RE should match?

19
Ambiguous Rule
  • Two conventions
  • Longest match: the regular expression that
    matches the longest string takes precedence.
  • Rule priority: the regular expressions
    identifying tokens are written down in sequence.
    If two regular expressions match the same
    (longest) string, the first regular expression in
    the sequence takes precedence.
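
The two conventions can be sketched directly. The following Python fragment is an illustration, not the way a generated lexer works internally: it tries every rule at the current position, keeps the longest match, and lets the earlier rule win a tie. The rule list `RULES` and the function `next_token` are assumed names. The patterns contain no alternation, so each greedy match is also that rule's longest match.

```python
import re

# Rules in priority order: the keyword rule is written before the
# identifier rule, so it wins when both match the same string.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
]


def next_token(src, pos=0):
    """Return (rule name, lexeme) chosen by longest match, then rule priority."""
    best_len, best_name = 0, None
    for name, pattern in RULES:
        m = pattern.match(src, pos)
        # Strictly longer only: on a tie the earlier rule is kept.
        if m and m.end() - pos > best_len:
            best_len, best_name = m.end() - pos, name
    if best_name is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best_name, src[pos:pos + best_len]


print(next_token("if"))     # both rules match "if"; rule priority picks IF
print(next_token("iffy"))   # ID matches a longer string, so longest match picks ID
```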

20
Lexer Generator History
  • Lexical analysis was once a performance
    bottleneck
  • certainly not true today!
  • As a result, early research investigated methods
    for efficient lexical analysis
  • While the performance concerns are largely
    irrelevant today, the tools resulting from this
    research are still in wide use

21
History: A long-standing goal
  • In this early period, a considerable amount of
    study went into the goal of creating an automatic
    compiler generator (aka compiler-compiler)

declarative compiler specification -> compiler
22
History: Unix and C
  • In the mid-1960s at Bell Labs, Ritchie and
    others were developing Unix
  • A key part of this project was the development of
    C and a compiler for it
  • Johnson, in 1968, proposed the use of finite
    state machines for lexical analysis and developed
    Lex [CACM 11(12), 1968]
  • read the accompanying paper on the course page
  • Lex realized a part of the compiler-compiler goal
    by automatically generating fast lexical analyzers

23
The Lex tool
  • The original Lex generated lexers written in C (C
    in C)
  • Today every major language has its own lex
    tool(s)
  • sml-lex, ocamllex, JLex, Clex, ...
  • Our topic next
  • sml-lex
  • concepts and techniques apply to other tools

24
SML-Lex Specification
  • Lexical specification consists of 3 parts (yet
    another programming language)

User Declarations (plain SML types, values, functions)
%%
SML-LEX Definitions (RE abbreviations, special stuff)
%%
Rules (association of REs with tokens)
      (each token will be represented in plain SML)
25
User Declarations
  • User Declarations
  • User can define various values that are available
    to the action fragments.
  • Two values must be defined in this section
  • type lexresult
  • type of the value returned by each rule action.
  • fun eof ()
  • called by lexer when end of input stream is
    reached. (EOF)

26
SML-LEX Definitions
  • ML-LEX Definitions
  • User can define regular expression abbreviations
  • Define multiple lexers to work together. Each is
    given a unique name.

digits = [0-9]+;
letter = [a-zA-Z];
%s lex1 lex2 lex3;
27
Rules
  • Rules
  • A rule consists of a pattern and an action
  • Pattern is a regular expression.
  • Action is a fragment of ordinary SML code.
  • Longest match and rule priority are used for
    disambiguation
  • Rules may be prefixed with the list of lexers
    that are allowed to use this rule.

<lexerList> regularExp => (action);
28
Rules
  • Rule actions can use any value defined in the
    User Declarations section, including
  • type lexresult
  • type of value returned by each rule action
  • val eof : unit -> lexresult
  • called by lexer when end of input stream reached
  • special variables
  • yytext: input substring matched by the regular
    expression
  • yypos: file position of the beginning of the matched
    string
  • continue (): doesn't return a token; recursively
    calls the lexer

29
Example 1
  • (* A language called Toy *)
  • prog   -> word prog
  •        ->
  • word   -> symbol
  •        -> number
  • symbol -> [_a-zA-Z][_0-9a-zA-Z]*
  • number -> [0-9]+

30
Example 1
  • (* Lexer Toy, see the accompanying code for detail *)
  • datatype token = Symbol of string * int
  •                | Number of string * int
  • exception End
  • type lexresult = unit
  • fun eof () = raise End
  • fun output x = ...
  • %%
  • letter = [_a-zA-Z];
  • digit = [0-9];
  • ld = {letter}|{digit};
  • symbol = {letter} {ld}*;
  • number = {digit}+;
  • %%
  • <INITIAL>{symbol} => (output (Symbol(yytext, yypos)));
  • <INITIAL>{number} => (output (Number(yytext, yypos)));

31
Example 2
  • (* Expression Language
  •  * C-style comment, i.e. /* */
  •  *)
  • prog -> stms
  • stms -> stm ; stms
  •      ->
  • stm  -> id = e
  •      -> print e
  • e    -> id
  •      -> num
  •      -> e bop e
  •      -> (e)
  • bop  -> + | - | * | /

32
Sample Program
  • x = 4;
  • y = 5;
  • z = x + y * 3;
  • print z;

33
Example 2
  • (* All terminals *)
  • prog -> stms
  • stms -> stm ; stms
  •      ->
  • stm  -> id = e
  •      -> print e
  • e    -> id
  •      -> num
  •      -> e bop e
  •      -> (e)
  • bop  -> + | - | * | /

34
Example 2 in Lex
  • (* Expression language, see the accompanying code
  •  * for detail.
  •  * Part 1: user code
  •  *)
  • datatype token
  •   = Id of string * int
  •   | Number of string * int
  •   | Print of string * int
  •   | Plus of string * int
  •   | ... (* all other stuffs *)
  • exception End
  • type lexresult = unit
  • fun eof () = raise End
  • fun output x = ...

35
Example 2 in Lex, cont
  • (* Expression language, see the accompanying code
  •  * for detail.
  •  * Part 2: lex definition
  •  *)
  • letter = [_a-zA-Z];
  • digit = [0-9];
  • ld = {letter}|{digit};
  • sym = {letter} {ld}*;
  • num = {digit}+;
  • ws = [\ \t];
  • nl = [\n];

36
Example 2 in Lex, cont
  • (* Expression language, see the accompanying code
  •  * for detail.
  •  * Part 3: rules
  •  *)
  • <INITIAL>{ws} => (continue ());
  • <INITIAL>{nl} => (continue ());
  • <INITIAL>"+"  => (output (Plus (yytext, yypos)));
  • <INITIAL>"-"  => (output (Minus (yytext, yypos)));
  • <INITIAL>"*"  => (output (Times (yytext, yypos)));
  • <INITIAL>"/"  => (output (Divide (yytext, yypos)));
  • <INITIAL>"("  => (output (Lparen (yytext, yypos)));
  • <INITIAL>")"  => (output (Rparen (yytext, yypos)));
  • <INITIAL>"="  => (output (Assign (yytext, yypos)));
  • <INITIAL>";"  => (output (Semi (yytext, yypos)));

37
Example 2 in Lex, cont
  • (* Expression language, see the accompanying code
  •  * for detail.
  •  * Part 3: rules, cont
  •  *)
  • <INITIAL>"print" => (output (Print(yytext, yypos)));
  • <INITIAL>{sym}   => (output (Id (yytext, yypos)));
  • <INITIAL>{num}   => (output (Number(yytext, yypos)));
  • <INITIAL>"/*"    => (YYBEGIN COMMENT; continue ());
  • <COMMENT>"*/"    => (YYBEGIN INITIAL; continue ());
  • <COMMENT>{nl}    => (continue ());
  • <COMMENT>.       => (continue ());
  • <INITIAL>.       => (error ());
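
The role of the two start states, INITIAL and COMMENT, can be seen in a hand-written analogue. The sketch below is in Python and is only an approximation of what the specification describes: the variable `state` plays the part that YYBEGIN sets, and the names `TOKEN`, `SINGLE`, and `lex` are assumptions for this illustration. Each token is returned as (kind, yytext, yypos).

```python
import re

TOKEN = re.compile(r"(?P<Print>print\b)|(?P<Id>[_a-zA-Z][_a-zA-Z0-9]*)|(?P<Number>[0-9]+)")
SINGLE = {"+": "Plus", "-": "Minus", "*": "Times", "/": "Divide",
          "(": "Lparen", ")": "Rparen", "=": "Assign", ";": "Semi"}


def lex(src):
    """Tokenize the expression language, skipping /* ... */ comments."""
    state = "INITIAL"                     # current start state
    pos, out = 0, []
    while pos < len(src):
        if state == "COMMENT":
            if src.startswith("*/", pos):
                state, pos = "INITIAL", pos + 2   # like YYBEGIN INITIAL
            else:
                pos += 1                          # any char or newline: continue ()
            continue
        if src.startswith("/*", pos):             # "/*" is longer than "/", so it wins
            state, pos = "COMMENT", pos + 2       # like YYBEGIN COMMENT
        elif src[pos] in " \t\n":
            pos += 1                              # whitespace: continue ()
        elif src[pos] in SINGLE:
            out.append((SINGLE[src[pos]], src[pos], pos))
            pos += 1
        else:
            m = TOKEN.match(src, pos)
            if m is None:
                raise ValueError(f"illegal character {src[pos]!r} at {pos}")
            out.append((m.lastgroup, m.group(), pos))
            pos = m.end()
    return out


for token in lex("x = 4; /* a note */ print x;"):
    print(token)
```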

38
Lex Implementation
  • Lex accepts regular expressions (along with
    others)
  • So SML-lex is a compiler from RE to a lexer
  • Internal
  • RE -> NFA -> DFA -> table-driven algo

39
Finite-state Automata (FA)
M = (Σ, S, q0, F, δ)
  • Σ: input alphabet
  • S: state set
  • q0: initial state
  • F: final states
  • δ: transition function
40
Transition functions
  • DFA
  • δ : S × Σ → S
  • NFA
  • δ : S × Σ → ℘(S)

41
DFA example
  • Which strings of a's and b's are accepted?
  • Transition function
  • (q0,a)→q1, (q0,b)→q0,
  • (q1,a)→q2, (q1,b)→q1,
  • (q2,a)→q2, (q2,b)→q2
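
One way to explore the question is to run the automaton. The sketch below encodes the transition function as a Python dictionary. The set of final states is not recoverable from the transcript, so the code assumes that q2 is the only final state. Under that assumption the accepted strings are exactly those containing at least two a's.

```python
# Transition function of the DFA, as (state, char) -> state.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q2", ("q1", "b"): "q1",
    ("q2", "a"): "q2", ("q2", "b"): "q2",
}


def accepts(s, start="q0", finals=frozenset({"q2"})):
    """Run the DFA on s; finals={"q2"} is an assumption, not given in the text."""
    state = start
    for ch in s:
        state = DELTA[(state, ch)]   # exactly one next state: deterministic
    return state in finals


for word in ["", "ab", "aa", "bab", "abba"]:
    print(repr(word), accepts(word))
```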

42
NFA example
  • Transition function
  • (q0,a)→{q0,q1}, (q0,b)→{q1}, (q1,a)→∅,
    (q1,b)→{q0,q1}
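
An NFA can be simulated by tracking the set of states it might be in. The sketch below does this in Python for the transition function above. The final states are not given in the transcript, so `accepts` takes them as a parameter, and the value `{"q1"}` used in the example is purely an assumption.

```python
# Transition function of the NFA, as (state, char) -> set of states.
DELTA = {
    ("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q1"},
    ("q1", "a"): set(),        ("q1", "b"): {"q0", "q1"},
}


def run(s, start="q0"):
    """Return the set of states the NFA may be in after reading s."""
    current = {start}
    for ch in s:
        # Union of the successors of every state we might be in.
        current = set().union(*(DELTA[(q, ch)] for q in current))
    return current


def accepts(s, finals):
    """Accept if at least one reachable state is final."""
    return bool(run(s) & set(finals))


print(run("a"))                  # q0 and q1 are both possible
print(run("ba"))                 # empty set: the NFA is stuck
print(accepts("bb", {"q1"}))     # finals={"q1"} is an assumption
```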

43
RE -> NFA: Thompson algorithm
  • Break RE down to atoms
  • construct small NFAs directly for atoms
  • inductively construct larger NFAs from small NFAs
  • Easy to implement
  • a small recursive algorithm

44
RE -> NFA: Thompson algorithm
  • e -> ε
  •   -> c
  •   -> e1 e2
  •   -> e1 | e2
  •   -> e1*

[Figure: NFA fragments for ε, c, and the concatenation e1 e2]
45
RE -> NFA: Thompson algorithm
  • e -> ε
  •   -> c
  •   -> e1 e2
  •   -> e1 | e2
  •   -> e1*

[Figure: NFA fragments for the alternation e1 | e2 and the closure e1*, glued together with ε edges]
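
The five cases translate almost line by line into a recursive function. The following Python sketch is one possible rendering: regular expressions are nested tuples, an edge label of `None` stands for ε, and the names `NFA`, `thompson`, `closure`, and `matches` are assumptions for this illustration. The small simulator at the end exists only to exercise the construction.

```python
from itertools import count


class NFA:
    def __init__(self):
        self.edges = {}            # state -> list of (label, target); None means ε
        self._ids = count()

    def new_state(self):
        s = next(self._ids)
        self.edges[s] = []
        return s

    def add(self, src, label, dst):
        self.edges[src].append((label, dst))


def thompson(e, nfa):
    """Build a fragment for e and return its (start, accept) states.
    e is ("eps",) | ("chr", c) | ("cat", e1, e2) | ("alt", e1, e2) | ("star", e1)."""
    kind = e[0]
    if kind == "cat":              # e1 e2: link accept of e1 to start of e2
        s1, a1 = thompson(e[1], nfa)
        s2, a2 = thompson(e[2], nfa)
        nfa.add(a1, None, s2)
        return s1, a2
    start, accept = nfa.new_state(), nfa.new_state()
    if kind == "eps":
        nfa.add(start, None, accept)
    elif kind == "chr":
        nfa.add(start, e[1], accept)
    elif kind == "alt":            # e1 | e2: branch into both, join afterwards
        for sub in (e[1], e[2]):
            s, a = thompson(sub, nfa)
            nfa.add(start, None, s)
            nfa.add(a, None, accept)
    elif kind == "star":           # e1*: skip it, or loop through it
        s, a = thompson(e[1], nfa)
        nfa.add(start, None, s)
        nfa.add(start, None, accept)
        nfa.add(a, None, s)
        nfa.add(a, None, accept)
    else:
        raise ValueError(f"unknown RE node: {kind}")
    return start, accept


def closure(nfa, states):
    """All states reachable from `states` through ε edges only."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for label, dst in nfa.edges[q]:
            if label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def matches(e, text):
    nfa = NFA()
    start, accept = thompson(e, nfa)
    current = closure(nfa, {start})
    for ch in text:
        moved = {dst for q in current for label, dst in nfa.edges[q] if label == ch}
        current = closure(nfa, moved)
    return accept in current


# a(a|b)*
RE = ("cat", ("chr", "a"), ("star", ("alt", ("chr", "a"), ("chr", "b"))))
print(matches(RE, "abba"), matches(RE, "b"))
```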
46
Example
  • letter = [_a-zA-Z];
  • digit = [0-9];
  • id = {letter} ({letter}|{digit})*;
  • <INITIAL>"if" => (IF (yytext, yypos));
  • <INITIAL>{id} => (Id (yytext, yypos));
  • (* Equivalent to
  •  *   "if" | {id}
  •  *)

47
Example
  • <INITIAL>"if" => (IF (yytext, yypos));
  • <INITIAL>{id} => (Id (yytext, yypos));

[Figure: NFA for "if" | {id}; the surviving edge labels are ε, i, and f]
48
NFA -> DFA: Subset construction algorithm
  • (* subset construction: workList algorithm *)
  • q0 <- ε-closure (n0)
  • Q <- {q0}
  • workList <- {q0}
  • while (workList != \phi)
  •     remove q from workList
  •     foreach (character c)
  •         t <- ε-closure (move (q, c))
  •         D[q, c] <- t
  •         if (t \not\in Q)
  •             add t to Q and workList
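
The worklist algorithm can be written out as follows. This Python sketch follows the pseudocode closely, with one deliberate difference: an empty target set is skipped instead of being added as an explicit error state. The sample NFA is a hand-built stand-in for the `if` / identifier rules, with a tiny alphabet (`if_x` for letters, `0` for digits) so the result stays readable. Its state numbering is an assumption chosen to be consistent with the state sets in the deck's running example.

```python
EPS = None   # label used for ε edges


def eps_closure(nfa, states):
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for s in nfa.get((q, EPS), ()):
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return frozenset(seen)


def move(nfa, states, c):
    return {t for q in states for t in nfa.get((q, c), ())}


def subset_construction(nfa, n0, alphabet):
    q0 = eps_closure(nfa, {n0})
    Q, D, work_list = {q0}, {}, [q0]
    while work_list:
        q = work_list.pop()
        for c in alphabet:
            t = eps_closure(nfa, move(nfa, q, c))
            if not t:
                continue            # no transition: leave D[q, c] undefined
            D[(q, c)] = t
            if t not in Q:
                Q.add(t)
                work_list.append(t)
    return q0, Q, D


LETTERS, DIGITS = "if_x", "0"       # tiny stand-ins for [_a-zA-Z] and [0-9]
NFA = {(0, EPS): {1, 5}, (1, "i"): {2}, (2, EPS): {3}, (3, "f"): {4},
       (6, EPS): {7}, (7, EPS): {8}}
for c in LETTERS:
    NFA[(5, c)] = {6}               # first character of an identifier
for c in LETTERS + DIGITS:
    NFA[(7, c)] = {7}               # remaining characters of an identifier

q0, Q, D = subset_construction(NFA, 0, LETTERS + DIGITS)
print(sorted(q0), len(Q))
print(sorted(D[(q0, "i")]), sorted(D[(q0, "_")]))
```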

49
NFA -> DFA: ε-closure
  • (* ε-closure: fixpoint algorithm *)
  • (* Dragon Fig 3.33 gives a DFS-like algorithm.
  •  * Here we give a recursive version. (Simpler)
  •  *)
  • X <- \phi
  • fun eps (t) =
  •     X <- X ∪ {t}
  •     foreach (s \in one-eps(t))
  •         if (s \not\in X)
  •         then eps (s)

50
NFA -> DFA: ε-closure
  • (* ε-closure: fixpoint algorithm *)
  • (* Dragon Fig 3.33 gives a DFS-like algorithm.
  •  * Here we give a recursive version. (Simpler)
  •  *)
  • fun e-closure (T) =
  •     X <- T
  •     foreach (t \in T)
  •         X <- X ∪ eps(t)

51
NFA -> DFA: ε-closure
  • (* ε-closure: fixpoint algorithm *)
  • (* A BFS-like algorithm.
  •  *)
  • X <- empty
  • fun e-closure (T) =
  •     Q <- T
  •     X <- T
  •     while (Q not empty)
  •         q <- deQueue (Q)
  •         foreach (s \in one-eps(q))
  •             if (s \not\in X)
  •                 enQueue (Q, s)
  •                 X <- X ∪ {s}
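
Both formulations compute the same set, which is easy to check side by side. In the Python sketch below, `one_eps` is assumed to be a dictionary from a state to the states reachable over a single ε edge. The function names and the sample graph are illustrative only.

```python
from collections import deque


def e_closure_rec(one_eps, T):
    """Recursive version: visit each state once, following ε edges."""
    X = set()

    def eps(t):
        X.add(t)
        for s in one_eps.get(t, ()):
            if s not in X:
                eps(s)

    for t in T:
        if t not in X:
            eps(t)
    return X


def e_closure_bfs(one_eps, T):
    """BFS-like version: a queue of states whose ε edges are still unexplored."""
    X, Q = set(T), deque(T)
    while Q:
        q = Q.popleft()
        for s in one_eps.get(q, ()):
            if s not in X:
                Q.append(s)
                X.add(s)
    return X


# Sample ε edges, including a cycle (8 -> 6) to show that both versions terminate.
ONE_EPS = {0: [1, 5], 2: [3], 6: [7], 7: [8], 8: [6]}
print(sorted(e_closure_rec(ONE_EPS, {0})), sorted(e_closure_bfs(ONE_EPS, {2, 6})))
```

The recursive version is shorter, while the queue-based one avoids deep recursion on long ε chains.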

52
Example
  • <INITIAL>"if" => (IF (yytext, yypos));
  • <INITIAL>{id} => (Id (yytext, yypos));

[Figure: NFA for the two rules, with states numbered from 0; edge labels ε, i, f, [_a-zA-Z], [_a-zA-Z0-9]]
53
Example
  • q0 = {0, 1, 5}                Q = {q0}
  • D[q0, i] = {2, 3, 6, 7, 8}    Q ∪= q1
  • D[q0, _] = {6, 7, 8}          Q ∪= q2
  • D[q1, f] = {4, 7, 8}          Q ∪= q3

[Figure: the NFA, with the DFA built so far: q0 -i-> q1, q0 -_-> q2, q1 -f-> q3]
54
Example
  • D[q1, _] = {7, 8}    Q ∪= q4
  • D[q2, _] = {7, 8}    Q
  • D[q3, _] = {7, 8}    Q
  • D[q4, _] = {7, 8}    Q

[Figure: the NFA, with the DFA built so far: q0 -i-> q1, q0 -_-> q2, q1 -f-> q3, and _ edges from q1, q2, q3, and q4 to q4]
55
Example
  • q0 = {0, 1, 5}    q1 = {2, 3, 6, 7, 8}
  • q2 = {6, 7, 8}    q3 = {4, 7, 8}    q4 = {7, 8}

[Figure: the NFA, with the resulting DFA: q0 -i-> q1, q0 -(letter-i)-> q2, q1 -f-> q3, q1 -(ld-f)-> q4, q2 -ld-> q4, q3 -ld-> q4, q4 -ld-> q4]
56
Example
  • q0 = {0, 1, 5}    q1 = {2, 3, 6, 7, 8}
  • q2 = {6, 7, 8}    q3 = {4, 7, 8}    q4 = {7, 8}

[Figure: the NFA, with the resulting DFA: q0 -i-> q1, q0 -(letter-i)-> q2, q1 -f-> q3, q1 -(ld-f)-> q4, q2 -ld-> q4, q3 -ld-> q4, q4 -ld-> q4]
57
Table-driven Algorithm
  • Conceptually, an FA is a directed graph
  • Pragmatically, many different strategies to
    encode an FA
  • Matrix (adjacency matrix)
  • sml-lex
  • Array of list (adjacency list)
  • Hash table
  • Jump table (switch statements)
  • flex
  • Balance between time and space

58
Example
<INITIAL>"if" => (IF (yytext, yypos));
<INITIAL>{id} => (Id (yytext, yypos));

state\char   i     f     letter-i-f   other
q0           q1    q2    q2           error
q1           q4    q3    q4           error
q2           q4    q4    q4           error
q3           q4    q4    q4           error
q4           q4    q4    q4           error

state    q0   q1   q2   q3   q4
action   -    Id   Id   IF   Id

[Figure: the DFA: q0 -i-> q1, q0 -(letter-i)-> q2, q1 -f-> q3, q1 -(ld-f)-> q4, q2 -ld-> q4, q3 -ld-> q4, q4 -ld-> q4]
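
The matrix encoding can be exercised with a short driver loop. The Python sketch below copies the transition table and the action table above into dictionaries and scans for the longest token, remembering the last state that had an action. The names `char_class`, `TABLE`, `ACTION`, and `longest_token` are assumptions for this illustration. As in the table, every character in the "other" column, digits included, leads to the error entry.

```python
def char_class(c):
    """Map a character to a column of the table."""
    if c in ("i", "f"):
        return c
    if c == "_" or "a" <= c <= "z" or "A" <= c <= "Z":
        return "letter-i-f"
    return "other"


ALL_Q4 = {"i": "q4", "f": "q4", "letter-i-f": "q4"}
TABLE = {                        # a missing entry plays the role of "error"
    "q0": {"i": "q1", "f": "q2", "letter-i-f": "q2"},
    "q1": {"i": "q4", "f": "q3", "letter-i-f": "q4"},
    "q2": ALL_Q4, "q3": ALL_Q4, "q4": ALL_Q4,
}
ACTION = {"q1": "Id", "q2": "Id", "q3": "IF", "q4": "Id"}   # q0 has no action


def longest_token(src, pos=0):
    """Return (action, lexeme) for the longest match at pos, or None."""
    state, last, i = "q0", None, pos
    while i < len(src):
        nxt = TABLE[state].get(char_class(src[i]))
        if nxt is None:          # error entry: stop and fall back to `last`
            break
        state, i = nxt, i + 1
        if state in ACTION:      # remember the most recent accepting point
            last = (ACTION[state], src[pos:i])
    return last


for text in ["if", "iffy", "i", "if+x", "+"]:
    print(repr(text), longest_token(text))
```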
59
DFA Minimization: Hopcroft's Algorithm
[Figure: the DFA: q0 -i-> q1, q0 -(letter-i)-> q2, q1 -f-> q3, q1 -(ld-f)-> q4, q2 -ld-> q4, q3 -ld-> q4, q4 -ld-> q4]
state    q0   q1   q2   q3   q4
action   -    Id   Id   IF   Id
60
DFA Minimization: Hopcroft's Algorithm
[Figure: the DFA: q0 -i-> q1, q0 -(letter-i)-> q2, q1 -f-> q3, q1 -(ld-f)-> q4, q2 -ld-> q4, q3 -ld-> q4, q4 -ld-> q4]
state    q0   q1   q2   q3   q4
action   -    Id   Id   IF   Id
61
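DFA Minimization: Hopcroft's Algorithm
[Figure: the minimized DFA: q0 -i-> q1, q0 -(letter-i)-> {q2, q4}, q1 -f-> q3, q1 -(ld-f)-> {q2, q4}, q3 -ld-> {q2, q4}, {q2, q4} -ld-> {q2, q4}]
state    q0   q1   {q2, q4}   q3
action   -    Id   Id         IF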
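
The merge of q2 and q4 can be reproduced with a short partition-refinement loop. Note that the Python sketch below is the simpler Moore-style refinement, not Hopcroft's worklist algorithm named on the slides. It reaches the same minimal partition but makes no attempt at Hopcroft's efficiency. The function `minimize` and the encoding of the DFA are assumptions for this illustration. States start out grouped by their action and are split until states in one group agree on the group of every successor.

```python
def minimize(states, symbols, delta, action):
    """Return the set of equivalence classes of DFA states."""
    # Initial partition: states with the same action (or no action) together.
    block = {q: ("start", action.get(q)) for q in states}
    while True:
        # A state's signature: its current block plus the blocks it moves to.
        refined = {
            q: (block[q],
                tuple(block[delta[(q, c)]] if (q, c) in delta else "error"
                      for c in symbols))
            for q in states
        }
        if len(set(refined.values())) == len(set(block.values())):
            break                # no block was split: the partition is stable
        block = refined
    groups = {}
    for q in states:
        groups.setdefault(block[q], set()).add(q)
    return {frozenset(g) for g in groups.values()}


STATES = ["q0", "q1", "q2", "q3", "q4"]
SYMBOLS = ["i", "f", "letter-i-f"]
DELTA = {("q0", "i"): "q1", ("q0", "f"): "q2", ("q0", "letter-i-f"): "q2",
         ("q1", "i"): "q4", ("q1", "f"): "q3", ("q1", "letter-i-f"): "q4"}
for q in ("q2", "q3", "q4"):
    for c in SYMBOLS:
        DELTA[(q, c)] = "q4"
ACTION = {"q1": "Id", "q2": "Id", "q3": "IF", "q4": "Id"}

for group in sorted(minimize(STATES, SYMBOLS, DELTA, ACTION), key=sorted):
    print(sorted(group))
```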
62
Summary
  • A Lexer
  • input stream of characters
  • output stream of tokens
  • Writing lexers by hand is boring, so we use a
    lexer generator ml-lex
  • RE -> NFA -> DFA -> table-driven algo
  • Moral: don't underestimate your theory classes!
  • great application of cool theory developed in
    mathematics.
  • we'll see more cool apps as the course progresses